Tutel: Adaptive Mixture-of-Experts at Scale

Changho Hwang; Wei Cui; Yifan Xiong; Ziyue Yang; Ze Liu; Han Hu; Zilong Wang; Rafael Salas; Jithin Jose; Prabhat Ram; Joe Chau; Peng Cheng; Fan Yang; Mao Yang; Yongqiang Xiong

In recent years, Mixture-of-Experts (MoE) has emerged as a promising deep learning technique that can scale model capacity to trillion-plus parameters while reducing computing cost via sparse computation. While MoE opens a new frontier of exceedingly large models, its implementation over thousands of GPUs has been limited by a mismatch between the dynamic nature of MoE and the static parallelism/pipelining of the system. We present Tutel, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Tutel delivers adaptive parallelism switching and adaptive pipelining at runtime, which achieve up to 1.74x and 2.00x single MoE layer speedup, respectively. We also propose a novel two-dimensional hierarchical algorithm for MoE communication speedup that outperforms the previous state of the art by up to 20.7x on 2,048 GPUs. Aggregating all techniques, Tutel delivers 4.96x and 5.75x speedup of a single MoE layer on 16 GPUs and 2,048 GPUs, respectively, over Fairseq: Meta's Facebook AI Research Sequence-to-Sequence Toolkit (Tutel is now partially adopted by Fairseq). Tutel source code is publicly available at https://github.com/microsoft/tutel . Our evaluation shows that Tutel efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Tutel accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating the readiness of Tutel for end-to-end real-world model training and inference. SwinV2-MoE is open-sourced at https://github.com/microsoft/Swin-Transformer .
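
As a rough illustration of the sparse computation the abstract refers to, the sketch below routes each token to only its top-k experts out of many, so most experts do no work for a given token. It is a toy in pure Python, not Tutel's API or implementation: the names `route_top_k`, `moe_layer`, `make_expert`, and `demo`, the softmax gate, and the scalar "experts" are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, gate_weights, k):
    """Return [(expert_index, gate_prob)] for the k highest-scoring experts."""
    # One gate score per expert: dot product of the token with that expert's row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_layer(tokens, gate_weights, experts, k=1):
    """Apply a sparse MoE layer; return outputs and the per-expert token count."""
    load = [0] * len(experts)
    outputs = []
    for token in tokens:
        out = [0.0] * len(token)
        # Only the k selected experts run for this token (sparse computation).
        for idx, prob in route_top_k(token, gate_weights, k):
            load[idx] += 1
            expert_out = experts[idx](token)
            out = [o + prob * e for o, e in zip(out, expert_out)]
        outputs.append(out)
    return outputs, load


def make_expert(scale):
    """Hypothetical stand-in for a feed-forward expert network."""
    return lambda token: [scale * x for x in token]


def demo(num_tokens=32, dim=4, num_experts=8, k=1, seed=0):
    """Run the toy layer on random tokens with a random gate."""
    rng = random.Random(seed)
    gate_weights = [[rng.gauss(0, 1) for _ in range(dim)]
                    for _ in range(num_experts)]
    experts = [make_expert(i + 1) for i in range(num_experts)]
    tokens = [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(num_tokens)]
    return moe_layer(tokens, gate_weights, experts, k)


if __name__ == "__main__":
    outputs, load = demo()
    print("tokens per expert:", load)
```

The per-expert token counts depend on the input batch, so they generally differ from batch to batch. This is one way to picture the dynamic workload that the abstract contrasts with static parallelism and pipelining.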

updated: Tue Jun 07 2022 15:20:20 GMT+0000 (UTC)

published: Tue Jun 07 2022 15:20:20 GMT+0000 (UTC)

arXiv
