Tutel: Adaptive Mixture-of-Experts at Scale

Changho Hwang; Wei Cui; Yifan Xiong; Ziyue Yang; Ze Liu; Han Hu; Zilong Wang; Rafael Salas; Jithin Jose; Prabhat Ram; Joe Chau; Peng Cheng; Fan Yang; Mao Yang; Yongqiang Xiong

Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters at a fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism, which forwards each input token to the right sub-models, or experts. Because token routing determines the amount of expert workload dynamically at runtime, existing systems suffer from inefficient computation: their execution is static, namely static parallelism and pipelining, and does not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism or pipelining methods without any mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism and pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. With all techniques combined, Flex delivers large speedups at any scale: 4.96x and 5.75x for a single MoE layer on 16 and 2,048 A100 GPUs, respectively, over the previous state of the art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup over Fairseq in training and inference, respectively. On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating the readiness of Flex for end-to-end real-world model training and inference.
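
The dynamic-workload point in the abstract can be illustrated with a small sketch. It is not taken from the paper or its code: it assumes plain top-1 gating, uses random numbers in place of a learned gate, and the helper names (`route_top1`, `expert_load`, `make_batch`) are hypothetical. It only shows that the number of tokens each expert receives changes from batch to batch, which is the situation a fixed parallelism or pipelining plan cannot adapt to.

```python
import random
from collections import Counter


def route_top1(gate_scores):
    """For each token, return the index of its highest-scoring expert.

    gate_scores is a list of per-token score lists, one score per expert.
    Top-1 gating is an assumption made for this illustration.
    """
    return [max(range(len(scores)), key=scores.__getitem__) for scores in gate_scores]


def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(expert, 0) for expert in range(num_experts)]


def make_batch(num_tokens, num_experts, seed):
    """Random stand-in for the output of a learned gating network."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]


if __name__ == "__main__":
    num_tokens, num_experts = 16, 4
    for seed in (0, 1, 2):
        batch = make_batch(num_tokens, num_experts, seed)
        load = expert_load(route_top1(batch), num_experts)
        # The per-expert load differs across batches even though the
        # total number of tokens is fixed.
        print(f"batch {seed}: tokens per expert = {load}")
```

Each batch always distributes the same total number of tokens, but the split across experts is only known after the gate has run.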

updated: Mon Jun 05 2023 15:05:24 GMT+0000 (UTC)

published: Tue Jun 07 2022 15:20:20 GMT+0000 (UTC)

arXiv
