Foundation Transformers

Hongyu Wang; Shuming Ma; Shaohan Huang; Li Dong; Wenhui Wang; Zhiliang Peng; Yu Wu; Payal Bajaj; Saksham Singhal; Alon Benhaim; Barun Patra; Zhun Liu; Vishrav Chaudhary; Xia Song; Furu Wei

A big convergence of model architectures across language, vision, speech, and multimodal tasks is emerging. However, under the same name "Transformers", these areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of a Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill this goal. Specifically, we propose Sub-LayerNorm for good expressivity, and an initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability compared with the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
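The abstract names Sub-LayerNorm and a DeepNet-derived initialization without spelling out either, so the following is only a minimal sketch under stated assumptions. It assumes that Sub-LayerNorm places a second LayerNorm before a sublayer's output projection, in addition to the one at its input, and that the depth-dependent gain for an encoder-only or decoder-only stack of N layers is sqrt(log(2N)). The helper names (`subln_ffn`, `assumed_gain`, `xavier_normal`) are hypothetical, as are the ReLU activation and the toy dimensions.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def xavier_normal(rows, cols, gain, rng):
    """Xavier-normal init: std = gain * sqrt(2 / (fan_in + fan_out))."""
    std = gain * math.sqrt(2.0 / (rows + cols))
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def subln_ffn(x, w_in, w_out):
    """Residual feed-forward sublayer with a LayerNorm before each projection (assumed layout)."""
    h = matvec(w_in, layer_norm(x))             # LN before the input projection
    h = [max(0.0, v) for v in h]                # ReLU, chosen only for illustration
    out = matvec(w_out, layer_norm(h))          # extra LN before the output projection
    return [xi + oi for xi, oi in zip(x, out)]  # residual connection

def assumed_gain(num_layers):
    """Assumed gain for an encoder-only or decoder-only stack: sqrt(log(2N))."""
    return math.sqrt(math.log(2 * num_layers))

rng = random.Random(0)
d_model, d_ff, num_layers = 4, 8, 12            # toy sizes, not from the paper
gamma = assumed_gain(num_layers)
w_in = xavier_normal(d_ff, d_model, gamma, rng)
w_out = xavier_normal(d_model, d_ff, gamma, rng)
y = subln_ffn([1.0, 2.0, 3.0, 4.0], w_in, w_out)
print(len(y), gamma)
```

The sketch covers only a feed-forward sublayer. Under the same assumption, an attention sublayer would apply the extra LayerNorm to the attention output before its output projection.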

updated: Wed Oct 19 2022 11:03:35 GMT+0000 (UTC)

published: Wed Oct 12 2022 17:16:27 GMT+0000 (UTC)

arXiv
