Deeper vs Wider: A Revisit of Transformer Configuration

Fuzhao Xue; Jianghai Chen; Aixin Sun; Xiaozhe Ren; Zangwei Zheng; Xiaoxin He; Xin Jiang; Yang You

Transformer-based models have delivered impressive results on many tasks, particularly in vision and language. Most training setups adopt conventional configurations: for example, a base model is typically given a hidden dimension (i.e., model width) of 768 and 12 transformer layers (i.e., model depth). In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with this simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models such as MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default setting by 1.1 points on average on the GLUE datasets.
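
To make the width/depth trade-off concrete, the following is a minimal sketch that estimates the weight count of a transformer encoder stack from its width and depth. The function name `encoder_params`, the MLP expansion ratio of 4, and the 512-wide, 27-layer setting are assumptions made for illustration; the abstract does not state Bamboo's actual configuration, and the estimate ignores biases, layer norms, and embeddings.

```python
def encoder_params(width: int, depth: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of a transformer encoder stack.

    Per layer, counts the four attention projections (Q, K, V, output),
    i.e. 4 * width^2, plus the two MLP matrices, i.e. 2 * mlp_ratio * width^2.
    Biases, layer norms, and embeddings are ignored.
    """
    attention = 4 * width * width
    mlp = 2 * mlp_ratio * width * width
    return depth * (attention + mlp)


# Conventional base configuration mentioned in the abstract.
base = encoder_params(width=768, depth=12)

# Hypothetical deeper-and-narrower setting, chosen for illustration only;
# it is NOT the configuration reported in the paper.
deep_narrow = encoder_params(width=512, depth=27)

print(base, deep_narrow)
```

Under these assumptions both settings come to the same estimate (about 85M weights), because per-layer cost grows with the square of the width: shrinking the width by a factor of 1.5 frees enough budget for 2.25 times as many layers.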

updated: Tue May 24 2022 08:03:25 GMT+0000 (UTC)

published: Sat May 21 2022 05:17:11 GMT+0000 (UTC)

arXiv
