A Study on Transformer Configuration and Training Objective

Fuzhao Xue; Jianghai Chen; Aixin Sun; Xiaozhe Ren; Zangwei Zheng; Xiaoxin He; Yongming Chen; Xin Jiang; Yang You

Transformer-based models have delivered impressive results on many tasks, particularly in vision and language. In many training setups, conventional configurations are adopted by default. For example, the base model is often given a hidden dimension (i.e. model width) of 768 and 12 transformer layers (i.e. model depth). In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with this simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models such as MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default setting by 1.1 points on average on the GLUE datasets.
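To make the width/depth trade-off concrete, the following is a minimal sketch that estimates the weight count of a stack of standard transformer encoder layers. The function name `encoder_params`, the MLP expansion ratio of 4, and the deeper, narrower shape (width 512, depth 27) are assumptions chosen for illustration; they are not the Bamboo configurations reported in the paper. Only the conventional shape (width 768, depth 12) comes from the abstract.

```python
def encoder_params(width: int, depth: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of a stack of transformer encoder layers.

    Counts only the four attention projections (Q, K, V, output) and the
    two MLP matrices. Biases, layer norms, and embeddings are ignored.
    """
    attention = 4 * width * width            # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width      # up-projection and down-projection
    return depth * (attention + mlp)


# Conventional base shape mentioned in the abstract.
conventional = encoder_params(width=768, depth=12)

# Hypothetical deeper, narrower shape (illustration only, not from the paper).
deep_narrow = encoder_params(width=512, depth=27)

print(conventional, deep_narrow)
```

Under these assumptions both shapes have the same approximate weight count, because per-layer weights scale with the square of the width: shrinking the width by a factor of 1.5 frees enough budget for 2.25 times as many layers.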

updated: Thu May 18 2023 16:08:10 GMT+0000 (UTC)

published: Sat May 21 2022 05:17:11 GMT+0000 (UTC)

arXiv
