DropDim: A Regularization Method for Transformer Networks

Hao Zhang; Dan Qu; Keji Shao; Xukui Yang

We introduce DropDim, a structured dropout method designed to regularize the self-attention mechanism, a key component of the transformer. In contrast to general dropout, which drops neurons at random, DropDim drops part of the embedding dimensions, so the semantic information carried by those dimensions is discarded completely. This breaks excessive co-adaptation between different embedding dimensions and forces self-attention to encode meaningful features with a certain number of embedding dimensions erased. Experiments on a wide range of tasks on the MuST-C English-German dataset show that DropDim effectively improves model performance, reduces overfitting, and complements other regularization methods. When combined with label smoothing, it reduces the WER from 19.1% to 15.1% on the ASR task and raises the BLEU score from 26.90 to 28.38 on the MT task. On the ST task, the model reaches a BLEU score of 22.99, an increase of 1.86 BLEU points over the strong baseline.
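
To make the contrast with ordinary dropout concrete, here is a minimal NumPy sketch of dropping whole embedding dimensions. It is an illustration under stated assumptions, not the authors' implementation: the function name `drop_dim`, the `(batch, seq_len, d_model)` layout, the independent per-sequence mask, and the inverted-dropout rescaling by `1 / (1 - p)` are all choices made here, since the abstract does not specify them.

```python
import numpy as np


def drop_dim(x, p=0.1, rng=None, training=True):
    """Zero out whole embedding dimensions of x (hypothetical helper).

    x is assumed to have shape (batch, seq_len, d_model). A dropped
    dimension is erased at every position of a sequence, whereas ordinary
    dropout would zero individual entries independently.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    batch, _, d_model = x.shape
    # One keep/drop decision per (sequence, embedding dimension); the
    # singleton middle axis broadcasts the decision over all positions.
    keep = rng.random((batch, 1, d_model)) >= p
    # Assumption: rescale kept dimensions as in inverted dropout so the
    # expected activation is unchanged.
    return x * keep / (1.0 - p)


if __name__ == "__main__":
    x = np.ones((2, 4, 6))
    y = drop_dim(x, p=0.5, rng=np.random.default_rng(0))
    # Each column of a sequence is either entirely zero or entirely kept.
    print(y[0])
```

Sharing one mask across all positions is what makes the dropout structured: the model cannot recover an erased dimension from a neighbouring time step, which matches the abstract's claim that the information in that dimension is discarded completely.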

updated: Thu Apr 20 2023 13:54:18 GMT+0000 (UTC)

published: Thu Apr 20 2023 13:54:18 GMT+0000 (UTC)

arXiv
