Layer-wise Shared Attention Network on Dynamical System Perspective

Zhongzhan Huang; Senwei Liang; Mingfu Liang; Weiling He; Liang Lin

Attention networks have successfully boosted accuracy in various vision problems. Previous works emphasize designing new self-attention modules and follow the traditional paradigm of plugging a separate module into each layer of a network. However, under this paradigm the extra parameter cost inevitably grows with the number of layers. From the dynamical system perspective of the residual neural network, we find that the feature maps from the layers of the same stage are homogeneous, which inspires us to propose a novel and simple framework, called the dense and implicit attention (DIA) unit, that shares a single attention module across different network layers. With our framework, the parameter cost is independent of the number of layers, and we further improve the accuracy of existing popular self-attention modules while significantly reducing their parameters, without any elaborate model crafting. Extensive experiments on benchmark datasets show that DIA is capable of emphasizing layer-wise feature interrelation and thus leads to significant improvement in various vision tasks, including image classification, object detection, and medical applications. Furthermore, the effectiveness of the DIA unit is demonstrated by novel experiments in which we destabilize model training by (1) removing the skip connections of the residual neural network, (2) removing the batch normalization of the model, and (3) removing all data augmentation during training. In these cases, we verify that DIA has a strong regularization ability that stabilizes training, i.e., the dense and implicit connections formed by our method can effectively recover and enhance the information communication across layers and the value of the gradient, and thus alleviate training instability.
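The parameter-sharing idea in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not specify the attention module, so a small squeeze-and-excitation-style channel gate is assumed here as a stand-in, and the names `make_attention_params`, `channel_attention`, and `residual_stage`, the `tanh` placeholder branch, and the sizes `CHANNELS`, `REDUCTION`, and `NUM_LAYERS` are all hypothetical. The sketch only shows why the attention parameter count stops depending on depth once one module is reused by every layer of a stage.

```python
import math
import random


def make_attention_params(channels, reduction, rng):
    """Weights for a small channel gate (assumed stand-in, not the paper's module)."""
    hidden = max(1, channels // reduction)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(channels)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(channels)]
    return {"w1": w1, "w2": w2}


def count_params(params):
    """Total number of scalar weights in one attention module."""
    return sum(len(row) for matrix in params.values() for row in matrix)


def channel_attention(params, descriptor):
    """Map a per-channel descriptor to per-channel gates in (0, 1)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, descriptor)))
              for row in params["w1"]]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in params["w2"]]


def residual_stage(x, num_layers, attention_for_layer):
    """Run a toy residual stage; attention_for_layer(i) picks the module for layer i."""
    for i in range(num_layers):
        branch = [math.tanh(v) for v in x]  # placeholder for the layer's residual branch
        gates = channel_attention(attention_for_layer(i), branch)
        x = [xi + g * b for xi, g, b in zip(x, gates, branch)]
    return x


rng = random.Random(0)
CHANNELS, REDUCTION, NUM_LAYERS = 16, 4, 9  # hypothetical sizes

# Traditional paradigm: one attention module per layer.
per_layer = [make_attention_params(CHANNELS, REDUCTION, rng) for _ in range(NUM_LAYERS)]
# Shared paradigm: a single module reused by every layer of the stage.
shared = make_attention_params(CHANNELS, REDUCTION, rng)

per_layer_cost = sum(count_params(p) for p in per_layer)
shared_cost = count_params(shared)

x0 = [0.5] * CHANNELS
out_per_layer = residual_stage(x0, NUM_LAYERS, lambda i: per_layer[i])
out_shared = residual_stage(x0, NUM_LAYERS, lambda i: shared)

print(per_layer_cost, shared_cost)  # prints: 1152 128
```

With these assumed sizes, the per-layer variant costs `NUM_LAYERS` times the parameters of a single module, while the shared variant stays at one module's cost no matter how many layers the stage has.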

updated: Thu Oct 27 2022 13:24:08 GMT+0000 (UTC)

published: Thu Oct 27 2022 13:24:08 GMT+0000 (UTC)

arXiv
