Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets

Xiangyu Chen; Qinghao Hu; Kaidong Li; Cuncong Zhong; Guanghui Wang

Vision Transformers have demonstrated competitive performance on computer vision tasks, benefiting from their ability to capture long-range dependencies with multi-head self-attention modules and multi-layer perceptrons. However, computing global attention brings a disadvantage compared with convolutional neural networks: it requires much more data and computation to converge, which makes it difficult to generalize well on small datasets, a common situation in practical applications. Previous works focus either on transferring knowledge from large datasets or on adjusting the structure for small datasets. After carefully examining the self-attention modules, we discover that trivial attention weights far outnumber the important ones and that, because of their large quantity, the accumulated trivial weights dominate the attention in Vision Transformers, a problem that attention itself does not handle. This masks useful non-trivial attention and harms performance when the trivial attention contains more noise, e.g. in shallow layers for some backbones. To solve this issue, we propose to divide attention weights into trivial and non-trivial ones by thresholds, then suppress the accumulated trivial attention weights (Suppressing Accumulated Trivial Attention, SATA) with the proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on CIFAR-100 and Tiny-ImageNet datasets show that our suppression method boosts the accuracy of Vision Transformers by up to 2.3%. Code is available at https://github.com/xiangyu8/SATA.
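
The accumulation effect described in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's TWIST transformation, whose exact form is not given in the abstract; the function suppress_trivial, the threshold value, and the max_trivial_mass cap are all hypothetical choices made for illustration. The sketch shows one attention row in which many individually small weights together outweigh the few large ones, and one simple way to cap their summed mass before renormalizing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_trivial(weights, threshold, max_trivial_mass):
    """Hypothetical suppression rule (an assumption, not the paper's TWIST).

    Weights below `threshold` are treated as trivial. If their summed mass
    exceeds `max_trivial_mass`, they are scaled down so that their sum equals
    that cap, and the whole row is then renormalized to sum to 1.
    """
    trivial_mass = sum(w for w in weights if w < threshold)
    if trivial_mass == 0 or trivial_mass <= max_trivial_mass:
        return list(weights)
    scale = max_trivial_mass / trivial_mass
    adjusted = [w * scale if w < threshold else w for w in weights]
    total = sum(adjusted)
    if total == 0:
        # Every weight was trivial and the cap is zero: leave the row as-is.
        return list(weights)
    return [w / total for w in adjusted]


# One query attending to 64 tokens: 2 relevant (score 3.0), 62 irrelevant (score 0.0).
scores = [3.0, 3.0] + [0.0] * 62
weights = softmax(scores)
suppressed = suppress_trivial(weights, threshold=0.05, max_trivial_mass=0.1)

print("trivial mass before:", sum(w for w in weights if w < 0.05))
print("trivial mass after: ", sum(suppressed[2:]))
```

In this toy row each irrelevant token receives a weight far below that of a relevant token, yet the 62 irrelevant tokens together hold more than half of the attention mass; after the cap and renormalization, the two relevant tokens hold most of it.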

updated: Sat Oct 22 2022 02:34:17 GMT+0000 (UTC)

published: Sat Oct 22 2022 02:34:17 GMT+0000 (UTC)

arXiv
