How Do Vision Transformers Work?

Namuk Park; Songkuk Kim

The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at https://github.com/xxxnell/how-do-vits-work.
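The filtering claim in point (2) can be illustrated with a small toy experiment. The sketch below is not taken from the paper or its repository; it is a minimal NumPy illustration under stated assumptions. The helper names (`self_attention`, `high_freq_fraction`), the circular positional embedding used as both queries and keys, the hand-picked Laplacian kernel standing in for a high-pass Conv, and the frequency cutoff of 8 are all hypothetical choices. The paper's claim concerns trained networks, whereas this example only shows why a softmax-weighted average, whose weights are positive and sum to one, tends to smooth a signal.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def self_attention(queries, keys, values):
    # Single-head scaled dot-product attention without learned projections.
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

def high_freq_fraction(signal, cutoff):
    # Share of the (mean-removed) signal power at frequency bins >= cutoff.
    spectrum = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    return spectrum[cutoff:].sum() / spectrum.sum()

n_tokens = 64
positions = 2 * np.pi * np.arange(n_tokens) / n_tokens

# Assumption: queries and keys are a circular positional embedding, so each
# token attends most strongly to its neighbours on the ring.
pos_embed = 4.0 * np.stack([np.cos(positions), np.sin(positions)], axis=1)

# Token values: one slow sine wave plus white noise.
rng = np.random.default_rng(0)
signal = np.sin(positions) + 0.5 * rng.standard_normal(n_tokens)

attended, weights = self_attention(pos_embed, pos_embed, signal[:, None])
attended = attended[:, 0]

# Assumption: a fixed Laplacian kernel as a hand-picked high-pass convolution,
# applied with circular padding so the output keeps n_tokens entries.
laplacian = np.array([-1.0, 2.0, -1.0])
convolved = np.convolve(np.pad(signal, 1, mode="wrap"), laplacian, mode="valid")

cutoff = 8
hf_in = high_freq_fraction(signal, cutoff)
hf_msa = high_freq_fraction(attended, cutoff)
hf_conv = high_freq_fraction(convolved, cutoff)

print(f"high-frequency share  input: {hf_in:.3f}  "
      f"attention: {hf_msa:.3f}  laplacian conv: {hf_conv:.3f}")
```

In this setup the attention output carries a smaller share of high-frequency power than the input, while the Laplacian convolution carries a larger share. A learned convolution is not bound to be high-pass, so the kernel here only stands in for the behavior the abstract attributes to Convs.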

updated: Wed May 11 2022 16:51:05 GMT+0000 (UTC)

published: Mon Feb 14 2022 13:58:43 GMT+0000 (UTC)

arXiv
