ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Stéphane d'Ascoli; Hugo Touvron; Matthew Leavitt; Ari Morcos; Giulio Biroli; Levent Sagun

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly.
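The gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' released code: the abstract gives no formula, so the 1D patch layout, the quadratic distance score, the sigmoid gate, and all names here (`gated_attention`, `gate_logit`, `locality_strength`) are assumptions chosen for illustration. The sketch blends a content-based attention map with a position-based one, so a head that starts local can shift toward content by lowering its gate.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(queries, keys, gate_logit, locality_strength=1.0):
    """Single-head attention over n patches laid out on a line (assumption).

    gate = sigmoid(gate_logit) weights the positional map;
    (1 - gate) weights the content map. Each row stays a probability
    distribution because it is a convex combination of two softmaxes.
    """
    n = len(queries)
    gate = sigmoid(gate_logit)
    attention = []
    for i in range(n):
        # Content scores: dot product between query i and every key.
        content_scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) for j in range(n)
        ]
        # Positional scores: nearby patches score higher (hypothetical
        # quadratic penalty on distance, mimicking a local receptive field).
        position_scores = [-locality_strength * (i - j) ** 2 for j in range(n)]
        content = softmax(content_scores)
        position = softmax(position_scores)
        attention.append(
            [(1.0 - gate) * c + gate * p for c, p in zip(content, position)]
        )
    return attention


if __name__ == "__main__":
    patches = [[0.0, 0.0] for _ in range(4)]  # dummy embeddings
    local = gated_attention(patches, patches, gate_logit=10.0, locality_strength=5.0)
    free = gated_attention(patches, patches, gate_logit=-10.0)
    print([round(a, 3) for a in local[1]])  # mass concentrated near patch 1
    print([round(a, 3) for a in free[1]])   # close to uniform for dummy inputs
```

With a large positive gate logit the head behaves like a local, convolution-like filter; with a large negative one it falls back to ordinary content-based self-attention.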

updated: Fri Mar 19 2021 09:11:20 GMT+0000 (UTC)

published: Fri Mar 19 2021 09:11:20 GMT+0000 (UTC)

arXiv

