ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Yufei Xu; Qiming Zhang; Jing Zhang; Dacheng Tao

Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependencies using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Alternatively, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale invariance IB and is able to learn robust feature representations for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has the intrinsic locality IB and is able to learn local features and global dependencies collaboratively. Experiments on ImageNet as well as downstream tasks demonstrate the superiority of ViTAE over the baseline transformer and concurrent works. Source code and pretrained models will be available on GitHub.
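
The idea behind the spatial pyramid reduction module can be illustrated with a toy example. The sketch below is not the authors' implementation: it works on a 1D signal in plain Python, and the kernel size (3), the dilation rates (1, 2, 3), the stride (4), and the choice to concatenate branch outputs per token are all assumptions made for illustration. It shows how convolutions with different dilation rates, applied at the same strided positions, give each downsampled token a view of the input at several scales.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Input span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation, stride):
    """Zero-padded, strided 1D dilated convolution (cross-correlation form).

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = (effective_kernel_size(k, dilation) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    # One output per stride step; the kernel taps are `dilation` samples apart.
    for start in range(0, len(signal), stride):
        out.append(sum(kernel[j] * padded[start + j * dilation] for j in range(k)))
    return out


def pyramid_reduce(signal, kernel, dilations, stride):
    """Hypothetical helper: run one branch per dilation rate and
    concatenate the branch outputs into a feature vector per token."""
    branches = [dilated_conv1d(signal, kernel, d, stride) for d in dilations]
    return [list(features) for features in zip(*branches)]


# Illustrative settings only (not taken from the paper).
signal = [1, 2, 3, 4, 5, 6, 7, 8]
tokens = pyramid_reduce(signal, kernel=[1, 1, 1], dilations=(1, 2, 3), stride=4)

for d in (1, 2, 3):
    print(f"dilation {d}: covers {effective_kernel_size(3, d)} input samples")
print(tokens)  # 2 tokens, each with one feature per dilation rate
```

With these settings the eight input samples are reduced to two tokens, and each token carries three features computed over spans of 3, 5, and 7 samples. In the 2D case described in the abstract, the same principle would apply to image patches rather than a 1D signal.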

updated: Mon Jun 07 2021 05:31:06 GMT+0000 (UTC)

published: Mon Jun 07 2021 05:31:06 GMT+0000 (UTC)

arXiv
