Vision Transformers with Patch Diversification

Chengyue Gong; Dilin Wang; Meng Li; Vikas Chandra; Qiang Liu

Vision transformers have demonstrated promising performance on challenging computer vision tasks. However, training them directly may yield unstable and sub-optimal results. Recent works propose to improve the performance of vision transformers by modifying the transformer structure, e.g., by incorporating convolution layers. In contrast, we investigate an orthogonal approach that stabilizes vision transformer training without modifying the network. We observe that the training instability can be attributed to the high similarity across the extracted patch representations. More specifically, in deep vision transformers, the self-attention blocks tend to map different patches into similar latent representations, causing information loss and performance degradation. To alleviate this problem, we introduce novel loss functions in vision transformer training that explicitly encourage diversity across patch representations for more discriminative feature extraction. We empirically show that the proposed techniques stabilize training and allow us to train wider and deeper vision transformers. We further show that the diversified features significantly benefit downstream tasks in transfer learning. For semantic segmentation, we improve the state-of-the-art (SOTA) results on Cityscapes and ADE20k. Our code will be made publicly available soon.
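The abstract does not spell out the loss functions, so the following is only a minimal sketch of one way a diversity term over patch representations could look, assuming a penalty on the mean absolute cosine similarity between distinct patches. The names `cosine`, `patch_similarity_penalty`, `total_loss`, and the `weight` parameter are hypothetical and are not taken from the paper; the authors' actual formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def patch_similarity_penalty(patches):
    """Mean absolute cosine similarity over all distinct patch pairs.

    Assumption: `patches` is a list of per-patch feature vectors taken
    from one transformer layer. The value is near 1 when all patches
    have collapsed to the same direction and 0 when they are mutually
    orthogonal.
    """
    n = len(patches)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(cosine(patches[i], patches[j]))
    return total / (n * (n - 1) / 2)


def total_loss(task_loss, patches, weight=0.5):
    """Hypothetical combined objective: task loss plus a weighted penalty."""
    return task_loss + weight * patch_similarity_penalty(patches)


# Collapsed patches: every vector points in the same direction.
collapsed = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
# Diverse patches: mutually orthogonal vectors.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

collapsed_penalty = patch_similarity_penalty(collapsed)
diverse_penalty = patch_similarity_penalty(diverse)
```

Under this assumed penalty, the collapsed example scores close to 1 and the orthogonal example scores 0, so adding the term to the training objective pushes patch representations apart. In practice such a term would be computed on batched tensors in a deep learning framework rather than on Python lists.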

updated: Thu Jun 10 2021 05:55:42 GMT+0000 (UTC)

published: Mon Apr 26 2021 17:43:04 GMT+0000 (UTC)

arXiv
