Dual Vision Transformer

Ting Yao; Yehao Li; Yingwei Pan; Yu Wang; Xiao-Ping Zhang; Tao Mei

Prior works have proposed several strategies to reduce the computational cost of the self-attention mechanism. Many of these works decompose the self-attention procedure into regional and local feature extraction procedures, each of which incurs a much smaller computational complexity. However, regional information is typically obtained only at the expense of undesirable information loss caused by down-sampling. In this paper, we propose a novel Transformer architecture that aims to mitigate the cost issue, named Dual Vision Transformer (Dual-ViT). The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with a reduced order of complexity. These compressed global semantics then serve as useful prior information for learning finer pixel-level details through a separately constructed pixel pathway. The semantic pathway and pixel pathway are then integrated and jointly trained, spreading the enhanced self-attention information in parallel through both pathways. Dual-ViT is hence able to reduce the computational complexity without compromising much accuracy. We empirically demonstrate that Dual-ViT achieves higher accuracy than SOTA Transformer architectures with reduced training complexity. Source code is available at https://github.com/YehLi/ImageNetModel.
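
The two-pathway idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a simplified illustration in plain Python that assumes average pooling as a stand-in for the learned compression in the semantic pathway. The names `attend`, `average_pool`, and `score_count` are hypothetical helpers introduced only for this example. The sketch shows why letting n pixel tokens attend to m compressed semantic tokens needs n·m attention scores rather than the n·n needed by full self-attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def average_pool(tokens, num_groups):
    """Assumed compression step (not from the paper): average consecutive
    tokens into num_groups summaries. Expects len(tokens) to be divisible
    by num_groups."""
    size = len(tokens) // num_groups
    pooled = []
    for g in range(num_groups):
        group = tokens[g * size:(g + 1) * size]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def score_count(num_queries, num_keys):
    """Number of query-key scores one attention call computes."""
    return num_queries * num_keys


# Toy input: 16 pixel tokens, each a 4-dimensional vector.
pixel_tokens = [[float(i + j) for j in range(4)] for i in range(16)]

# "Semantic pathway" stand-in: compress 16 tokens into 4 summaries.
semantic_tokens = average_pool(pixel_tokens, 4)

# "Pixel pathway" stand-in: every pixel token attends to the summaries,
# which act as a global prior.
refined = attend(pixel_tokens, semantic_tokens, semantic_tokens)

full_cost = score_count(len(pixel_tokens), len(pixel_tokens))
dual_cost = score_count(len(pixel_tokens), len(semantic_tokens))
print(full_cost, dual_cost)
```

With 16 pixel tokens and 4 semantic tokens, the sketch computes 64 attention scores instead of 256. In the actual architecture the compression is learned and the two pathways are trained jointly, which this toy example does not model.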

updated: Mon Jul 11 2022 16:03:44 GMT+0000 (UTC)

published: Mon Jul 11 2022 16:03:44 GMT+0000 (UTC)

arXiv
