AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Lingchen Meng; Hengduo Li; Bor-Chun Chen; Shiyi Lan; Zuxuan Wu; Yu-Gang Jiang; Ser-Nam Lim

Built on top of self-attention mechanisms, vision transformers have recently demonstrated remarkable performance on a variety of vision tasks. While achieving excellent performance, they still incur a relatively high computational cost that scales up drastically as the number of patches, self-attention heads and transformer blocks increases. In this paper, we argue that, due to the large variations among images, the need for modeling long-range dependencies between patches differs from image to image. To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve the inference efficiency of vision transformers with a minimal drop in accuracy for image recognition. A lightweight decision network is attached to the backbone and optimized jointly with it in an end-to-end manner to produce decisions on the fly. Extensive experiments on ImageNet demonstrate that our method obtains a more than 2x improvement in efficiency compared to state-of-the-art vision transformers with only a 0.8% drop in accuracy, achieving good efficiency/accuracy trade-offs conditioned on different computational budgets. We further conduct quantitative and qualitative analyses of the learned usage policies and provide more insights into the redundancy in vision transformers.
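To make the scaling argument in the abstract concrete, the following is a rough sketch of how compute in a transformer backbone depends on the number of patches, heads and blocks that are actually executed. It is not the authors' cost model: the per-block formula is a common back-of-the-envelope approximation, the function names `block_macs` and `backbone_macs` are hypothetical, and the backbone size (197 tokens, width 384, depth 12) and keep ratios are illustrative values that are not taken from the paper.

```python
def block_macs(num_tokens, dim, head_keep=1.0):
    """Approximate multiply-accumulates for one transformer block.

    Assumption: an MLP hidden size of 4 * dim, and attention cost that
    shrinks proportionally with the fraction of heads kept (head_keep).
    """
    attn_proj = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attn_mix = 2 * num_tokens * num_tokens * dim  # QK^T scores and weighted sum of V
    mlp = 8 * num_tokens * dim * dim              # two linear layers, hidden size 4 * dim
    return head_keep * (attn_proj + attn_mix) + mlp


def backbone_macs(num_tokens, dim, depth,
                  patch_keep=1.0, head_keep=1.0, block_keep=1.0):
    """Expected cost of the backbone under average keep ratios.

    Assumption: the same keep ratios apply to every block, which is a
    simplification of a per-input, per-block usage policy.
    """
    tokens = max(1, round(num_tokens * patch_keep))
    return depth * block_keep * block_macs(tokens, dim, head_keep)


# Hypothetical backbone and keep ratios, for illustration only.
full = backbone_macs(197, 384, 12)
adaptive = backbone_macs(197, 384, 12,
                         patch_keep=0.7, head_keep=0.7, block_keep=0.8)
print(f"estimated speed-up: {full / adaptive:.2f}x")
```

Because the attention term grows quadratically with the number of tokens, dropping patches reduces cost more than linearly in that term, while skipping a whole block removes its attention and MLP cost together.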

updated: Tue Nov 30 2021 18:57:02 GMT+0000 (UTC)

published: Tue Nov 30 2021 18:57:02 GMT+0000 (UTC)

arXiv
