CabViT: Cross Attention among Blocks for Vision Transformer

Haokui Zhang; Wenze Hu; Xiaoyu Wang

Since the vision transformer (ViT) achieved impressive performance in image classification, a growing number of researchers have turned their attention to designing more efficient vision transformer models. A common line of research reduces the computational cost of self-attention modules by adopting sparse attention or local attention windows. In contrast, we propose to design high-performance transformer-based architectures by densifying the attention pattern. Specifically, we propose cross attention among blocks of ViT (CabViT), which uses tokens from previous blocks in the same stage as extra input to the multi-head attention of transformers. The proposed CabViT enhances the interaction of tokens across blocks with potentially different semantics and encourages more information to flow to the lower levels, which together improve model performance and convergence at limited extra cost. Based on the proposed CabViT, we design a series of CabViT models that achieve the best trade-off between model size, computational cost, and accuracy. For instance, without knowledge distillation to strengthen training, CabViT achieves 83.0% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.9G FLOPs. Compared with ConvNeXt, it saves almost half of the parameters and 13% of the computational cost while gaining 0.9% higher accuracy; compared with the distilled EfficientFormer, it uses 52% of the parameters while gaining 0.6% accuracy.
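The core idea in the abstract, attention that also receives tokens from previous blocks in the same stage, can be illustrated with a minimal sketch. The abstract does not say how the extra tokens enter the attention, so this sketch assumes they are concatenated to the keys and values along the token axis while the queries come only from the current block. It uses a single head and plain Python lists, and every name in it (`cross_block_attention`, `random_matrix`, the weight matrices) is hypothetical and not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_block_attention(current, previous, w_q, w_k, w_v):
    """Single-head attention whose keys and values also cover earlier-block tokens.

    current:  tokens of the current block, shape (n, d)
    previous: list of token matrices from earlier blocks in the same stage
    """
    # Queries come only from the current block's tokens.
    q = matmul(current, w_q)
    # Assumption: earlier-block tokens are concatenated along the token axis.
    context = current + [tok for block in previous for tok in block]
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = 1.0 / math.sqrt(len(w_q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Each query row gets a distribution over current and earlier tokens.
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
n, d = 4, 8  # tokens per block, feature dimension (illustrative values)
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
block1 = random_matrix(n, d, rng)  # output tokens of an earlier block
block2 = random_matrix(n, d, rng)  # input tokens of the current block
out, weights = cross_block_attention(block2, [block1], w_q, w_k, w_v)
print(len(out), len(out[0]), len(weights[0]))  # 4 8 8
```

Under this assumption the output keeps the current block's shape, and only the attention matrix widens, from n to 2n columns here. That is one way the attention pattern could become denser while the added cost grows with the number of earlier blocks that are attended to.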

updated: Mon Nov 14 2022 08:43:44 GMT+0000 (UTC)

published: Mon Nov 14 2022 08:43:44 GMT+0000 (UTC)

arXiv
