Fcaformer: Forward Cross Attention in Hybrid Vision Transformer

Haokui Zhang; Wenze Hu; Xiaoyu Wang

Currently, one main line of research on designing more efficient vision transformers is reducing the computational cost of self-attention modules by adopting sparse attention or using local attention windows. In contrast, we propose a different approach that aims to improve the performance of transformer-based architectures by densifying the attention pattern. Specifically, we propose forward cross attention for hybrid vision transformers (FcaFormer), in which tokens from previous blocks in the same stage are reused. To achieve this, FcaFormer leverages two innovative components: learnable scale factors (LSFs) and a token merge and enhancement module (TME). The LSFs enable efficient processing of cross tokens, while the TME generates representative cross tokens. By integrating these components, the proposed FcaFormer enhances the interactions of tokens across blocks with potentially different semantics, and encourages more information to flow to the lower levels. Based on forward cross attention (Fca), we have designed a series of FcaFormer models that achieve the best trade-off between model size, computational cost, memory cost, and accuracy. For example, without the need for knowledge distillation to strengthen training, our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs. Compared to the distilled EfficientFormer, this saves almost half of the parameters and some computational cost while achieving 0.7% higher accuracy.
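The idea of reusing tokens from earlier blocks can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes single-head attention, uses plain average pooling as a stand-in for the TME, and treats the LSF as a per-channel vector multiplied into the cross tokens. Queries come only from the current block's tokens, while keys and values are computed from the current tokens concatenated with the merged, scaled cross tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def merge_tokens(tokens, group):
    # Assumption: average pooling over groups of consecutive tokens
    # stands in for the token merge and enhancement module (TME).
    n, d = tokens.shape
    usable = n - n % group  # drop a trailing partial group, if any
    return tokens[:usable].reshape(-1, group, d).mean(axis=1)

def forward_cross_attention(x, prev_tokens, scale, wq, wk, wv, group=4):
    # x:           (n, d) tokens of the current block
    # prev_tokens: (m, d) tokens saved from an earlier block in the same stage
    # scale:       (d,)   hypothetical per-channel learnable scale factor (LSF)
    cross = merge_tokens(prev_tokens, group) * scale   # (m // group, d)
    kv_in = np.concatenate([x, cross], axis=0)          # current + cross tokens
    q = x @ wq                                          # queries: current tokens only
    k = kv_in @ wk
    v = kv_in @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))      # (n, n + m // group)
    return attn @ v, attn
```

With 16 current tokens, 16 previous tokens, and group=4, each query attends over 16 + 4 = 20 keys, so the attention pattern is denser than plain self-attention while the output keeps the shape of the current tokens.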

updated: Mon Mar 20 2023 03:43:27 GMT+0000 (UTC)

published: Mon Nov 14 2022 08:43:44 GMT+0000 (UTC)

arXiv
