Efficient Attention-free Video Shift Transformers

Adrian Bulat; Brais Martinez; Georgios Tzimiropoulos

This paper tackles the problem of efficient video recognition. In this area, video transformers have recently dominated the efficiency (top-1 accuracy vs. FLOPs) spectrum. At the same time, there have been some attempts in the image domain that challenge the necessity of the self-attention operation within the transformer architecture, advocating the use of simpler approaches for token mixing. However, there are no results yet for video recognition, where the self-attention operator has a significantly higher impact on efficiency than it does for images. To address this gap, in this paper, we make the following contributions: (a) We construct a highly efficient and accurate attention-free block based on the shift operator, coined the Affine-Shift block, specifically designed to approximate as closely as possible the operations in the MHSA block of a Transformer layer. Based on our Affine-Shift block, we construct our Affine-Shift Transformer and show that it already outperforms all existing shift/MLP-based architectures for ImageNet classification. (b) We extend our formulation to the video domain to construct the Video Affine-Shift Transformer (VAST), the very first purely attention-free shift-based video transformer. (c) We show that VAST significantly outperforms recent state-of-the-art transformers on the most popular action recognition benchmarks for models with a low computational and memory footprint. Code will be made available.
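For readers unfamiliar with shift-based token mixing, the following is a minimal sketch of a generic zero-padded channel shift along a single token axis. It is not the Affine-Shift block from the paper, whose details the abstract does not give; the function name `shift_tokens`, the `fraction` parameter, and the three-way channel split are assumptions made purely for illustration.

```python
def shift_tokens(tokens, fraction=0.25):
    """Generic zero-padded channel shift along one token axis.

    Illustrative only: NOT the paper's Affine-Shift block.
    tokens: list of N tokens, each a list of C floats.
    The first `fraction` of channels is copied from the previous token,
    the next `fraction` from the following token, and the remaining
    channels are left in place.
    """
    n = len(tokens)
    c = len(tokens[0])
    k = int(c * fraction)  # channels per shifted group (hypothetical choice)
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):
            if ch < k:
                src = i - 1  # pull from the previous position
            elif ch < 2 * k:
                src = i + 1  # pull from the next position
            else:
                src = i      # channel stays where it is
            if 0 <= src < n:  # out-of-range sources stay zero (padding)
                out[i][ch] = tokens[src][ch]
    return out


# Three tokens with four channels each; with fraction=0.25 one channel is
# shifted in each direction and two channels are left untouched.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0],
          [9.0, 10.0, 11.0, 12.0]]
mixed = shift_tokens(tokens, fraction=0.25)
print(mixed)
```

The shift itself has no learned parameters and performs no multiplications, which is why operators of this kind are attractive as a cheap substitute for attention-based mixing. For video, the same idea could in principle be applied along the temporal and spatial axes of the token grid.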

updated: Tue Aug 23 2022 17:48:29 GMT+0000 (UTC)

published: Tue Aug 23 2022 17:48:29 GMT+0000 (UTC)

arXiv
