DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition

Yuxuan Liang; Pan Zhou; Roger Zimmermann; Shuicheng Yan

While transformers have shown great potential for video recognition thanks to their strong ability to capture long-range dependencies, they often suffer from high computational costs induced by self-attention over a huge number of 3D tokens. In this paper, we present a new transformer architecture termed DualFormer, which efficiently performs space-time attention for video recognition. Concretely, DualFormer stratifies full space-time attention into two cascaded levels: it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Unlike existing methods that apply space-time factorization or restrict attention computation to local windows to improve efficiency, our local-global stratification strategy captures both short- and long-range spatiotemporal dependencies while greatly reducing the number of keys and values in attention computation. Experimental results on five video benchmarks verify the superiority of DualFormer over existing methods. In particular, DualFormer achieves 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with ~1000G inference FLOPs, at least 3.2x fewer than existing methods with similar performance. We have released the source code at https://github.com/sail-sg/dualformer.
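The two-level idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, the window and pooling sizes, the use of average pooling to build the pyramid contexts, and the omission of learned projections, multiple heads, and residual connections are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; q: (..., Nq, C), k and v: (..., Nk, C).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def local_window_attention(x, window):
    # Level 1 (assumed form): full attention inside non-overlapping 3D windows.
    # x: (T, H, W, C); window sizes must divide T, H, W.
    T, H, W, C = x.shape
    wt, wh, ww = window
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, C)
    x = attention(x, x, x)  # identity projections (assumption)
    x = x.reshape(T // wt, H // wh, W // ww, wt, wh, ww, C)
    return x.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)

def avg_pool3d(x, pool):
    # Average-pool tokens over non-overlapping 3D blocks, returning (N_ctx, C).
    T, H, W, C = x.shape
    pt, ph, pw = pool
    x = x.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    return x.mean(axis=(1, 3, 5)).reshape(-1, C)

def global_pyramid_attention(x, pools):
    # Level 2 (assumed form): every token queries a small multi-scale context set.
    T, H, W, C = x.shape
    ctx = np.concatenate([avg_pool3d(x, p) for p in pools], axis=0)
    out = attention(x.reshape(-1, C), ctx, ctx).reshape(T, H, W, C)
    return out, ctx.shape[0]

def stratified_attention_sketch(x, window, pools):
    # Cascade: local windows first, then global pyramid contexts.
    x = local_window_attention(x, window)
    return global_pyramid_attention(x, pools)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((4, 8, 8, 16))  # T=4, H=8, W=8, C=16
    window = (2, 4, 4)
    out, n_ctx = stratified_attention_sketch(tokens, window, [(2, 4, 4), (4, 8, 8)])
    keys_stratified = int(np.prod(window)) + n_ctx
    keys_full = int(np.prod(tokens.shape[:3]))
    print(out.shape, keys_stratified, keys_full)
```

With these illustrative sizes, each query attends to 32 local keys plus 9 pooled contexts (41 in total) instead of all 256 tokens, which is the kind of key/value reduction the abstract refers to. The actual savings in the paper depend on its own window sizes and pyramid design.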

updated: Tue Nov 22 2022 09:41:50 GMT+0000 (UTC)

published: Thu Dec 09 2021 03:05:19 GMT+0000 (UTC)

arXiv
