DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition

Yuxuan Liang; Pan Zhou; Roger Zimmermann; Shuicheng Yan

Transformers have shown great potential for video recognition thanks to their strong ability to capture long-range dependencies, but they often suffer from the high computational cost of self-attention over the huge number of 3D tokens in a video. In this paper, we propose a new transformer architecture, termed DualFormer, which performs space-time attention for video recognition both effectively and efficiently. Specifically, DualFormer stratifies full space-time attention into two cascaded levels: it first learns fine-grained local space-time interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between each query token and global pyramid contexts. Unlike existing methods that improve efficiency by applying space-time factorization or by restricting attention to local windows, our local-global stratified strategy captures both short- and long-range spatiotemporal dependencies while greatly reducing the number of keys and values in the attention computation. Experimental results on five video benchmarks show the superiority of DualFormer over existing methods. In particular, DualFormer sets new state-of-the-art top-1 accuracies of 82.9% and 85.2% on Kinetics-400 and Kinetics-600, respectively, with around 1000G inference FLOPs, at least 3.2 times fewer than existing methods with similar performance. We have released our code at https://github.com/sail-sg/dualformer.
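The two cascaded attention levels described in the abstract can be illustrated with a small NumPy sketch. This is not the released DualFormer code: it is a single-head toy version with no learned projections, the window and pooling sizes are arbitrary, and building the pyramid contexts by average pooling at several scales is an assumption made here for illustration. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention; q: (..., Nq, C), k and v: (..., Nk, C)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def block_partition(x, size):
    """Split (T, H, W, C) tokens into non-overlapping 3D blocks: (num_blocks, t*h*w, C)."""
    T, H, W, C = x.shape
    t, h, w = size
    x = x.reshape(T // t, t, H // h, h, W // w, w, C)
    return x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w, C)

def block_merge(blocks, size, shape):
    """Inverse of block_partition."""
    T, H, W, C = shape
    t, h, w = size
    x = blocks.reshape(T // t, H // h, W // w, t, h, w, C)
    return x.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)

def local_attention(x, window):
    """Level 1: fine-grained attention among tokens inside each 3D window."""
    blocks = block_partition(x, window)
    return block_merge(attend(blocks, blocks, blocks), window, x.shape)

def pyramid_contexts(x, pool_sizes):
    """Coarse contexts: average-pool tokens at several scales (assumed pooling scheme)."""
    levels = [block_partition(x, p).mean(axis=1) for p in pool_sizes]
    return np.concatenate(levels, axis=0)  # (num_contexts, C)

def local_global_block(x, window, pool_sizes):
    """Level 1 (local) followed by level 2 (query tokens vs. global pyramid contexts)."""
    x = x + local_attention(x, window)
    ctx = pyramid_contexts(x, pool_sizes)
    q = x.reshape(-1, x.shape[-1])
    return x + attend(q, ctx, ctx).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16, 32))  # (T, H, W, C) toy video tokens
out = local_global_block(x, window=(2, 4, 4), pool_sizes=[(2, 4, 4), (4, 8, 8)])
print(out.shape)  # (8, 16, 16, 32)
```

In this toy setting each query attends to 32 keys in its local window and then to 72 pooled contexts (64 + 8), instead of all 2048 tokens as in full space-time attention, which is the kind of key/value reduction the abstract refers to.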

updated: Sun Jan 16 2022 12:36:49 GMT+0000 (UTC)

published: Thu Dec 09 2021 03:05:19 GMT+0000 (UTC)

arXiv
