Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius; Heng Wang; Lorenzo Torresani

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
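As a rough illustration of the "divided attention" idea described above, the following sketch lists which tokens a single patch is compared with in a temporal step and in a spatial step, and contrasts that with joint attention over all patches in all frames. It is a minimal sketch and not the authors' implementation: the function names, the frame-by-frame token ordering, and the example sizes (8 frames, 196 patches per frame) are assumptions made for illustration, and the classification token, learned projections, and multi-head details are left out.

```python
def token_index(frame, patch, patches_per_frame):
    """Flatten (frame, patch) into a position in a frame-by-frame token sequence.

    The frame-major ordering is an assumption made for this sketch.
    """
    return frame * patches_per_frame + patch


def temporal_keys(frame, patch, num_frames, patches_per_frame):
    """Keys for the temporal step: the same patch location in every frame."""
    return [token_index(f, patch, patches_per_frame) for f in range(num_frames)]


def spatial_keys(frame, patch, num_frames, patches_per_frame):
    """Keys for the spatial step: every patch within the query's own frame."""
    return [token_index(frame, p, patches_per_frame) for p in range(patches_per_frame)]


def joint_keys(frame, patch, num_frames, patches_per_frame):
    """Keys for joint space-time attention: every patch in every frame."""
    return list(range(num_frames * patches_per_frame))


def comparisons_per_block(num_frames, patches_per_frame):
    """Count query-key comparisons in one block (classification token ignored)."""
    tokens = num_frames * patches_per_frame
    joint = tokens * tokens
    # Divided attention: one temporal step plus one spatial step per token.
    divided = tokens * (num_frames + patches_per_frame)
    return joint, divided


# Illustrative sizes only (assumed): 8 frames, 14 x 14 = 196 patches per frame.
num_frames, patches_per_frame = 8, 196
joint, divided = comparisons_per_block(num_frames, patches_per_frame)
print("joint comparisons per block:  ", joint)
print("divided comparisons per block:", divided)
```

Under these assumptions, each query in the divided scheme is compared with `num_frames + patches_per_frame` keys rather than `num_frames * patches_per_frame`, which is one way to see why the scheme remains tractable as clips get longer.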

updated: Fri Apr 16 2021 14:41:50 GMT+0000 (UTC)

published: Tue Feb 09 2021 19:49:33 GMT+0000 (UTC)

arXiv
