Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Xuefan Zha; Wentao Zhu; Tingxun Lv; Sen Yang; Ji Liu

Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn intra-frame and inter-frame features. Recently, Transformer models have come to dominate research in natural language processing (NLP), image classification, and other areas. However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from tiny patches. To tackle this training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs from NLP, the shifted chunk Transformer can learn hierarchical spatio-temporal features, from a local tiny patch to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a Transformer-based clip encoder to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and hyper-parameter of the shifted chunk Transformer, and the model outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.

updated: Thu Oct 28 2021 02:54:22 GMT+0000 (UTC)

published: Thu Aug 26 2021 04:34:33 GMT+0000 (UTC)

arXiv
