TSI: Temporal Saliency Integration for Video Action Recognition

Haisheng Su; Jinyuan Feng; Dongliang Wang; Weihao Gan; Wei Wu; Yu Qiao

Efficient spatiotemporal modeling is an important yet challenging problem for video action recognition. Existing state-of-the-art methods exploit motion cues to assist short-term temporal modeling by taking temporal differences over consecutive frames. However, camera movement inevitably introduces noise into these differences. In addition, the motion of different actions can vary greatly. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module. Specifically, SME aims to highlight motion-sensitive areas through local-global motion modeling, in which saliency alignment and pyramidal feature differencing are performed successively between neighboring frames to capture motion dynamics with less noise from misaligned backgrounds. CTI performs multi-scale temporal modeling through a group of separate 1D convolutions, while temporal interactions across the different scales are integrated with an attention mechanism. Through these two modules, long- and short-term temporal relationships can be encoded efficiently while introducing only a limited number of additional parameters. Extensive experiments on several popular benchmarks (i.e., Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51) demonstrate the effectiveness and superiority of the proposed method.
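To make the two ideas in the abstract more concrete, the following is a minimal NumPy sketch of (a) gating features by the temporal difference between neighboring frames and (b) fusing several temporal 1D convolutions of different kernel sizes with softmax attention across scales. It is an illustration only, not the authors' implementation: the function names, the sigmoid gate, the fixed averaging kernels, the kernel sizes (1, 3, 5), and the pooling used to score each scale are all assumptions, and the saliency alignment and pyramidal differencing steps of SME are omitted because the abstract does not specify them.

```python
import numpy as np


def temporal_difference(feats):
    """Difference between neighboring frames.

    feats: array of shape (T, C, H, W). The last time step is zero-padded
    so the output keeps the same shape (an assumed convention).
    """
    diff = feats[1:] - feats[:-1]
    pad = np.zeros_like(feats[:1])
    return np.concatenate([diff, pad], axis=0)


def motion_excitation(feats):
    """Hypothetical motion gate: channels with larger frame-to-frame change
    receive a larger sigmoid weight, added back residually."""
    diff = temporal_difference(feats)
    pooled = np.abs(diff).mean(axis=(2, 3), keepdims=True)  # (T, C, 1, 1)
    gate = 1.0 / (1.0 + np.exp(-pooled))                    # sigmoid
    return feats + feats * gate


def temporal_conv1d(x, kernel):
    """Zero-padded 1D convolution along time, one kernel shared by all channels.

    x: (T, C); kernel: 1D array of odd length.
    """
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack(
        [(xp[t:t + k] * kernel[:, None]).sum(axis=0) for t in range(x.shape[0])]
    )


def cross_scale_integration(x, kernel_sizes=(1, 3, 5)):
    """Hypothetical multi-scale temporal fusion.

    Each branch applies an averaging kernel of a different (odd) size; a
    softmax over scales, computed from time-pooled branch outputs, weights
    the branches per channel. Learned kernels would replace the averages
    in a trained model.
    """
    branches = np.stack(
        [temporal_conv1d(x, np.full(k, 1.0 / k)) for k in kernel_sizes]
    )                                                    # (S, T, C)
    scores = branches.mean(axis=1)                       # (S, C)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (branches * weights[:, None, :]).sum(axis=0)  # (T, C)


# Usage on random data: 8 frames, 4 channels, 7x7 spatial resolution.
rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 4, 7, 7))
excited = motion_excitation(clip)
per_frame = excited.mean(axis=(2, 3))  # pool space before temporal modeling
fused = cross_scale_integration(per_frame)
print(excited.shape, fused.shape)
```

Small kernels in this sketch respond to short-term changes and larger ones to longer temporal context, which is one plausible reading of how separate 1D convolutions could cover both long- and short-term relationships at low parameter cost.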

updated: Wed Sep 08 2021 07:25:18 GMT+0000 (UTC)

published: Wed Jun 02 2021 11:43:49 GMT+0000 (UTC)

arXiv
