TSI: Temporal Saliency Integration for Video Action Recognition

Haisheng Su; Kunchang Li; Jinyuan Feng; Dongliang Wang; Weihao Gan; Wei Wu; Yu Qiao

Efficient spatiotemporal modeling is an important yet challenging problem in video action recognition. Existing state-of-the-art methods exploit differences between neighboring features to obtain motion cues for short-term temporal modeling with a simple convolution. However, a single local convolution cannot handle diverse kinds of actions because of its limited receptive field. Moreover, action-irrelevant noise introduced by camera movement degrades the quality of the extracted motion features. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly consists of a Salient Motion Excitation (SME) module and a Cross-perception Temporal Integration (CTI) module. Specifically, SME highlights motion-sensitive areas through spatial-level local-global motion modeling, in which saliency alignment and pyramidal motion modeling are performed successively between adjacent frames to capture motion dynamics with less noise from misaligned backgrounds. CTI performs multi-perception temporal modeling through a group of separate 1D convolutions, while temporal interactions across the different perceptions are integrated with an attention mechanism. Together, these two modules encode both long-term and short-term temporal relationships efficiently while introducing only a limited number of additional parameters. Extensive experiments on several popular benchmarks (Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51) demonstrate the effectiveness of the proposed method.
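The multi-perception idea attributed to CTI above (several separate 1D temporal convolutions whose outputs are combined with attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the kernel sizes (1, 3, 5), the averaging kernels standing in for learned weights, the mean-based attention score, and the function names `conv1d_same`, `softmax`, and `multi_perception_fuse` are all assumptions made for illustration, and the input is a single-channel per-frame feature rather than a real feature map.

```python
import math


def conv1d_same(seq, kernel):
    """Slide `kernel` over `seq` along time with zero padding.

    Assumes an odd kernel length so the output has the same length as `seq`.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(seq))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_perception_fuse(seq, kernel_sizes=(1, 3, 5)):
    """Run one temporal convolution per receptive field, then fuse them.

    Hypothetical choices: averaging kernels replace learned weights, and
    each branch is scored by its temporal mean before the softmax.
    """
    # One branch per temporal receptive field ("perception").
    branches = [conv1d_same(seq, [1.0 / k] * k) for k in kernel_sizes]
    # Attention over branches: one scalar weight per branch.
    scores = [sum(b) / len(b) for b in branches]
    weights = softmax(scores)
    # Weighted sum of the branch outputs at every time step.
    fused = [
        sum(w * b[t] for w, b in zip(weights, branches))
        for t in range(len(seq))
    ]
    return fused, weights


# Toy per-frame feature for a 5-frame clip (made-up values).
features = [0.0, 1.0, 4.0, 1.0, 0.0]
fused, weights = multi_perception_fuse(features)
```

Here the size-1 branch leaves each frame's feature unchanged, while the size-3 and size-5 branches mix in progressively more temporal context; the softmax weights decide how much each receptive field contributes to the fused output. In a trained model, both the kernels and the attention scores would be learned.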

updated: Wed Dec 15 2021 06:54:09 GMT+0000 (UTC)

published: Wed Jun 02 2021 11:43:49 GMT+0000 (UTC)

arXiv
