Temporal-attentive Covariance Pooling Networks for Video Recognition

Zilin Gao; Qilong Wang; Bingbing Zhang; Qinghua Hu; Peihua Li

For video recognition tasks, a global representation summarizing the whole content of a video snippet plays an important role in the final performance. However, existing video architectures usually generate it with simple global average pooling (GAP), which has limited ability to capture the complex dynamics of videos. For image recognition tasks, there is evidence that covariance pooling has stronger representation ability than GAP. Unfortunately, the plain covariance pooling used in image recognition is an orderless representation, which cannot model the spatio-temporal structure inherent in videos. Therefore, this paper proposes Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifically, TCP first develops a temporal attention module to adaptively calibrate spatio-temporal features for the succeeding covariance pooling, approximately producing attentive covariance representations. Then, a temporal covariance pooling performs temporal pooling of the attentive covariance representations to characterize both intra-frame correlations and inter-frame cross-correlations of the calibrated features. As such, the proposed TCP can capture complex temporal dynamics. Finally, a fast matrix power normalization is introduced to exploit the geometry of covariance representations. Note that TCP is model-agnostic and can be flexibly integrated into any video architecture, resulting in TCPNet for effective video recognition. Extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show that TCPNet is clearly superior to its counterparts while having strong generalization ability. The source code is publicly available.
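To make the contrast between GAP and covariance pooling concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes features of shape (T, N, C) for T frames, N spatial positions and C channels. All function names are hypothetical, the `weights` argument is only a stand-in for the learned temporal attention module, the sketch covers intra-frame covariances only (no inter-frame cross-correlations), and the matrix square root is computed by a plain eigendecomposition rather than the fast normalization the abstract refers to.

```python
import numpy as np


def global_average_pooling(feats):
    """First-order baseline: average over frames and positions, giving a (C,) vector."""
    return feats.mean(axis=(0, 1))


def frame_covariance(frame):
    """Second-order statistics of one frame: (N, C) features to a (C, C) covariance."""
    centered = frame - frame.mean(axis=0, keepdims=True)
    return centered.T @ centered / frame.shape[0]


def temporal_covariance_pooling(feats, weights=None):
    """Weighted average of per-frame covariances.

    `weights` is a hypothetical per-frame weighting standing in for temporal
    attention; uniform weights reduce this to a plain temporal average.
    """
    num_frames = feats.shape[0]
    if weights is None:
        weights = np.full(num_frames, 1.0 / num_frames)
    covs = np.stack([frame_covariance(f) for f in feats])  # (T, C, C)
    return np.tensordot(weights, covs, axes=1)  # (C, C)


def matrix_power_normalize(cov, power=0.5, eps=1e-10):
    """Matrix power of a symmetric PSD matrix via eigendecomposition (assumed, not the fast variant)."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, eps, None)  # guard against tiny negative eigenvalues
    return (eigvecs * eigvals**power) @ eigvecs.T


# Toy input: 8 frames, 7x7 = 49 positions, 16 channels (random data, for shapes only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 49, 16))

gap = global_average_pooling(feats)          # shape (16,)
pooled = temporal_covariance_pooling(feats)  # shape (16, 16)
normalized = matrix_power_normalize(pooled)  # shape (16, 16)
```

The GAP output keeps only C numbers per clip, while the covariance representation keeps a C x C matrix of channel co-activations, which is the extra capacity the abstract attributes to covariance pooling.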

updated: Thu Oct 28 2021 01:49:03 GMT+0000 (UTC)

published: Wed Oct 27 2021 12:31:29 GMT+0000 (UTC)

arXiv
