Group Contextualization for Video Recognition

Yanbin Hao; Hao Zhang; Chong-Wah Ngo; Xiangnan He

Learning discriminative representations from the complex spatio-temporal dynamic space is essential for video recognition. On top of stylized spatio-temporal computational units, further refining the learnt features with axial contexts has been shown to be promising for achieving this goal. However, previous works generally focus on using a single kind of context to calibrate entire feature channels and can hardly be applied to diverse video activities. The problem can be tackled by using pair-wise spatio-temporal attention to recompute feature responses with cross-axis contexts, but at the expense of heavy computation. In this paper, we propose an efficient feature refinement method that decomposes the feature channels into several groups and refines them separately, in parallel, with different axial contexts. We refer to this lightweight feature calibration as group contextualization (GC). Specifically, we design a family of efficient element-wise calibrators, i.e., ECal-G/S/T/L, whose axial contexts are information dynamics aggregated from other axes either globally or locally, to contextualize feature channel groups. The GC module can be densely plugged into each residual layer of off-the-shelf video networks. With little computational overhead, consistent improvement is observed when plugging GC into different networks. By using calibrators to embed features with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities. On videos with rich temporal variations, GC can empirically boost the performance of 2D-CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks. Code is available at https://github.com/haoyanbin918/Group-Contextualization.
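To make the grouping idea concrete, here is a minimal NumPy sketch of channel-group calibration. It is not the authors' implementation (see the linked repository for that). It assumes a single feature map of shape (C, T, H, W), four equal channel groups, parameter-free mean pooling as the context aggregator, and a sigmoid gate as the element-wise calibrator. The mapping of the names G/S/T/L to "global", "spatial", "temporal" and "local" contexts, the 3-frame local window, and the function name `group_contextualize` are all illustrative assumptions; a trained module would use learned transforms in place of plain pooling.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def group_contextualize(x):
    """Gate four channel groups of x with four different pooled contexts.

    x: array of shape (C, T, H, W) with C divisible by 4.
    Returns an array of the same shape.
    Hypothetical sketch: pooling choices and gating are assumptions.
    """
    if x.ndim != 4 or x.shape[0] % 4 != 0:
        raise ValueError("expected shape (C, T, H, W) with C divisible by 4")

    # Decompose the channels into four equal groups.
    g, s, t, l = np.split(x, 4, axis=0)

    # Assumed "global" context: pool over time and space -> (C/4, 1, 1, 1).
    ctx_g = g.mean(axis=(1, 2, 3), keepdims=True)
    # Assumed "spatial" context: pool over time -> (C/4, 1, H, W).
    ctx_s = s.mean(axis=1, keepdims=True)
    # Assumed "temporal" context: pool over space -> (C/4, T, 1, 1).
    ctx_t = t.mean(axis=(2, 3), keepdims=True)
    # Assumed "local" context: 3-frame temporal moving average,
    # with edge frames repeated as padding -> (C/4, T, H, W).
    padded = np.pad(l, ((0, 0), (1, 1), (0, 0), (0, 0)), mode="edge")
    ctx_l = (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0

    # Element-wise calibration: each group is scaled by a gate in (0, 1)
    # computed from its own context; broadcasting expands the pooled axes.
    groups = ((g, ctx_g), (s, ctx_s), (t, ctx_t), (l, ctx_l))
    calibrated = [feat * sigmoid(ctx) for feat, ctx in groups]

    # Reassemble the groups along the channel axis.
    return np.concatenate(calibrated, axis=0)


if __name__ == "__main__":
    features = np.random.default_rng(0).standard_normal((8, 4, 5, 5))
    refined = group_contextualize(features)
    print(refined.shape)
```

Because every group is processed independently and only pooling plus an element-wise product is involved, the extra cost in this sketch stays linear in the size of the feature map, in contrast to pair-wise attention over all spatio-temporal positions.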

updated: Fri Mar 18 2022 01:49:40 GMT+0000 (UTC)

published: Fri Mar 18 2022 01:49:40 GMT+0000 (UTC)

arXiv

