Coarse-Fine Networks for Temporal Activity Detection in Videos

Kumara Kahatapitiya; Michael S. Ryoo

In this paper, we introduce 'Coarse-Fine Networks', a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional video models process inputs at one (or a few) fixed temporal resolutions without any dynamic frame selection. However, we argue that processing multiple temporal resolutions of the input, and doing so dynamically by learning to estimate the importance of each frame, can largely improve video representations, especially in the domain of temporal activity localization. To this end, we propose (1) 'Grid Pool', a learned temporal downsampling layer to extract coarse features, and (2) 'Multi-stage Fusion', a spatio-temporal attention mechanism to fuse a fine-grained context with the coarse features. We show that our method can outperform the state of the art for action detection on public datasets including Charades, with a significantly reduced compute and memory footprint.
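The abstract describes Grid Pool only as a learned temporal downsampling layer driven by per-frame importance. As a rough illustration of that general idea, and not the authors' implementation, the sketch below downsamples a sequence non-uniformly so that more output samples land where importance is high. Everything specific here is an assumption made for illustration: the function name `grid_pool_sketch`, the use of inverse-CDF sampling with linear interpolation, scalar per-frame features, and importance scores passed in by hand rather than learned.

```python
from bisect import bisect_left
from itertools import accumulate


def grid_pool_sketch(features, importance, num_out):
    """Downsample `features` to `num_out` values, sampling more densely
    where `importance` is high (hypothetical helper, not the paper's layer).

    features:   one scalar per frame (a stand-in for a feature vector)
    importance: one non-negative score per frame (assumed given, not learned)
    """
    n = len(features)
    if n == 0 or len(importance) != n or num_out < 1:
        raise ValueError("need matching non-empty inputs and num_out >= 1")
    if any(w < 0 for w in importance):
        raise ValueError("importance scores must be non-negative")

    cumulative = list(accumulate(importance))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one importance score must be positive")

    # cdf[i] is the fraction of total importance in frames 0..i-1.
    cdf = [0.0] + [c / total for c in cumulative]

    pooled = []
    for k in range(num_out):
        # Midpoint of the k-th equal-importance bin.
        target = (k + 0.5) / num_out
        # Frame i whose CDF interval (cdf[i], cdf[i+1]] contains the target.
        i = bisect_left(cdf, target) - 1
        frac = (target - cdf[i]) / (cdf[i + 1] - cdf[i])
        # Frame i spans [i, i+1] in time; shift by 0.5 to index frame centers.
        pos = max(0.0, min(n - 1.0, i + frac - 0.5))
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        # Linear interpolation between the two nearest frames.
        pooled.append((1.0 - w) * features[lo] + w * features[hi])
    return pooled


if __name__ == "__main__":
    frames = [0.0, 1.0, 2.0, 3.0]
    print(grid_pool_sketch(frames, [1, 1, 1, 1], 2))  # uniform importance
    print(grid_pool_sketch(frames, [0, 0, 1, 1], 2))  # later frames matter
```

With uniform importance the sketch reduces to ordinary evenly spaced downsampling; with importance concentrated on the last two frames, both output samples are drawn from those frames. In a trained model the importance scores would come from the network itself, which this sketch does not attempt to show.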

updated: Mon Mar 01 2021 20:48:01 GMT+0000 (UTC)

published: Mon Mar 01 2021 20:48:01 GMT+0000 (UTC)

arXiv
