Coarse-Fine Networks for Temporal Activity Detection in Videos

Kumara Kahatapitiya; Michael S. Ryoo

In this paper, we introduce Coarse-Fine Networks, a two-stream architecture that exploits different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional video models process inputs at one (or a few) fixed temporal resolutions without any dynamic frame selection. However, we argue that processing multiple temporal resolutions of the input, and doing so dynamically by learning to estimate the importance of each frame, can largely improve video representations, especially in the domain of temporal activity localization. To this end, we propose (1) Grid Pool, a learned temporal downsampling layer to extract coarse features, and (2) Multi-stage Fusion, a spatio-temporal attention mechanism to fuse a fine-grained context with the coarse features. We show that our method outperforms the state of the art for action detection on public datasets including Charades, with a significantly reduced compute and memory footprint. The code is available at https://github.com/kkahatapitiya/Coarse-Fine-Networks
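
The abstract does not spell out how Grid Pool computes its downsampling, so the following is only a minimal sketch of one plausible reading of "temporal downsampling guided by per-frame importance", not the authors' implementation. It assumes the importance scores are already given (in the paper they are learned), places output samples more densely where the cumulative importance grows faster, and linearly interpolates features at the resulting fractional frame positions. The function names `sampling_positions` and `downsample` are hypothetical and do not come from the released code.

```python
from bisect import bisect_left
from typing import List, Sequence


def sampling_positions(importance: Sequence[float], num_out: int) -> List[float]:
    """Return num_out fractional frame indices, denser where importance is high.

    Assumption: importance holds one positive score per input frame.
    """
    t = len(importance)
    if t < 2 or num_out < 2:
        raise ValueError("need at least 2 input frames and 2 output frames")
    if any(w <= 0 for w in importance):
        raise ValueError("importance scores must be positive")

    # Cumulative importance at each frame index (trapezoid rule), scaled to [0, 1].
    cum = [0.0]
    for i in range(1, t):
        cum.append(cum[-1] + 0.5 * (importance[i - 1] + importance[i]))
    cdf = [c / cum[-1] for c in cum]

    # Invert the cumulative curve at evenly spaced targets in [0, 1].
    positions = []
    for k in range(num_out):
        u = k / (num_out - 1)
        hi = min(max(bisect_left(cdf, u), 1), t - 1)
        lo = hi - 1
        frac = (u - cdf[lo]) / (cdf[hi] - cdf[lo])
        positions.append(lo + frac)
    return positions


def downsample(
    features: Sequence[Sequence[float]],
    importance: Sequence[float],
    num_out: int,
) -> List[List[float]]:
    """Linearly interpolate per-frame feature vectors at the sampled positions."""
    if len(features) != len(importance):
        raise ValueError("features and importance must have the same length")
    last = len(features) - 1
    out = []
    for p in sampling_positions(importance, num_out):
        lo = min(int(p), last)
        hi = min(lo + 1, last)
        frac = p - lo
        out.append(
            [(1 - frac) * a + frac * b for a, b in zip(features[lo], features[hi])]
        )
    return out
```

With uniform importance this reduces to ordinary evenly spaced temporal subsampling; with higher scores on later frames, the interior samples shift toward those frames. A learned layer would predict the scores from the input and would need the sampling step to be differentiable, which this plain-Python sketch does not attempt.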

updated: Thu Apr 01 2021 17:57:04 GMT+0000 (UTC)

published: Mon Mar 01 2021 20:48:01 GMT+0000 (UTC)

arXiv
