Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories

Xitong Yang; Haoqi Fan; Lorenzo Torresani; Larry Davis; Heng Wang

The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. We argue that a single clip may not have enough temporal coverage to exhibit the label to recognize, since video datasets are often weakly labeled with categorical information but without dense temporal annotations. Furthermore, optimizing the model over brief clips impedes its ability to learn long-term temporal dependencies. To overcome these limitations, we introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. This enables the learning of long-range dependencies beyond a single clip. We explore different design choices for the collaborative memory to ease the optimization difficulties. Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead. Through extensive experiments, we demonstrate that our framework generalizes to different video architectures and tasks, outperforming the state of the art on both action recognition (e.g., Kinetics-400 & 700, Charades, Something-Something-V1) and action detection (e.g., AVA v2.1 & v2.2).
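The abstract does not spell out how the collaborative memory is built, so the following is only an illustrative sketch of the general idea: sample several clips spread across a video, pool their features into one shared summary, and let that summary modulate each clip before averaging the clip scores into a video-level prediction. Mean pooling, the sigmoid gate, and every function name here (`sample_clip_starts`, `pool_memory`, `gate_clips`, `video_prediction`) are assumptions made for illustration, not the authors' design.

```python
import math
import random


def sample_clip_starts(num_frames, clip_len, num_clips, rng):
    """Pick one random clip start inside each of num_clips equal segments,
    so the sampled clips together cover the whole video."""
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    seg = (last_start + 1) / num_clips
    starts = []
    for i in range(num_clips):
        lo = int(i * seg)
        hi = max(lo, int((i + 1) * seg) - 1)
        starts.append(rng.randint(lo, hi))  # randint is inclusive on both ends
    return starts


def pool_memory(clip_feats):
    """Assumed memory: the per-channel mean over all sampled clips."""
    n = len(clip_feats)
    return [sum(channel) / n for channel in zip(*clip_feats)]


def gate_clips(clip_feats, memory):
    """Assumed interaction: scale each clip's channels by sigmoid(memory)."""
    gate = [1.0 / (1.0 + math.exp(-m)) for m in memory]
    return [[x * g for x, g in zip(feat, gate)] for feat in clip_feats]


def video_prediction(clip_scores):
    """Average per-clip class scores into a single video-level score vector."""
    n = len(clip_scores)
    return [sum(cls) / n for cls in zip(*clip_scores)]


if __name__ == "__main__":
    rng = random.Random(0)
    starts = sample_clip_starts(num_frames=300, clip_len=16, num_clips=4, rng=rng)
    # Stand-in features: a real model would compute these from the clip frames.
    feats = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in starts]
    enhanced = gate_clips(feats, pool_memory(feats))
    print(starts, len(enhanced), len(enhanced[0]))
```

In a trainable version, the pooling and gating would be differentiable layers and the loss would be applied to the video-level prediction, which is what allows gradients to flow through all sampled clips in one iteration.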

updated: Fri Apr 02 2021 18:59:09 GMT+0000 (UTC)

published: Fri Apr 02 2021 18:59:09 GMT+0000 (UTC)

arXiv
