Towards Long-Form Video Understanding

Chao-Yuan Wu; Philipp Krähenbühl

Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.

updated: Mon Jun 21 2021 17:59:52 GMT+0000 (UTC)

published: Mon Jun 21 2021 17:59:52 GMT+0000 (UTC)

arXiv
