VideoGLUE: Video General Understanding Evaluation of Foundation Models

Liangzhe Yuan; Nitesh Bharadwaj Gundavarapu; Long Zhao; Hao Zhou; Yin Cui; Lu Jiang; Xuan Yang; Menglin Jia; Tobias Weyand; Luke Friedman; Mikhail Sirotenko; Huisheng Wang; Florian Schroff; Hartwig Adam; Ming-Hsuan Yang; Ting Liu; Boqing Gong

We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding videos that contain more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs.

updated: Thu Jul 06 2023 17:47:52 GMT+0000 (UTC)

published: Thu Jul 06 2023 17:47:52 GMT+0000 (UTC)

arXiv

