UniVTG: Towards Unified Video-Language Temporal Grounding

Kevin Qinghong Lin; Pengchuan Zhang; Joya Chen; Shraman Pramanick; Difei Gao; Alex Jinpeng Wang; Rui Yan; Mike Zheng Shou

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels. In this paper, we propose UniVTG, which unifies the diverse VTG labels and tasks along three directions. Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The code is available at https://github.com/showlab/UniVTG.
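The abstract does not spell out the unified formulation, so the following is only a hypothetical illustration of how a time-interval label and a worthiness curve might be expressed in one shared per-clip form. The names `ClipLabel` and `interval_to_clip_labels`, the fixed clip length, and the choice of fields are assumptions made for this sketch and are not taken from the UniVTG code base.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClipLabel:
    """Hypothetical per-clip label shared by several VTG tasks."""
    foreground: bool               # does the clip center fall inside the target interval?
    offsets: Tuple[float, float]   # seconds from clip center to interval start / end
    saliency: float                # worthiness score for the clip


def interval_to_clip_labels(
    duration: float,
    clip_len: float,
    interval: Tuple[float, float],
    saliency: Optional[List[float]] = None,
) -> List[ClipLabel]:
    """Convert one (start, end) interval into per-clip labels.

    If no worthiness curve is given, clips inside the interval get
    saliency 1.0 and all other clips get 0.0 (an assumption of this sketch).
    """
    start, end = interval
    num_clips = int(duration // clip_len)
    labels = []
    for i in range(num_clips):
        center = (i + 0.5) * clip_len
        inside = start <= center <= end
        offsets = (center - start, end - center) if inside else (0.0, 0.0)
        if saliency is not None:
            score = saliency[i]
        else:
            score = 1.0 if inside else 0.0
        labels.append(ClipLabel(inside, offsets, score))
    return labels


# A 10-second video split into 2-second clips, with a target moment at 2s-6s.
labels = interval_to_clip_labels(duration=10.0, clip_len=2.0, interval=(2.0, 6.0))
```

With a representation along these lines, a moment-retrieval label fills the foreground and offset fields, while a highlight-detection curve fills the saliency field, so a single model could in principle be trained on both kinds of supervision.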

updated: Mon Jul 31 2023 14:34:49 GMT+0000 (UTC)

published: Mon Jul 31 2023 14:34:49 GMT+0000 (UTC)

arXiv
