Language-free Training for Zero-shot Video Grounding

Dahye Kim; Jungin Park; Jiyoung Lee; Seongheon Park; Kwanghoon Sohn

Given an untrimmed video and a language query describing a specific temporal moment in the video, video grounding aims to localize the corresponding time interval by jointly understanding the text and the video. One of the most challenging issues is annotation collection, which is extremely time-consuming and costly: it requires video captions in natural language together with their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which trains a network using only video data, without any annotation. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without forcing the generation of pseudo text queries in natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and treating the visual feature selected by our method within that interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the effectiveness of our language-free training framework, which outperforms the existing zero-shot video grounding method and even several weakly-supervised approaches by large margins on two standard datasets.
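
The core idea in the abstract, picking a temporal interval as a pseudo ground truth and using a visual feature from that interval in place of a text feature, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_pseudo_sample`, the uniform random interval sampling, and the mean pooling are all assumptions made for illustration, and the abstract does not specify how the interval or the feature is actually selected. The random vectors stand in for per-frame embeddings that would, in practice, come from a CLIP image encoder.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (CLIP features are typically compared by cosine similarity)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_pseudo_sample(frame_features, rng, min_len=2):
    """Build one (pseudo query feature, pseudo interval) training pair from video frames alone.

    frame_features: list of per-frame embedding vectors (assumed to live in CLIP's joint space).
    rng: a random.Random instance, passed in for reproducibility.
    min_len: hypothetical minimum interval length in frames.
    """
    num_frames = len(frame_features)
    if num_frames < min_len:
        raise ValueError("video is shorter than the minimum interval length")

    # Assumption: sample the interval uniformly at random; the paper may use another strategy.
    length = rng.randint(min_len, num_frames)
    start = rng.randint(0, num_frames - length)
    end = start + length  # exclusive frame index

    # Assumption: mean-pool the frames inside the interval as a stand-in for the
    # paper's feature selection step, then normalize.
    segment = frame_features[start:end]
    dim = len(segment[0])
    pooled = [sum(frame[d] for frame in segment) / length for d in range(dim)]
    pseudo_query = l2_normalize(pooled)

    # The interval, normalized to [0, 1], serves as the pseudo ground-truth label.
    interval = (start / num_frames, end / num_frames)
    return pseudo_query, interval


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for CLIP frame embeddings: 16 frames, 8 dimensions each.
    frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
    query_feature, (t_start, t_end) = make_pseudo_sample(frames, rng)
    print(len(query_feature), t_start, t_end)
```

A grounding model trained on such pairs would take the pseudo query feature where a text embedding would normally go. At test time a real CLIP text embedding could be substituted, which relies on the alignment between CLIP's visual and language spaces that the abstract mentions.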

updated: Mon Oct 24 2022 06:55:29 GMT+0000 (UTC)

published: Mon Oct 24 2022 06:55:29 GMT+0000 (UTC)

arXiv
