Video Moment Retrieval from Text Queries via Single Frame Annotation

Ran Cui; Tianwen Qian; Pai Peng; Elena Daskalaki; Jingjing Chen; Xiaowei Guo; Huyang Sun; Yu-Gang Jiang

Video moment retrieval aims to find the start and end timestamps of a moment (part of a video) described by a given natural language query. Fully supervised methods need complete temporal boundary annotations to achieve promising results, which is costly because the annotator must watch the whole moment. Weakly supervised methods rely only on paired videos and queries, but their performance is relatively poor. In this paper, we look more closely at the annotation process and propose a new paradigm called "glance annotation". This paradigm requires the timestamp of only a single random frame, which we refer to as a "glance", within the temporal boundary of the fully supervised counterpart. We argue this is beneficial because, compared with weak supervision, it adds trivial cost yet offers greater potential for performance. Under the glance annotation setting, we propose a method named Video moment retrieval via Glance Annotation (ViGA), based on contrastive learning. ViGA cuts the input video into clips and contrasts clips with queries, assigning glance-guided Gaussian-distributed weights to all clips. Our extensive experiments indicate that ViGA outperforms state-of-the-art weakly supervised methods by a large margin and is even comparable to fully supervised methods in some cases.
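To make the idea of glance-guided Gaussian weights concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give the exact formulation, so the function names, the fixed clip length, the unnormalized Gaussian over clip indices, and the `sigma` parameter are all assumptions chosen for illustration.

```python
import math
import random
from typing import List


def sample_glance(start_s: float, end_s: float, rng: random.Random) -> float:
    """Simulate a glance: one random timestamp inside a fully annotated moment.

    Hypothetical helper; a human annotator would supply this timestamp directly.
    """
    return rng.uniform(start_s, end_s)


def glance_to_clip(glance_s: float, clip_len_s: float, num_clips: int) -> int:
    """Index of the fixed-length clip containing the glance (assumed clip layout)."""
    return min(int(glance_s // clip_len_s), num_clips - 1)


def gaussian_clip_weights(num_clips: int, glance_clip: int, sigma: float) -> List[float]:
    """Unnormalized Gaussian weights over clip indices, peaking at the glance clip.

    Assumption: weight_i = exp(-(i - glance_clip)^2 / (2 * sigma^2)).
    """
    return [
        math.exp(-((i - glance_clip) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_clips)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed example: a 30 s video cut into 15 clips of 2 s; the moment spans 12 s to 20 s.
    glance = sample_glance(12.0, 20.0, rng)
    clip = glance_to_clip(glance, clip_len_s=2.0, num_clips=15)
    weights = gaussian_clip_weights(15, clip, sigma=2.0)
    print(clip, [round(w, 3) for w in weights])
```

In a contrastive setup of the kind the abstract describes, weights like these could scale each clip's contribution when contrasting clips against the query, so that clips near the glance count most and distant clips count little. How ViGA actually uses the weights is not stated in the abstract.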

updated: Sat Jun 18 2022 12:56:41 GMT+0000 (UTC)

published: Wed Apr 20 2022 11:59:17 GMT+0000 (UTC)

arXiv

