Partially Relevant Video Retrieval

Jianfeng Dong; Xianke Chen; Minsong Zhang; Xun Yang; Shujie Chen; Xirong Li; Xun Wang

Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning-oriented datasets such as MSVD, MSR-VTT, and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed to a short duration, while the provided captions describe the gist of the video content well. Consequently, for a given video-caption pair, the video is assumed to be fully relevant to the caption. In reality, however, queries are not known a priori, so pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered partially relevant with respect to a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used for improving video corpus moment retrieval.
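The MIL view described in the abstract can be illustrated with a small sketch. This is not the MS-SL network: it only shows the idea of scoring a video as a bag of clips and a bag of frames, so that a video counts as relevant when at least one instance matches the query. Everything here is an assumption made for illustration: the function names, mean-pooling of consecutive frames into clips, max-pooling over instances, cosine similarity, and the hypothetical parameters `clip_len` and `alpha`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bag_similarity(query, bag):
    """MIL-style max-pooling: a bag scores as high as its best instance."""
    return max(cosine(query, instance) for instance in bag)


def make_clips(frames, clip_len):
    """Mean-pool consecutive frame features into clip features (assumed scheme)."""
    clips = []
    for start in range(0, len(frames), clip_len):
        chunk = frames[start:start + clip_len]
        clips.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return clips


def video_similarity(query, frames, clip_len=2, alpha=0.5):
    """Weighted sum of clip-scale and frame-scale bag similarities.

    alpha and clip_len are hypothetical knobs, not values from the paper.
    """
    clips = make_clips(frames, clip_len)
    clip_score = bag_similarity(query, clips)
    frame_score = bag_similarity(query, frames)
    return alpha * clip_score + (1 - alpha) * frame_score


def rank_videos(query, videos, **kwargs):
    """Return video ids sorted by descending similarity to the query."""
    scores = {vid: video_similarity(query, frames, **kwargs)
              for vid, frames in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D features: video "a" contains a moment aligned with the query,
# video "b" does not.
toy_videos = {
    "a": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
    "b": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
}
toy_query = [1.0, 0.0]
ranking = rank_videos(toy_query, toy_videos)
```

In this toy setting, video "a" is only partially relevant (half of its frames match the query), yet max-pooling over instances still lets it outrank "b", which is the behaviour the PRVR task asks for.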

updated: Fri Aug 26 2022 09:07:16 GMT+0000 (UTC)

published: Fri Aug 26 2022 09:07:16 GMT+0000 (UTC)

arXiv
