Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair

Sho Maeoki; Yusuke Mukuta; Tatsuya Harada

This paper addresses text-based video moment retrieval from a corpus of videos. Models for this task are trained on datasets of paired texts and moments to learn the correct correspondences. In typical training methods, ground-truth text-moment pairs are used as positive pairs, whereas all other pairs are regarded as negative. However, some text-moment pairs outside the ground truth should also be regarded as positive: one text annotation can be positive for many video moments, and conversely, one video moment can correspond to many text annotations. Thus, there are many-to-many correspondences between text annotations and video moments. Based on these correspondences, we can form potentially relevant pairs, which are not given as ground truth yet are not negative; effectively incorporating such pairs into training can improve retrieval performance. A text query should describe what is happening in a video moment. Hence, different video moments annotated with similar texts that contain a similar action are likely to depict a similar action, so these pairs can be considered potentially relevant. We propose a novel training method that takes advantage of potentially relevant pairs, which are detected through linguistic analysis of the text annotations. Experiments on two benchmark datasets show that our method improves retrieval performance both quantitatively and qualitatively.
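
The idea of potentially relevant pairs can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the linguistic analysis or how the pairs enter the loss. The sketch assumes a hand-written action vocabulary (`ACTION_WORDS`) in place of real linguistic analysis, uses made-up annotations, and simply excludes potentially relevant pairs from the negative set, which is only one possible way to use them in training.

```python
from itertools import combinations

# Hypothetical toy annotations: (moment_id, text query). Not from any dataset.
ANNOTATIONS = [
    ("m0", "a person opens the door"),
    ("m1", "the man opens a door and walks in"),
    ("m2", "a woman pours water into a glass"),
]

# Assumption: a tiny action vocabulary stands in for real linguistic analysis.
ACTION_WORDS = {"opens", "pours", "walks"}


def actions(text):
    """Return the set of action words found in a text annotation."""
    return {tok for tok in text.lower().split() if tok in ACTION_WORDS}


def potentially_relevant_pairs(annotations):
    """Return (text_index, moment_index) pairs whose annotations share an action.

    Ground-truth pairs (i, i) are never included; only cross pairs are.
    """
    pairs = set()
    indexed = list(enumerate(annotations))
    for (i, (_, text_i)), (j, (_, text_j)) in combinations(indexed, 2):
        if actions(text_i) & actions(text_j):
            pairs.add((i, j))  # text i may also describe moment j
            pairs.add((j, i))  # and text j may also describe moment i
    return pairs


def negative_mask(annotations):
    """mask[i][j] is True when moment j may be used as a negative for text i."""
    n = len(annotations)
    relevant = potentially_relevant_pairs(annotations)
    return [
        [i != j and (i, j) not in relevant for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    print(sorted(potentially_relevant_pairs(ANNOTATIONS)))
    for row in negative_mask(ANNOTATIONS):
        print(row)
```

In this toy example the first two annotations share the action "opens", so their cross pairs are treated as potentially relevant and are no longer sampled as negatives, while the third moment remains a valid negative for both.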

updated: Fri Jun 25 2021 11:25:18 GMT+0000 (UTC)

published: Fri Jun 25 2021 11:25:18 GMT+0000 (UTC)

arXiv
