Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Zhenzhi Wang; Limin Wang; Tao Wu; Tianhao Li; Gangshan Wu

Temporal grounding aims to localize the video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline to a fused representation, with research focused on designing complicated prediction heads or fusion strategies. Instead, treating temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) that directly models the similarity between language queries and video moments in a joint embedding space. This metric-learning framework makes it possible to fully exploit negative samples in two new ways: constructing negative cross-modal pairs in a mutual matching scheme, and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of the two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising approach to temporal grounding, as it captures the essential cross-modal correlation in a joint embedding space. Code is available at https://github.com/MCG-NJU/MMN.
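
To make the idea of matching queries and moments in a joint embedding space more concrete, the following is a minimal sketch of a symmetric, InfoNCE-style contrastive loss. It is an illustration under stated assumptions, not the MMN implementation: the function names, the cosine similarity, the temperature value, and the use of in-batch pairs as negatives are all hypothetical choices made for this example. Each row of `moment_emb` is assumed to be the embedding of the ground-truth moment for the query in the same row of `query_emb`, with rows drawn from different videos so that the off-diagonal pairs act as cross-video negatives.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def mutual_matching_loss(moment_emb, query_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of matched (moment, query) pairs.

    Assumption for this sketch: moment_emb[i] and query_emb[i] form the only
    positive pair in row i; every other pairing in the batch is a negative.
    """
    v = l2_normalize(moment_emb)
    q = l2_normalize(query_emb)
    # sim[i, j] = cosine similarity between moment i and query j, scaled.
    sim = v @ q.T / temperature
    idx = np.arange(sim.shape[0])
    # Moment-to-query: each moment must pick its own query among all queries.
    loss_m2q = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Query-to-moment: each query must pick its own moment among all moments.
    loss_q2m = -log_softmax(sim, axis=0)[idx, idx].mean()
    return loss_m2q + loss_q2m


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    moments = rng.normal(size=(4, 8))
    queries = moments + 0.05 * rng.normal(size=(4, 8))  # roughly aligned pairs
    print("aligned loss:   ", mutual_matching_loss(moments, queries))
    print("mismatched loss:", mutual_matching_loss(moments, queries[::-1]))
```

The two loss terms mirror the two matching directions: one treats other queries as negatives for a moment, the other treats other moments as negatives for a query. Well-aligned pairs should yield a lower loss than mismatched ones.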

updated: Wed Dec 15 2021 08:12:17 GMT+0000 (UTC)

published: Fri Sep 10 2021 13:38:54 GMT+0000 (UTC)

arXiv
