Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning

Shuaicheng Li; Feng Zhang; Kunlin Yang; Lingbo Liu; Shinan Liu; Jun Hou; Shuai Yi


Video highlight detection is a crucial yet challenging problem that aims to identify the interesting moments in untrimmed videos. The key to this task lies in effective video representations that jointly pursue two goals, i.e., cross-modal representation learning and fine-grained feature discrimination. In this paper, we tackle these two challenges not only by enriching intra-modality and cross-modality relations for representation modeling but also by shaping the features in a discriminative manner. Our proposed method mainly leverages intra-modality encoding and cross-modality co-occurrence encoding for full representation modeling. Specifically, intra-modality encoding augments the modality-wise features and dampens the irrelevant modality via within-modality relation learning in both audio and visual signals. Meanwhile, cross-modality co-occurrence encoding focuses on the co-occurring inter-modality relations and selectively captures effective information across modalities. The multi-modal representation is further enhanced by global information abstracted from the local context. In addition, we enlarge the discriminative power of the feature embedding with a hard-pairs guided contrastive learning (HPCL) scheme. A hard-pairs sampling strategy is further employed to mine hard samples for improving feature discrimination in HPCL. Extensive experiments conducted on two benchmarks demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
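The abstract does not give the exact form of the HPCL loss or of the hard-pairs sampling strategy. As a rough illustration of the general idea only, the sketch below mines, for each segment embedding, its hardest positive (the least similar segment with the same label) and its hardest negative (the most similar segment with a different label), then applies a margin-based contrastive penalty. The function names, the margin value, the binary highlight labels, and the triplet-style loss are all assumptions made for this example, not the authors' formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def hard_pair_contrastive_loss(features, labels, margin=0.5):
    """Margin-based contrastive loss over mined hard pairs (illustrative only).

    features: (N, D) array of fused segment embeddings (hypothetical input).
    labels:   (N,) array, 1 = highlight segment, 0 = non-highlight segment.
    margin:   assumed hyperparameter, not taken from the paper.
    """
    feats = l2_normalize(np.asarray(features, dtype=np.float64))
    labels = np.asarray(labels)

    sim = feats @ feats.T  # pairwise cosine similarity, shape (N, N)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self  # same label, excluding the anchor itself
    neg_mask = ~same            # different label

    # Hardest positive: the same-label segment that is LEAST similar.
    hardest_pos = np.where(pos_mask, sim, np.inf).min(axis=1)
    # Hardest negative: the different-label segment that is MOST similar.
    hardest_neg = np.where(neg_mask, sim, -np.inf).max(axis=1)

    # Only anchors with at least one positive and one negative contribute.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0

    losses = np.maximum(0.0, margin + hardest_neg[valid] - hardest_pos[valid])
    return float(losses.mean())


# Toy data: the first set is well separated by label, the second is entangled.
separated = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
entangled = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
toy_labels = np.array([1, 1, 0, 0])

loss_separated = hard_pair_contrastive_loss(separated, toy_labels)
loss_entangled = hard_pair_contrastive_loss(entangled, toy_labels)
```

On the toy data, the entangled embeddings incur a larger loss than the well-separated ones, which is the behavior a hard-pair objective is meant to encourage: the pairs that are hardest to tell apart dominate the training signal.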

updated: Tue Jun 21 2022 07:29:37 GMT+0000 (UTC)

published: Tue Jun 21 2022 07:29:37 GMT+0000 (UTC)

arXiv

