G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

Hongxiang Li; Meng Cao; Xuxin Cheng; Yaowei Li; Zhihong Zhu; Yuexian Zou

Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of the features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Video grounding poses two difficulties: (1) some visual entities co-exist in both the ground truth and other moments, i.e., semantic overlapping; and (2) only a few moments in a video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learns inconsistent video representations. Both characteristics make vanilla contrastive learning unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework via geodesic and game theory. We quantify the correlations among moments using geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose a semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment among similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
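
The two properties named in the abstract, alignment and uniformity, can be measured directly on a set of normalized features. The sketch below shows one common way to do so and is not the G2L method itself. The function names, the exponent `alpha`, the temperature `t`, and the toy 2-D features are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(positive_pairs, alpha=2):
    """Mean distance (raised to alpha) between features of positive pairs.

    Lower values mean similar samples sit closer together.
    """
    total = sum(sq_dist(a, b) ** (alpha / 2) for a, b in positive_pairs)
    return total / len(positive_pairs)


def uniformity(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct feature pairs.

    Lower values mean the features are spread more evenly on the hypersphere.
    """
    potentials = [math.exp(-t * sq_dist(a, b))
                  for a, b in combinations(features, 2)]
    return math.log(sum(potentials) / len(potentials))


# Toy 2-D features (hypothetical): one set spread around the circle,
# one set clustered near a single direction.
spread = [l2_normalize(v) for v in [(1, 0), (0, 1), (-1, 0), (0, -1)]]
clustered = [l2_normalize(v) for v in [(1, 0), (1, 0.1), (1, -0.1), (1, 0.05)]]

print("uniformity (spread):   ", uniformity(spread))
print("uniformity (clustered):", uniformity(clustered))
print("alignment of a close pair:", alignment([(clustered[0], clustered[1])]))
```

In this toy setup the spread features score lower (better) on uniformity than the clustered ones, while a pair of nearby features scores close to zero on alignment. Semantic overlapping and sparse annotation, as described above, are the kinds of conditions under which both measures can degrade for moment features.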

updated: Tue Nov 14 2023 06:03:35 GMT+0000 (UTC)

published: Wed Jul 26 2023 16:14:21 GMT+0000 (UTC)

arXiv
