Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

Ziyue Wu; Junyu Gao; Shucheng Huang; Changsheng Xu

Grounding temporal video segments described by natural language queries, both effectively and efficiently, is a crucial capability in vision-and-language research. In this paper, we address the fast video temporal grounding (FVTG) task, which aims to localize the target segment at high speed with favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve grounding performance, but these modules become a bottleneck at test time. Although several common-space-based methods are fast at inference, they can hardly capture comprehensive and explicit relations between the visual and textual modalities. To address this speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment (CCA) framework, which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, commonsense concepts are explored and exploited by extracting structural semantic information from a language corpus. Then, a commonsense-aware interaction module is designed to obtain bridged visual and text features by utilizing the learned commonsense concepts. Finally, to preserve the original semantic information of the textual queries, a cross-modal complementary common space is optimized to obtain matching scores for performing FVTG. Extensive results on two challenging benchmarks show that our CCA method performs favorably against state-of-the-art methods while running at high speed. Our code is available at https://github.com/ZiyueWu59/CCA.
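
As a rough illustration of why common-space methods can be fast at inference, the sketch below scores a query against candidate moments with a plain cosine similarity. It is not the CCA model: the commonsense concepts and the interaction module are omitted. The function names (`l2_normalize`, `cosine_scores`, `ground_query`) and the toy embeddings are hypothetical. It assumes moment embeddings have already been computed, so each query needs only one similarity computation per candidate moment.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query_emb, moment_embs):
    """Cosine similarity between one query embedding and each moment embedding."""
    q = l2_normalize(query_emb)
    return [sum(a * b for a, b in zip(q, l2_normalize(m))) for m in moment_embs]


def ground_query(query_emb, moment_embs, moment_spans):
    """Return the (start, end) span whose embedding best matches the query, and its score."""
    scores = cosine_scores(query_emb, moment_embs)
    best = max(range(len(scores)), key=scores.__getitem__)
    return moment_spans[best], scores[best]


# Toy, hand-made embeddings (hypothetical values, not taken from the paper).
moment_spans = [(0.0, 5.0), (5.0, 10.0), (0.0, 10.0)]  # candidate segments in seconds
moment_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_emb = [0.1, 0.9, 0.0]

span, score = ground_query(query_emb, moment_embs, moment_spans)
print(span, round(score, 3))
```

In this toy setup the expensive work (encoding video moments) happens once, ahead of time, and answering a query reduces to cheap vector comparisons. The abstract contrasts this with cross-modal interaction modules that must run at test time.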

updated: Tue Apr 12 2022 11:55:17 GMT+0000 (UTC)

published: Mon Apr 04 2022 13:07:05 GMT+0000 (UTC)

arXiv
