Hierarchical Local-Global Transformer for Temporal Sentence Grounding

Xiang Fang; Daizong Liu; Pan Zhou; Zichuan Xu; Ruixuan Li

This paper studies temporal sentence grounding (TSG), a multimedia task that aims to localize the specific segment of an untrimmed video that corresponds to a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, several transformer-based approaches have been proposed to model the fine-grained semantic alignment between video and query efficiently and effectively. Although these methods achieve notable performance, they treat the frames of the video and the words of the query uniformly as transformer input for correlation, and so fail to capture the different levels of granularity, each of which carries distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy information and models the interactions between different levels of granularity and between different modalities in order to learn more fine-grained multi-modal representations. Specifically, we first split the video and the query into individual clips and phrases, and learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel transformer decoder that integrates the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that the proposed HLGT achieves new state-of-the-art performance.
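The abstract does not give the form of the cross-modal cycle-consistency loss, so the following is only a minimal sketch of one generic cycle-consistency formulation, not the authors' implementation. It assumes that clip and phrase features are plain lists of floats, that clips are formed by grouping consecutive frames, and that the "cycle" goes from a clip to its soft nearest phrase and back to the clips. All function names (`split_into_clips`, `mean_pool`, `soft_nearest`, `cycle_consistency_loss`) are hypothetical and chosen for illustration.

```python
import math


def split_into_clips(frames, clip_len):
    """Group consecutive frames into clips of clip_len (last clip may be shorter)."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def mean_pool(vectors):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_nearest(query_vec, keys):
    """Similarity-weighted average of keys (a soft nearest neighbour)."""
    weights = softmax([dot(query_vec, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def cycle_consistency_loss(clips, phrases):
    """Assumed formulation: clip -> soft-nearest phrase -> back to clips.

    The loss is the mean cross-entropy that the cycle starting at clip i
    returns to clip i.
    """
    loss = 0.0
    for i, clip in enumerate(clips):
        phrase_hat = soft_nearest(clip, phrases)
        back = softmax([dot(phrase_hat, c) for c in clips])
        loss += -math.log(back[i])
    return loss / len(clips)


# Toy usage: six 2-D frame features grouped into three clips, two phrase features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.4, 0.6], [0.0, 1.0], [0.1, 0.9]]
clip_features = [mean_pool(c) for c in split_into_clips(frames, 2)]
phrase_features = [[1.0, 0.0], [0.0, 1.0]]
toy_loss = cycle_consistency_loss(clip_features, phrase_features)
```

In this sketch the loss is small when each clip maps to a phrase that maps back to the same clip, which is the alignment behaviour the abstract attributes to the cycle-consistency objective. A real model would compute it on learned transformer features with an automatic-differentiation framework.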

updated: Wed Aug 31 2022 14:16:56 GMT+0000 (UTC)

published: Wed Aug 31 2022 14:16:56 GMT+0000 (UTC)

arXiv
