Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding

Daizong Liu; Xiaoye Qu; Pan Zhou

A key to temporal sentence grounding (TSG) lies in learning an effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single step. However, such single-step attention is insufficient in practice, since complicated inter- and intra-modal relations are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for the TSG task, which iteratively models interactions among inter- and intra-modal features over multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad the multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module that follows each attention module to refine the alignment knowledge. With such an iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between the vision and language domains step by step, progressively reasoning about the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than state-of-the-art methods.
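The padding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`padded_attention`, `iterative_alignment`), the scaled dot-product scoring, the residual update, and the fixed pad vectors are all assumptions made for illustration, and the calibration module is omitted. The sketch only shows how one extra pad slot gives a non-matched frame somewhere to place its attention, and how the two attention directions can be computed in parallel and repeated over several steps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def padded_attention(queries, keys, pad):
    """Attend each query over `keys` plus one extra pad slot.

    `pad` stands in for a learnable padding vector (assumption: in a
    trained model it would be a parameter; here it is a fixed value).
    Returns (attended, weights); weights[i][-1] is the attention mass
    that query i places on the pad slot.
    """
    padded = keys + [pad]
    scale = math.sqrt(len(pad))
    attended, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in padded])
        ctx = [sum(wi * k[d] for wi, k in zip(w, padded))
               for d in range(len(pad))]
        attended.append(ctx)
        weights.append(w)
    return attended, weights


def iterative_alignment(frames, words, frame_pad, word_pad, steps=3):
    """Repeat parallel frame-to-word and word-to-frame attention.

    Both directions read the same inputs at each step, then each
    modality is updated with a residual sum (an illustrative choice).
    """
    for _ in range(steps):
        f_ctx, _ = padded_attention(frames, words, word_pad)
        w_ctx, _ = padded_attention(words, frames, frame_pad)
        frames = [[a + b for a, b in zip(f, c)]
                  for f, c in zip(frames, f_ctx)]
        words = [[a + b for a, b in zip(w, c)]
                 for w, c in zip(words, w_ctx)]
    return frames, words
```

With toy inputs, a frame that scores poorly against every word puts most of its attention on the pad slot instead of being forced onto an unrelated word, which is the nowhere-to-attend situation the abstract refers to.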

updated: Tue Sep 14 2021 02:08:23 GMT+0000 (UTC)

published: Tue Sep 14 2021 02:08:23 GMT+0000 (UTC)

arXiv
