Reducing the Vision and Language Bias for Temporal Sentence Grounding

Daizong Liu; Xiaoye Qu; Wei Hu

Temporal sentence grounding (TSG) is an important yet challenging task in multimedia information retrieval. Although previous TSG methods have achieved decent performance, they tend to capture the selection biases of video-query pairs that appear frequently in the dataset rather than demonstrate robust multimodal reasoning abilities, especially for pairs that appear rarely. In this paper, we study this issue of selection bias and propose a Debiasing-TSG (D-TSG) model that filters and removes the negative biases in both the vision and language modalities to enhance the model's generalization ability. Specifically, we propose to alleviate the issue from two perspectives: 1) Feature distillation. We build a multi-modal debiasing branch to first capture the vision and language biases, and then apply a bias identification module to explicitly recognize the true negative biases and remove them from the benign multi-modal representations. 2) Contrastive sample generation. We construct two types of negative samples to force the model to accurately learn the aligned multi-modal semantics and perform complete semantic reasoning. We apply the proposed model to both commonly and rarely appearing TSG cases, and demonstrate its effectiveness by achieving state-of-the-art performance on three benchmark datasets (ActivityNet Caption, TACoS, and Charades-STA).
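
The abstract does not say how the two types of negative samples are built or which loss they feed into. As a rough illustration of the general idea only, the sketch below uses a margin-based hinge loss over cosine similarities, and it assumes (hypothetically) that one negative type pairs the video with a mismatched query and the other pairs the query with a mismatched video. The function names, the toy 2-D feature vectors, and the margin value are all illustrative assumptions, not the D-TSG formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def margin_contrastive_loss(video, query, negatives, margin=0.2):
    """Hinge loss: the aligned pair should outscore each negative pair by `margin`.

    `negatives` is a list of (video_feature, query_feature) pairs.
    This is a generic contrastive objective, assumed here for illustration.
    """
    pos = cosine(video, query)
    return sum(max(0.0, margin - pos + cosine(nv, nq)) for nv, nq in negatives)


# Toy features (hypothetical): an aligned video-query pair.
video = [1.0, 0.0]
query = [1.0, 0.0]

# Two assumed kinds of negatives.
negatives = [
    (video, [0.0, 1.0]),  # same video, mismatched query
    ([0.0, 1.0], query),  # mismatched video, same query
]

loss = margin_contrastive_loss(video, query, negatives)
print(loss)
```

With these toy vectors the aligned pair already beats both negatives by more than the margin, so the loss is zero; swapping the roles so that a mismatched pair is treated as the positive yields a positive loss, which is the signal that would push a model toward aligned multi-modal semantics.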

updated: Wed Jul 27 2022 11:18:45 GMT+0000 (UTC)

published: Wed Jul 27 2022 11:18:45 GMT+0000 (UTC)

arXiv
