On Modality Bias in the TVQA Dataset

Thomas Winterbottom; Sarah Xiao; Alistair McLean; Noura Al Moubayed

TVQA is a large-scale video question answering (video-QA) dataset based on popular TV shows. The questions were specifically designed to require "both vision and language understanding to answer". In this work, we demonstrate an inherent bias in the dataset towards the textual subtitle modality. We infer this bias both directly and indirectly, notably finding that models trained with subtitles learn, on average, to suppress the contribution of video features. Our results demonstrate that models trained on only the visual information can answer ~45% of the questions, while models using only the subtitles achieve ~68%. We find that a bilinear-pooling-based joint representation of modalities damages model performance by 9%, implying a reliance on modality-specific information. We also show that TVQA fails to benefit from the RUBi modality bias reduction technique popularised in VQA. By simply improving text processing using BERT embeddings with the simple model first proposed for TVQA, we achieve state-of-the-art results (72.13%) compared to the highly complex STAGE model (70.50%). We recommend a multimodal evaluation framework that can highlight biases in models and isolate visually and textually reliant subsets of data. Using this framework, we propose subsets of TVQA that respond exclusively to either or both modalities in order to facilitate multimodal modelling as TVQA originally intended.
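The idea of isolating modality-reliant subsets can be illustrated with a minimal sketch. It assumes per-question correctness is already available for one video-only model and one subtitle-only model; the function name `split_by_modality`, the subset labels, and the toy question ids are hypothetical, and the procedure used in the paper itself may differ (for example, by combining several models).

```python
from typing import Dict, Set


def split_by_modality(
    video_correct: Dict[str, bool],
    subtitle_correct: Dict[str, bool],
) -> Dict[str, Set[str]]:
    """Partition question ids by which unimodal model answered correctly.

    Hypothetical helper: both inputs map a question id to whether the
    corresponding single-modality model answered that question correctly.
    """
    subsets: Dict[str, Set[str]] = {
        "video_only": set(),
        "subtitle_only": set(),
        "both": set(),
        "neither": set(),
    }
    # Only questions evaluated by both models can be assigned to a subset.
    for qid in video_correct.keys() & subtitle_correct.keys():
        v, s = video_correct[qid], subtitle_correct[qid]
        if v and s:
            subsets["both"].add(qid)
        elif v:
            subsets["video_only"].add(qid)
        elif s:
            subsets["subtitle_only"].add(qid)
        else:
            subsets["neither"].add(qid)
    return subsets


# Toy data (made up for illustration, not taken from TVQA).
video = {"q1": True, "q2": False, "q3": True, "q4": False}
subs = {"q1": True, "q2": True, "q3": False, "q4": False}
result = split_by_modality(video, subs)
```

On this toy input, `q3` lands in the video-reliant subset and `q2` in the subtitle-reliant one. Evaluating a multimodal model separately on each subset is one way such a partition could expose which modality the model leans on.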

updated: Fri Dec 18 2020 13:06:23 GMT+0000 (UTC)

published: Fri Dec 18 2020 13:06:23 GMT+0000 (UTC)

arXiv
