Read, Look or Listen? What's Needed for Solving a Multimodal Dataset

Netta Madvil; Yonatan Bitton; Roy Schwartz

The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We use our annotations to analyze the MERLOT Reserve model, finding that it struggles more with image-based questions than with text- and audio-based ones, and that it also struggles with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.
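
As a rough illustration of the kind of statistic the abstract reports, the sketch below counts how many questions are answerable from a single modality and how many admit several different single-modality strategies. The annotation format, the modality names (`video`, `audio`, `subtitles`), the function `summarize`, and the toy data are all assumptions made for this example; they are not the authors' annotation scheme or results.

```python
from typing import Dict, FrozenSet, List

# Hypothetical annotation format: each question id maps to a list of
# "strategies", where a strategy is the set of modalities that is
# sufficient to answer the question.
Strategy = FrozenSet[str]


def summarize(annotations: Dict[str, List[Strategy]]) -> Dict[str, float]:
    """Return the share of questions solvable with one modality, and the
    share solvable by at least two different single-modality strategies."""
    total = len(annotations)
    if total == 0:
        raise ValueError("annotations must not be empty")

    single = 0        # questions with at least one single-modality strategy
    multi_single = 0  # questions with two or more such strategies
    for strategies in annotations.values():
        single_strategies = {s for s in strategies if len(s) == 1}
        if single_strategies:
            single += 1
        if len(single_strategies) >= 2:
            multi_single += 1

    return {
        "single_modality": single / total,
        "several_single_modality": multi_single / total,
    }


# Toy data, invented purely for illustration.
toy = {
    "q1": [frozenset({"video"}), frozenset({"audio"})],
    "q2": [frozenset({"subtitles"})],
    "q3": [frozenset({"video", "audio"})],  # needs both modalities together
    "q4": [frozenset({"video"}), frozenset({"subtitles"}), frozenset({"audio"})],
}

stats = summarize(toy)
print(stats)
```

Under this representation, a question such as `q3`, whose only strategy combines two modalities, is the kind of instance a test set that requires multiple modalities would consist of.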

updated: Thu Jul 06 2023 08:02:45 GMT+0000 (UTC)

published: Thu Jul 06 2023 08:02:45 GMT+0000 (UTC)

arXiv
