Learning to Answer Visual Questions from Web Videos

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious and expensive, and it prevents scaling. In this work, we propose to avoid manual annotation and instead generate a large-scale training dataset for video question answering using automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on the MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data: we use our method to generate a dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation, we introduce a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available at https://antoyang.github.io/just-ask.html
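The contrastive training objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a video-question encoder and an answer encoder have already produced fixed-size embeddings, and that the other answers in the same batch serve as negatives. The function name `contrastive_loss`, the use of NumPy, the in-batch negative sampling, and the L2 normalization in the demo are assumptions made for illustration; they are not taken from the abstract and may differ from the authors' implementation.

```python
import numpy as np

def contrastive_loss(vq_emb, ans_emb):
    """Softmax cross-entropy over the answers in a batch (illustrative sketch).

    vq_emb:  (B, D) embeddings of video-question pairs (hypothetical encoder output).
    ans_emb: (B, D) embeddings of the answers; row i is the correct answer for row i.
    """
    logits = vq_emb @ ans_emb.T                           # (B, B) dot-product scores
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct answer for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy demo with random unit-norm vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
vq = rng.normal(size=(4, 8))
vq /= np.linalg.norm(vq, axis=1, keepdims=True)
answers = 5.0 * vq                                        # answers aligned with their questions

matched = contrastive_loss(vq, answers)                   # correct pairing
shuffled = contrastive_loss(vq, answers[::-1])            # every pair mismatched
print(matched < shuffled)
```

Because answers are scored through an embedding rather than a fixed classification layer, an objective of this kind can rank arbitrary answer strings, which is one way to read the abstract's remark about handling an open vocabulary of answers.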

updated: Tue May 10 2022 16:34:26 GMT+0000 (UTC)

published: Tue May 10 2022 16:34:26 GMT+0000 (UTC)

arXiv
