Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious and expensive, and it prevents scaling. In this work, we propose to avoid manual annotation and to generate a large-scale training dataset for video question answering using automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we show that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation, we introduce a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code and datasets will be made publicly available at https://antoyang.github.io/just-ask.html.
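
The contrastive training idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the code below assumes a softmax cross-entropy over dot-product scores in which the other answers in the batch act as negatives. The function names `contrastive_loss` and `predict_answer` are hypothetical, and plain Python lists stand in for the outputs of the video-question transformer and the answer transformer.

```python
import math


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over a batch (assumed form of the loss).

    Row i of vq_embeddings is a video-question embedding whose positive
    is row i of answer_embeddings; all other rows serve as negatives.
    """
    if not vq_embeddings or len(vq_embeddings) != len(answer_embeddings):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, vq in enumerate(vq_embeddings):
        # Dot-product score between this video-question pair and every answer.
        scores = [sum(x * y for x, y in zip(vq, a)) for a in answer_embeddings]
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(vq_embeddings)


def predict_answer(vq_embedding, answer_vocabulary):
    """Return the answer whose embedding scores highest against the input.

    answer_vocabulary maps answer strings to embeddings, so new answers
    can be added at test time without changing the model's output layer.
    """
    def score(item):
        return sum(x * y for x, y in zip(vq_embedding, item[1]))

    return max(answer_vocabulary.items(), key=score)[0]
```

Under these assumptions, scoring answers through their embeddings rather than through a fixed classification layer is what lets the answer set stay open: `predict_answer` ranks whatever candidate answers are supplied.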

updated: Tue Mar 30 2021 14:33:37 GMT+0000 (UTC)

published: Tue Dec 01 2020 12:59:20 GMT+0000 (UTC)

arXiv
