Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

Modern approaches to visual question answering require large annotated datasets for training. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and to learn video question answering (VideoQA) from millions of readily-available narrated videos. We propose to automatically generate question-answer pairs from transcribed video narrations leveraging a state-of-the-art text transformer pipeline and obtain a new large-scale VideoQA training dataset. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer embedding. We evaluate our model on the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that finetuning our model on target datasets significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA and ActivityNet-QA. Finally, for a detailed evaluation we introduce a new manually annotated VideoQA dataset with reduced language biases and high quality annotations. Our code and datasets will be made publicly available at https://www.di.ens.fr/willow/research/just-ask/ .
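The contrastive objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an in-batch softmax formulation in which the i-th video-question embedding should score highest against the i-th answer embedding, with a plain dot product as the similarity. The function names `contrastive_loss` and `predict`, and the use of plain Python lists in place of transformer outputs, are hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(vq_embs, ans_embs):
    """In-batch softmax contrastive loss (illustrative assumption).

    vq_embs[i] is a video-question embedding and ans_embs[i] is the
    embedding of its correct answer; the other answers in the batch
    act as negatives.
    """
    total = 0.0
    for i, q in enumerate(vq_embs):
        scores = [dot(q, a) for a in ans_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the matching answer.
        total += log_z - scores[i]
    return total / len(vq_embs)


def predict(q, vocab_embs):
    """Return the index of the highest-scoring answer embedding."""
    scores = [dot(q, a) for a in vocab_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because answers are scored by comparing embeddings rather than by a fixed classification layer, a formulation of this kind can rank any answer that can be embedded, which is one way to cope with an open answer vocabulary.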

updated: Tue Dec 01 2020 12:59:20 GMT+0000 (UTC)

published: Tue Dec 01 2020 12:59:20 GMT+0000 (UTC)

arXiv
