VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Linjie Li; Jie Lei; Zhe Gan; Licheng Yu; Yen-Chun Chen; Rohit Pillai; Yu Cheng; Luowei Zhou; Xin Eric Wang; William Yang Wang; Tamara Lee Berg; Mohit Bansal; Jingjing Liu; Lijuan Wang; Zicheng Liu

Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets for a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for further research on advanced VidL models. VALUE is available at https://value-benchmark.github.io/.

updated: Wed Aug 18 2021 21:55:27 GMT+0000 (UTC)

published: Tue Jun 08 2021 18:34:21 GMT+0000 (UTC)

arXiv
