VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Linjie Li; Jie Lei; Zhe Gan; Licheng Yu; Yen-Chun Chen; Rohit Pillai; Yu Cheng; Luowei Zhou; Xin Eric Wang; William Yang Wang; Tamara Lee Berg; Mohit Bansal; Jingjing Liu; Lijuan Wang; Zicheng Liu

Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets for a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future research on advanced VidL models. VALUE is available at https://value-leaderboard.github.io/.

updated: Tue Jun 08 2021 18:34:21 GMT+0000 (UTC)

published: Tue Jun 08 2021 18:34:21 GMT+0000 (UTC)

arXiv
