GEM: A General Evaluation Benchmark for Multimodal Tasks

Lin Su; Nan Duan; Edward Cui; Lei Ji; Chenfei Wu; Huaishao Luo; Yongfei Liu; Ming Zhong; Taroon Bharti; Arun Sacheti


In this paper, we present GEM, a General Evaluation benchmark for Multimodal tasks. Unlike existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME, which mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark consisting of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but is also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.

updated: Fri Jun 18 2021 03:14:13 GMT+0000 (UTC)

published: Fri Jun 18 2021 03:14:13 GMT+0000 (UTC)

arXiv
