MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu; Haodong Duan; Yuanhan Zhang; Bo Li; Songyang Zhang; Wangbo Zhao; Yike Yuan; Jiaqi Wang; Conghui He; Ziwei Liu; Kai Chen; Dahua Lin

Large vision-language models have recently achieved remarkable progress, exhibiting strong perception and reasoning abilities concerning visual information. However, how to effectively evaluate these models remains a major obstacle that hinders future model development. Traditional benchmarks such as VQAv2 or COCO Caption provide quantitative performance measurements, but they lack fine-grained ability assessment and rely on non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and exhibit significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline consisting primarily of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT to convert free-form predictions into pre-defined choices, thereby enabling a more robust evaluation of the model's predictions. MMBench is a systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will help the research community better evaluate their models and encourage future advances in this domain. Project page: https://opencompass.org.cn/mmbench.
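The abstract names CircularEval without spelling out the procedure. As a rough sketch of the idea, assuming the description in the full paper (each multiple-choice question is asked once per rotation of its choices, and the model gets credit only if it is correct under every rotation), the logic might look like the following. The names `circular_shifts`, `circular_eval`, and the `predict` callback are hypothetical and not taken from the MMBench code base, and the ChatGPT step that maps free-form output to a choice label is not modeled here; `predict` is assumed to return a single letter already.

```python
from typing import Callable, List, Sequence

LABELS = "ABCDEFGH"  # assumed upper bound on the number of choices


def circular_shifts(choices: Sequence[str]) -> List[List[str]]:
    """Return every rotation of the choice list, starting with the original order."""
    return [list(choices[i:]) + list(choices[:i]) for i in range(len(choices))]


def circular_eval(
    question: str,
    choices: Sequence[str],
    answer: str,
    predict: Callable[[str, List[str]], str],
) -> bool:
    """Return True only if `predict` selects `answer` under every rotation.

    `predict` is a hypothetical stand-in for the model: it receives the
    question and the (rotated) choices and returns a label such as "A".
    """
    for shifted in circular_shifts(choices):
        label = predict(question, shifted).strip().upper()
        # Reject anything that is not a single valid label for this question.
        if len(label) != 1 or label not in LABELS[: len(shifted)]:
            return False
        if shifted[LABELS.index(label)] != answer:
            return False
    return True


if __name__ == "__main__":
    # A toy predictor that always answers "A" is right only when the
    # correct choice happens to sit in the first slot.
    def always_a(question: str, choices: List[str]) -> str:
        return "A"

    print(circular_eval("Which animal is shown?", ["cat", "dog", "bird"], "cat", always_a))
```

Under this scheme a model that always picks the same position fails as soon as the correct choice rotates away from that position, which is the kind of robustness the abstract refers to.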

updated: Mon Apr 29 2024 15:21:19 GMT+0000 (UTC)

published: Wed Jul 12 2023 16:23:09 GMT+0000 (UTC)

arXiv
