MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu; Zhengyuan Yang; Linjie Li; Jianfeng Wang; Kevin Lin; Zicheng Liu; Xinchao Wang; Lijuan Wang

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on a blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. The problems include: (1) how to systematically structure and evaluate complicated multimodal tasks; (2) how to design evaluation metrics that work well across question and answer types; and (3) how to provide insights into models beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model that can integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from combinations of these capabilities. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models. Code and data are available at https://github.com/yuweihao/MM-Vet.
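
To make the idea of a unified score reported per capability and per capability integration more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each sample carries a set of capability labels and a single evaluator score in [0, 1]; the function name `aggregate_scores`, the labels `rec`, `ocr`, and `math`, and the toy scores are all illustrative assumptions rather than values taken from the benchmark.

```python
from collections import defaultdict


def aggregate_scores(samples):
    """Average per-sample scores overall, per capability, and per integration.

    `samples` is an iterable of (capabilities, score) pairs, where
    `capabilities` is a set of labels and `score` is assumed to lie in [0, 1].
    """
    per_capability = defaultdict(list)
    per_integration = defaultdict(list)
    all_scores = []

    for capabilities, score in samples:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        all_scores.append(score)
        # An integration is the exact set of capabilities a sample requires.
        per_integration[frozenset(capabilities)].append(score)
        # A sample counts toward every capability it requires.
        for capability in capabilities:
            per_capability[capability].append(score)

    if not all_scores:
        raise ValueError("no samples to aggregate")

    def mean(values):
        return sum(values) / len(values)

    return {
        "total": mean(all_scores),
        "per_capability": {c: mean(v) for c, v in per_capability.items()},
        "per_integration": {
            "+".join(sorted(k)): mean(v) for k, v in per_integration.items()
        },
    }


# Toy data (made up for illustration only).
toy_samples = [
    ({"rec"}, 1.0),
    ({"ocr", "math"}, 0.5),
    ({"rec", "ocr"}, 0.0),
    ({"ocr", "math"}, 1.0),
]
report = aggregate_scores(toy_samples)
```

Keeping the per-integration breakdown separate from the per-capability one shows how a model can do well on a capability in isolation while doing poorly when that capability must be combined with others, which is the kind of insight beyond a single ranking that the abstract describes.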

updated: Tue Oct 24 2023 07:59:31 GMT+0000 (UTC)

published: Fri Aug 04 2023 17:59:47 GMT+0000 (UTC)

arXiv
