TouchStone: Evaluating Vision-Language Models by Language Models

Shuai Bai; Shusheng Yang; Jinze Bai; Peng Wang; Xingxuan Zhang; Junyang Lin; Xinggang Wang; Chang Zhou; Jingren Zhou

Large vision-language models (LVLMs) have recently advanced rapidly, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptors to large language models (LLMs). However, current assessments focus mainly on recognition and reasoning abilities; they lack direct evaluation of conversational skills and neglect visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. First, we construct a comprehensive visual dialogue dataset, TouchStone, consisting of open-world images and questions and covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Second, by integrating detailed image annotations, we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs to directly evaluate the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLM evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.
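The judging step described in the abstract can be pictured with a minimal sketch: a detailed text annotation stands in for the image, and a text-only LLM is asked to score an answer. Everything below is an illustrative assumption, not the TouchStone implementation: the prompt wording, the 1 to 10 scale, the `Score:` reply format, and the names `build_judge_prompt`, `parse_score`, `judge`, and `fake_llm` are all hypothetical. The actual prompts and scoring protocol are in the repository linked above.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(annotation: str, question: str, answer: str) -> str:
    """Assemble a text-only prompt; the annotation substitutes for the image.

    The wording and the 1-10 scale are assumptions made for illustration.
    """
    return (
        "You are grading an assistant's answer to a question about an image.\n"
        "You cannot see the image; rely on this description instead.\n\n"
        f"[Image description]\n{annotation}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant answer]\n{answer}\n\n"
        "Rate the answer from 1 to 10 and reply as 'Score: <number>'."
    )


def parse_score(reply: str) -> Optional[float]:
    """Extract the numeric score from the judge's reply, or None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None


def judge(
    annotation: str,
    question: str,
    answer: str,
    llm: Callable[[str], str],
) -> Optional[float]:
    """Score one answer with any callable that maps a prompt to a reply."""
    return parse_score(llm(build_judge_prompt(annotation, question, answer)))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; always returns the same reply."""
    return "Score: 7"


if __name__ == "__main__":
    score = judge(
        annotation="A brown dog is catching a red frisbee on a beach.",
        question="What is the dog doing?",
        answer="The dog is jumping to catch a frisbee.",
        llm=fake_llm,
    )
    print(score)
```

Passing the judge as a plain callable keeps the sketch independent of any particular LLM API; a real setup would replace `fake_llm` with a function that sends the prompt to the chosen judge model.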

updated: Mon Sep 04 2023 15:06:15 GMT+0000 (UTC)

published: Thu Aug 31 2023 17:52:04 GMT+0000 (UTC)

arXiv
