VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

Yonatan Bitton; Hritik Bansal; Jack Hessel; Rulin Shao; Wanrong Zhu; Anas Awadalla; Josh Gardner; Rohan Taori; Ludwig Schmidt

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors; e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps and potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's responses on the project website. Data, code, and leaderboard are available at visit-bench.github.io.
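
The win-rate figure quoted in the abstract is a fraction of pairwise comparisons. The following is a minimal sketch of how such a number could be computed from a list of per-comparison verdicts. It is not the authors' evaluation code: the function name `win_rate_vs_reference`, the verdict labels, and the choice to count ties in the denominator are all assumptions made for illustration, and the sample verdicts are synthetic.

```python
from collections import Counter


def win_rate_vs_reference(verdicts):
    """Return the fraction of comparisons won by the model.

    `verdicts` is an iterable of strings, one per pairwise comparison,
    each being "model", "reference", or "tie" (hypothetical labels).
    Ties count toward the total but not toward wins (an assumption).
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one comparison is required")
    return counts["model"] / total


if __name__ == "__main__":
    # Synthetic verdicts chosen only to illustrate the arithmetic;
    # they are not data from the benchmark.
    sample = ["model"] * 27 + ["reference"] * 73
    print(f"win rate: {win_rate_vs_reference(sample):.0%}")
```

The same function applies whether the verdicts come from human annotators or from a text-only LLM judge, since it only consumes the final per-comparison labels.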

updated: Fri Nov 17 2023 18:39:46 GMT+0000 (UTC)

published: Sat Aug 12 2023 15:27:51 GMT+0000 (UTC)

arXiv
