LOVM: Language-Only Vision Model Selection

Orr Zohar; Shih-Cheng Huang; Kuan-Chieh Wang; Serena Yeung

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for a given downstream application is non-trivial, as the best choice is dataset- and task-dependent. Meanwhile, exhaustively evaluating all available VLMs on a novel application is not only time-consuming and computationally demanding but also requires collecting a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
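To make the two expected outputs concrete (a ranking of VLMs and a prediction of their zero-shot performance), the following is a minimal sketch of how a method's output could be compared with ground-truth evaluations on one dataset. The abstract does not specify the benchmark's scoring metrics, so the top-k overlap and mean absolute error used here are illustrative assumptions, and the model names and accuracy values are made up.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def top_k_overlap(predicted: Dict[str, float],
                  ground_truth: Dict[str, float], k: int) -> float:
    """Fraction of the true top-k models that also appear in the predicted top-k.

    Assumed ranking metric for illustration; not taken from the paper.
    """
    pred_top = set(rank_models(predicted)[:k])
    true_top = set(rank_models(ground_truth)[:k])
    return len(pred_top & true_top) / k


def mean_absolute_error(predicted: Dict[str, float],
                        ground_truth: Dict[str, float]) -> float:
    """Average absolute gap between predicted and measured accuracy.

    Assumed performance-prediction metric for illustration.
    """
    gaps = [abs(predicted[name] - ground_truth[name]) for name in ground_truth]
    return sum(gaps) / len(gaps)


# Hypothetical zero-shot accuracies for four placeholder models on one dataset.
ground_truth = {"vlm_a": 0.62, "vlm_b": 0.71, "vlm_c": 0.55, "vlm_d": 0.68}
# Hypothetical predictions made from a text description of the task alone.
predicted = {"vlm_a": 0.60, "vlm_b": 0.66, "vlm_c": 0.58, "vlm_d": 0.70}

print("predicted ranking:", rank_models(predicted))
print("top-2 overlap:", top_k_overlap(predicted, ground_truth, k=2))
print("MAE:", round(mean_absolute_error(predicted, ground_truth), 4))
```

In this toy case the predicted top-2 set matches the true top-2 set even though the top-1 pick differs, which shows why a ranking score and a performance-prediction error are worth reporting separately.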

updated: Thu Jun 15 2023 06:53:05 GMT+0000 (UTC)

published: Thu Jun 15 2023 06:53:05 GMT+0000 (UTC)

arXiv
