Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models

Will LeVine; Benjamin Pikus; Pranav Raja; Fernando Amat Gil

Calibration of deep learning models is crucial to their trustworthiness and safe usage, and as such, has been extensively studied in supervised classification models, with methods crafted to decrease miscalibration. However, there has yet to be a comprehensive study of the calibration of vision-language models that are used for zero-shot inference, like CLIP. We measure calibration across relevant variables like prompt, dataset, and architecture, and find that zero-shot inference with CLIP is miscalibrated. Furthermore, we propose a modified version of temperature scaling that is aligned with the common use cases of CLIP as a zero-shot inference model, and show that a single learned temperature generalizes for each specific CLIP model (defined by a chosen pre-training dataset and architecture) across inference dataset and prompt choice.
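The two ingredients named in the abstract, measuring miscalibration and correcting it with a learned temperature, can be illustrated with a minimal sketch. It shows standard temperature scaling and expected calibration error (ECE), not the modified variant the authors propose, whose details the abstract does not give. All function names are hypothetical, the temperature is found by a simple grid search over negative log-likelihood, and the synthetic logits stand in for the image-text similarity logits a CLIP model would produce.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin-weighted mean gap between accuracy and confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    p = softmax(logits, temperature)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a log-spaced grid that minimises NLL (assumed search range)."""
    if grid is None:
        grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 400))
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def make_overconfident_logits(n=5000, k=10, scale=3.0, seed=0):
    """Synthetic stand-in for zero-shot logits: calibrated logits inflated by `scale`."""
    rng = np.random.default_rng(seed)
    base = rng.normal(size=(n, k)) * 2.0
    probs = softmax(base)
    u = rng.random(n)
    labels = (probs.cumsum(axis=1) > u[:, None]).argmax(axis=1)  # sample labels from probs
    return base * scale, labels


if __name__ == "__main__":
    logits, labels = make_overconfident_logits()
    t = fit_temperature(logits, labels)
    print("fitted temperature:", t)
    print("ECE before:", expected_calibration_error(softmax(logits), labels))
    print("ECE after: ", expected_calibration_error(softmax(logits, t), labels))
```

In this sketch the temperature is fitted and evaluated on the same synthetic data for brevity. The abstract's claim is about generalization, so a faithful experiment would fit the temperature once per CLIP model and evaluate it on different inference datasets and prompts.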

updated: Tue Apr 18 2023 18:28:51 GMT+0000 (UTC)

published: Sat Mar 11 2023 17:14:04 GMT+0000 (UTC)

arXiv
