Visual Classification via Description from Large Language Models

Sachit Menon; Carl Vondrick

Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
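The abstract describes scoring an image against descriptive features rather than a bare category name. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that each class score is the mean cosine similarity between the image embedding and the embeddings of that class's descriptors; the abstract does not state the aggregation rule. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real VLM embeddings (e.g., from CLIP) and for descriptors obtained from a language model.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def classify_by_description(image_emb, descriptor_embs):
    """Pick the class whose descriptors best match the image.

    image_emb: 1-D array, the embedded query image.
    descriptor_embs: dict mapping class name -> 2-D array with one
        embedded descriptor per row.
    Assumption: a class score is the mean cosine similarity over its
    descriptors. Per-descriptor similarities are returned as well, so
    the decision can be inspected feature by feature.
    """
    image_emb = normalize(image_emb)
    sims = {c: normalize(d) @ image_emb for c, d in descriptor_embs.items()}
    scores = {c: float(s.mean()) for c, s in sims.items()}
    best = max(scores, key=scores.get)
    return best, scores, sims


# Toy embeddings (hypothetical). In practice these would come from the
# image and text encoders of a VLM.
image = np.array([1.0, 0.2, 0.0])
descriptors = {
    # e.g. "stripes", "claws"
    "tiger": np.array([[1.0, 0.0, 0.0], [0.9, 0.3, 0.0]]),
    # e.g. "black and white stripes", "mane"
    "zebra": np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]]),
}

best, scores, sims = classify_by_description(image, descriptors)
print(best, scores)
```

Because the per-descriptor similarities are kept, one can see which features drove the prediction, and editing the descriptor list for a class changes the criteria used for the decision.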

updated: Thu Oct 13 2022 17:03:46 GMT+0000 (UTC)

published: Thu Oct 13 2022 17:03:46 GMT+0000 (UTC)

arXiv
