Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Zhenlin Xu; Yi Zhu; Tiffany Deng; Abhay Mittal; Yanbei Chen; Manchen Wang; Paolo Favaro; Joseph Tighe; Davide Modolo

This paper investigates the challenges of applying vision-language models (VLMs) to zero-shot visual recognition tasks in an open-world setting, with a focus on contrastive vision-language models such as CLIP. We first examine the performance of VLMs on concepts at different levels of granularity. We propose a way to fairly evaluate the performance discrepancy under two experimental setups and find that VLMs are better at recognizing fine-grained concepts. Furthermore, we find that the similarity scores from VLMs do not strictly reflect the correctness of the textual inputs given a visual input. We propose an evaluation protocol to test two hypotheses: that the scores can be biased towards more informative descriptions, and that the nature of the similarity score between embeddings makes it challenging for VLMs to distinguish correct descriptions from similar but wrong ones. Our study highlights the challenges of using VLMs in open-world settings and suggests directions for future research to improve their zero-shot capabilities.
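For readers unfamiliar with how contrastive VLMs are used for zero-shot recognition, the following is a minimal sketch of the scoring step, not the authors' evaluation protocol. It assumes image and text embeddings have already been produced by some encoder; the vectors, the prompt strings, the function names, and the `temperature` value are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(image_embedding, text_embeddings, temperature=0.01):
    """Score each candidate text against one image embedding.

    Returns (similarities, probabilities), both dicts keyed by text.
    The probabilities are a softmax over the candidates only, so they
    are relative to the candidate set that was supplied.
    """
    sims = {
        text: cosine_similarity(image_embedding, emb)
        for text, emb in text_embeddings.items()
    }
    logits = {text: s / temperature for text, s in sims.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {text: math.exp(l - max_logit) for text, l in logits.items()}
    total = sum(exps.values())
    probs = {text: e / total for text, e in exps.items()}
    return sims, probs


# Hypothetical toy embeddings (made up, not produced by any real model).
image = [1.0, 0.0, 0.0]
texts = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

sims, probs = zero_shot_scores(image, texts)
prediction = max(probs, key=probs.get)

if __name__ == "__main__":
    for text in texts:
        print(f"{text}: similarity={sims[text]:.3f}, probability={probs[text]:.3f}")
    print("predicted:", prediction)
```

Because the softmax is taken over the supplied candidates, the probabilities always sum to 1. A high probability therefore only says that one description is closer to the image than the others in the list, not that the description is correct, which is one reason raw similarity scores are hard to interpret in an open-world setting.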

updated: Wed Jun 28 2023 09:29:06 GMT+0000 (UTC)

published: Wed Jun 28 2023 09:29:06 GMT+0000 (UTC)

arXiv
