Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Zhenlin Xu; Yi Zhu; Tiffany Deng; Abhay Mittal; Yanbei Chen; Manchen Wang; Paolo Favaro; Joseph Tighe; Davide Modolo

This paper introduces innovative benchmarks to evaluate Vision-Language Models (VLMs) in real-world zero-shot recognition tasks, focusing on the granularity and specificity of prompting text. We propose a unique evaluation protocol using adapted ImageNet and MS-COCO datasets to assess models' consistency in recognizing concepts at varying granularity levels and their sensitivity to the specificity of language inputs. Our extensive evaluation reveals that state-of-the-art VLMs, including contrastive models like CLIP, struggle with granularity and are sensitive to text specificity, impacting their effectiveness in open-world settings. This comprehensive study, a first in evaluating VLMs from these perspectives, provides valuable insights and tools for the community, highlighting the limitations and paving the way for enhanced models with better generalization in zero-shot recognition.

updated: Mon Jan 29 2024 10:45:58 GMT+0000 (UTC)

published: Wed Jun 28 2023 09:29:06 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト