Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, an association that proves effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique for this task, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS). PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout: by iteratively dropping the patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS COCO, and +11.4% mIoU on ADE-20K) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs. Our codebase is at
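The iterative dropping idea behind Salience Dropout can be illustrated with a minimal sketch. This is not the released implementation: `attention_fn` is a hypothetical stand-in for the VLM's text-to-image cross-attention, and the fixed drop fraction, the number of rounds, and the summing of attention across rounds are all assumptions made for illustration.

```python
import numpy as np

def salience_dropout(attention_fn, num_patches, rounds=3, drop_frac=0.5):
    """Illustrative sketch: repeatedly drop the most-attended patches.

    attention_fn (hypothetical): takes a boolean keep-mask over patches and
    returns one attention score per patch for the patches still visible.
    """
    keep = np.ones(num_patches, dtype=bool)   # patches still shown to the model
    accumulated = np.zeros(num_patches)       # attention summed over rounds (assumption)
    for _ in range(rounds):
        if not keep.any():
            break
        # Score only the visible patches; dropped patches contribute nothing.
        scores = np.where(keep, attention_fn(keep), 0.0)
        accumulated += scores
        # Drop the top fraction of the visible patches by attention score.
        kept_idx = np.flatnonzero(keep)
        n_drop = max(1, int(len(kept_idx) * drop_frac))
        top = kept_idx[np.argsort(scores[kept_idx])[-n_drop:]]
        keep[top] = False
    return accumulated, keep

def toy_attention(keep):
    """Toy stand-in for a VLM: fixed salience, renormalized over visible patches."""
    base = np.array([0.9, 0.6, 0.3, 0.1])
    masked = np.where(keep, base, 0.0)
    total = masked.sum()
    return masked / total if total > 0 else masked

acc, keep = salience_dropout(toy_attention, num_patches=4, rounds=2)
```

In this toy setting, once the two most salient patches are dropped, the renormalized attention shifts to the remaining ones, so weaker regions of an object can contribute to the accumulated map instead of being overshadowed.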
updated: Fri Mar 29 2024 02:18:40 GMT+0000 (UTC)
published: Tue Nov 28 2023 06:42:58 GMT+0000 (UTC)
