Visually-Grounded Descriptions Improve Zero-Shot Image Classification

Michael Ogezi; Bradley Hauer; Grzegorz Kondrak

Language-vision models like CLIP have made significant progress in zero-shot vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive class descriptions remains a major challenge. Existing approaches suffer from granularity and label ambiguity issues. To tackle these challenges, we propose V-GLOSS: Visual Glosses, a novel method leveraging modern language models and semantic knowledge bases to produce visually-grounded class descriptions. We demonstrate V-GLOSS's effectiveness by achieving state-of-the-art results on benchmark ZSIC datasets including ImageNet and STL-10. In addition, we introduce a silver dataset with class descriptions generated by V-GLOSS, and show its usefulness for vision tasks. We make our code and dataset available.
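
To make the role of class descriptions in ZSIC concrete, here is a minimal sketch of the usual scoring step: an image embedding is compared against one text embedding per class description, and the highest-scoring class is chosen. This is not the V-GLOSS implementation. The function names, the "crane" labels, and the three-dimensional toy vectors are all hypothetical stand-ins; in practice the embeddings would come from the image and text encoders of a model such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_embedding, description_embeddings):
    """Return the best-matching label and the per-label similarity scores."""
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in description_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings (assumption): real ones would be produced by
# encoding each class description and the image with a language-vision model.
description_embeddings = {
    "crane (bird)": [0.9, 0.1, 0.0],
    "crane (machine)": [0.1, 0.9, 0.2],
}
image_embedding = [0.8, 0.2, 0.1]

label, scores = classify(image_embedding, description_embeddings)
print(label)
```

With a bare label such as "crane", both senses would share a single text embedding; giving each sense its own description lets the scoring step tell them apart, which is the kind of label ambiguity the abstract refers to.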

updated: Fri Jun 23 2023 16:29:51 GMT+0000 (UTC)

published: Mon Jun 05 2023 17:22:54 GMT+0000 (UTC)

arXiv
