I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Muhammad Ferjad Naeem; Muhammad Gul Zain Ali Khan; Yongqin Xian; Muhammad Zeshan Afzal; Didier Stricker; Luc Van Gool; Federico Tombari

Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source such as Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. Conditioned on these examples, the LLM generates multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification from these class views. We show that each text view of a class provides complementary information, allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM than baseline models. I2MVFormer establishes a new state of the art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
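
The few-shot conditioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt layout, the function names `build_view_prompt` and `build_multi_view_prompts`, and the idea of issuing one prompt per annotator are all assumptions made for illustration. The sketch only builds the prompts; sending them to an LLM is left out.

```python
from typing import Dict, List


def build_view_prompt(examples: Dict[str, str], target_class: str) -> str:
    """Build one few-shot prompt from a single annotator's example descriptions.

    `examples` maps a class name to that annotator's description of it.
    The "Class:/Description:" layout is a hypothetical format, not the paper's.
    """
    lines: List[str] = []
    for class_name, description in examples.items():
        lines.append(f"Class: {class_name}")
        lines.append(f"Description: {description}")
        lines.append("")  # blank line between examples
    # Leave the final description open so the LLM completes it.
    lines.append(f"Class: {target_class}")
    lines.append("Description:")
    return "\n".join(lines)


def build_multi_view_prompts(
    annotator_examples: List[Dict[str, str]], target_class: str
) -> List[str]:
    """Return one prompt per annotator; each completion would be one view."""
    return [build_view_prompt(ex, target_class) for ex in annotator_examples]
```

Under these assumptions, each annotator's examples steer the LLM toward a different style of description, so the completions for the same class act as separate views. The example descriptions passed in would be placeholders supplied by the user.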

updated: Mon Dec 05 2022 14:11:36 GMT+0000 (UTC)

published: Mon Dec 05 2022 14:11:36 GMT+0000 (UTC)

arXiv
