Exploiting Category Names for Few-Shot Classification with Vision-Language Models

Taihong Xiao; Zirui Wang; Liangliang Cao; Jiahui Yu; Shengyang Dai; Ming-Hsuan Yang

Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that map the two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks such as image classification. However, when only a few examples per category are available, the potential of large vision-language models is often not fully realized, mainly because of the gap between the large number of parameters and the relatively small amount of training data. This paper shows that few-shot classification performance can be improved significantly by using the category names to initialize the classification head. More interestingly, imperfect category names, or even names from a foreign language, can be borrowed to improve few-shot classification performance compared with random initialization. With the proposed category name initialization method, our model achieves state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning). We also investigate and analyze when the benefit of category names diminishes and how to use distillation to improve the performance of smaller models, providing guidance for future research.
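To make the idea of category name initialization concrete, the following is a minimal sketch, not the authors' implementation. It assumes the classification head is a linear layer applied to L2-normalized image embeddings, and that each category name has already been encoded by the text encoder into the shared embedding space. The random `name_emb` array stands in for those text embeddings, and the function names `init_head_from_names` and `logits` are hypothetical. The abstract does not specify the architecture or training procedure, so those details are omitted here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def init_head_from_names(name_embeddings):
    """Use normalized category-name embeddings as the head weights.

    name_embeddings: (num_classes, dim) array, assumed to come from the
    text encoder of a vision-language model (stand-in data below).
    """
    return l2_normalize(np.asarray(name_embeddings, dtype=np.float64))

def logits(image_embeddings, head_weights):
    """Cosine-similarity logits between image embeddings and head rows."""
    return l2_normalize(image_embeddings) @ head_weights.T

# Stand-in data (assumption): random vectors instead of real encoder outputs.
rng = np.random.default_rng(0)
num_classes, dim = 5, 16
name_emb = rng.normal(size=(num_classes, dim))

# Simulated image embeddings that lie close to their class-name embedding,
# mimicking an aligned image/text embedding space.
labels = np.array([0, 3, 1, 4, 2])
image_emb = name_emb[labels] + 0.05 * rng.normal(size=(len(labels), dim))

W_names = init_head_from_names(name_emb)                           # name-based init
W_random = l2_normalize(rng.normal(size=(num_classes, dim)))       # random init

pred_names = logits(image_emb, W_names).argmax(axis=1)
pred_random = logits(image_emb, W_random).argmax(axis=1)

print("accuracy before any training, name init:  ", (pred_names == labels).mean())
print("accuracy before any training, random init:", (pred_random == labels).mean())
```

In this toy setting the name-initialized head already separates the classes before any few-shot fine-tuning, whereas the randomly initialized head carries no information about the categories and has to learn everything from the few labeled examples.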

updated: Sun Dec 04 2022 00:59:12 GMT+0000 (UTC)

published: Tue Nov 29 2022 21:08:46 GMT+0000 (UTC)

arXiv
