Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Jishnu Jaykumar P; Kamalesh Palanisamy; Yu-Wei Chao; Xinya Du; Yu Xiang

We propose a novel framework for few-shot learning that leverages large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce PROTO-CLIP, which uses both image prototypes and text prototypes. Specifically, PROTO-CLIP jointly adapts the image encoder and text encoder in CLIP using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. This alignment benefits few-shot classification because both types of prototypes contribute to the prediction. We demonstrate the effectiveness of our method through experiments on benchmark datasets for few-shot learning and in real-world robot perception.
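
The idea of classifying with both image and text prototypes can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names, the use of negative squared Euclidean distance with a sharpness parameter `beta`, and the mixing weight `alpha` are hypothetical, and the toy 2-D vectors stand in for real CLIP embeddings. The sketch covers only inference and does not include the encoder adaptation or the prototype alignment step.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_prototypes(embeddings, labels, num_classes):
    """Average the normalized support embeddings of each class.

    Assumes every class has at least one support example.
    """
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for emb, label in zip(embeddings, labels):
        for i, x in enumerate(normalize(emb)):
            sums[label][i] += x
        counts[label] += 1
    return [
        normalize([s / counts[k] for s in sums[k]])
        for k in range(num_classes)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_probs(query, prototypes, beta):
    """Class probabilities from negative squared distances to prototypes."""
    scores = [
        -beta * sum((q - p) ** 2 for q, p in zip(query, proto))
        for proto in prototypes
    ]
    return softmax(scores)


def classify(query, image_protos, text_protos, alpha=0.5, beta=10.0):
    """Mix image-prototype and text-prototype probabilities.

    alpha and beta are illustrative hyperparameters (assumed, not taken
    from the abstract).
    """
    q = normalize(query)
    p_img = prototype_probs(q, image_protos, beta)
    p_txt = prototype_probs(q, text_protos, beta)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_img, p_txt)]


# Toy example: two classes, two support images each, 2-D "embeddings".
support = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
support_labels = [0, 0, 1, 1]
image_protos = class_prototypes(support, support_labels, num_classes=2)
# Stand-ins for text-encoder outputs of the two class names.
text_protos = [normalize([1.0, 0.2]), normalize([0.2, 1.0])]

probs = classify([0.8, 0.2], image_protos, text_protos)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

Because each prototype set yields a proper probability distribution, any convex combination of the two is also a distribution, which is why a single weight is enough to balance the two modalities in this sketch.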

updated: Thu Jul 06 2023 15:41:53 GMT+0000 (UTC)

published: Thu Jul 06 2023 15:41:53 GMT+0000 (UTC)

arXiv
