Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Renrui Zhang; Xiangfei Hu; Bohao Li; Siyuan Huang; Hanqiu Deng; Hongsheng Li; Yu Qiao; Peng Gao

Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training. We then ask whether more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning. CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. First, we leverage GPT-3 to produce textual inputs for prompting CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any human effort. Finally, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to achieve state-of-the-art few-shot classification performance. Code is available at https://github.com/ZrrSkywalker/CaFo.
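The 'Cache' step, blending CLIP and DINO predictions, can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the key-value lookup with an exponential similarity kernel, the agreement-based weighting, the `beta` parameter, and the function names `cache_logits` and `blend` are all hypothetical and are not taken from the CaFo code. The sketch is also training-free, whereas the abstract describes the cache model as learnable. Toy vectors stand in for real CLIP and DINO features.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def cache_logits(query, keys, labels, num_classes, beta=5.0):
    """Key-value cache lookup (assumed form): each cached few-shot
    feature votes for its label, weighted by exp(-beta * (1 - cosine))."""
    q = normalize(query)
    logits = [0.0] * num_classes
    for key, label in zip(keys, labels):
        sim = sum(a * b for a, b in zip(q, normalize(key)))
        logits[label] += math.exp(-beta * (1.0 - sim))
    return logits

def blend(zero_shot, cache_preds):
    """Assumed adaptive blend: weight each cache prediction by how much
    its class distribution agrees with the zero-shot distribution."""
    p0 = softmax(zero_shot)
    agreement = [sum(a * b for a, b in zip(p0, softmax(p))) for p in cache_preds]
    weights = softmax(agreement)
    return [
        z + sum(w * p[c] for w, p in zip(weights, cache_preds))
        for c, z in enumerate(zero_shot)
    ]

# Toy data: two classes, one cached sample per class in each feature space.
labels = [0, 1]
clip_keys = [[1.0, 0.0], [0.0, 1.0]]            # stand-in for CLIP image features
dino_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # stand-in for DINO features

zero_shot = [2.0, 1.0]  # stand-in for CLIP zero-shot logits from text prompts
clip_cache = cache_logits([0.9, 0.1], clip_keys, labels, num_classes=2)
dino_cache = cache_logits([0.8, 0.2, 0.1], dino_keys, labels, num_classes=2)

final = blend(zero_shot, [clip_cache, dino_cache])
predicted_class = max(range(len(final)), key=lambda c: final[c])
```

In this sketch a cache that disagrees with the zero-shot prediction receives a smaller weight, which is one simple way to make the blend adaptive per test sample rather than fixed.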

updated: Fri Mar 03 2023 18:58:16 GMT+0000 (UTC)

published: Fri Mar 03 2023 18:58:16 GMT+0000 (UTC)

arXiv
