Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Zixian Guo; Bowen Dong; Zhilong Ji; Jinfeng Bai; Yiwen Guo; Wangmeng Zuo

Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g., CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, existing methods require visual data (e.g., images) by default for learning prompts. In this work, we argue that the effectiveness of image-text contrastive learning in aligning the two modalities (as used to train CLIP) makes it feasible to treat texts as images for prompt tuning, and we introduce TaI prompting. In contrast to visual data, text descriptions are easy to collect, and their class labels can be directly derived. In particular, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, building on TaI, we present double-grained prompt tuning (TaI-DPT), which extracts both coarse-grained and fine-grained embeddings to enhance multi-label recognition performance. Experimental results show that the proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, and that it can be combined with existing methods of prompting from images to further improve recognition performance. Code is released at https://github.com/guozix/TaI-DPT.
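The abstract states that class labels can be derived directly from text descriptions but does not say how. One plausible way to picture this is simple keyword matching against a class vocabulary. The sketch below is only an illustration under that assumption: the `CLASS_SYNONYMS` table and the `derive_labels` function are hypothetical names introduced here, not taken from the paper or its released code.

```python
import re

# Hypothetical class vocabulary with a few synonyms per class.
# A real benchmark such as MS-COCO defines its own category list.
CLASS_SYNONYMS = {
    "dog": {"dog", "dogs", "puppy"},
    "person": {"person", "people", "man", "woman"},
    "bicycle": {"bicycle", "bicycles", "bike"},
}


def derive_labels(sentence, class_synonyms=CLASS_SYNONYMS):
    """Return a multi-hot label list, one entry per class in dictionary order.

    An entry is 1 when any synonym of that class occurs as a whole word
    in the sentence, and 0 otherwise. This is an assumed matching rule.
    """
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return [int(bool(tokens & synonyms)) for synonyms in class_synonyms.values()]


caption = "A man rides a bike while his dog runs alongside."
labels = derive_labels(caption)
print(dict(zip(CLASS_SYNONYMS, labels)))
```

Each labeled sentence could then stand in for a labeled image during prompt tuning, in the sense described by the abstract. Exact word matching is a deliberate simplification here; it misses paraphrases and inflections that are not listed in the synonym table.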

updated: Wed Nov 23 2022 07:00:11 GMT+0000 (UTC)

published: Wed Nov 23 2022 07:00:11 GMT+0000 (UTC)

arXiv
