CPL: Counterfactual Prompt Learning for Vision and Language Models

Xuehai He; Diji Yang; Weixi Feng; Tsu-Jui Fu; Arjun Akula; Varun Jampani; Pradyumna Narayana; Sugato Basu; William Yang Wang; Xin Eric Wang

Prompt tuning is a new few-shot transfer learning technique that tunes only the learnable prompt of pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes a concept change, and learns more generalizable prompt representations from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains better few-shot performance than previous prompt tuning methods on CLIP across different vision and language tasks. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements, respectively, across three few-shot scenarios on unseen test sets.
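
The two ingredients named in the abstract, counterfactual construction and contrastive learning, can be illustrated with a small sketch. This is not the authors' implementation: the elementwise blending mask `u`, the function names, the single-negative InfoNCE-style loss, and the temperature value are all assumptions made here for illustration, and the abstract does not specify how the minimal feature change is found.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def make_counterfactual(pos, neg, u):
    """Blend a positive feature toward a negative one, elementwise.

    u[i] in [0, 1] is a hypothetical mask (an assumption of this sketch):
    0 keeps the positive feature, 1 takes the negative feature.
    """
    return [(1 - ui) * p + ui * n for p, n, ui in zip(pos, neg, u)]

def contrastive_loss(prompt, factual, counterfactual, temperature=0.07):
    """InfoNCE-style loss with the factual feature as the positive and
    the counterfactual feature as the only negative."""
    p = normalize(prompt)
    s_pos = dot(p, normalize(factual)) / temperature
    s_neg = dot(p, normalize(counterfactual)) / temperature
    # Numerically stable log(exp(s_pos) + exp(s_neg)).
    m = max(s_pos, s_neg)
    log_denom = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denom - s_pos

# Toy features: the first two dimensions carry the concept,
# the third is shared between the positive and negative sample.
pos = [1.0, 0.0, 0.5]
neg = [0.0, 1.0, 0.5]
u = [0.5, 0.5, 0.0]  # change only the concept-bearing dimensions

cf = make_counterfactual(pos, neg, u)
loss_aligned = contrastive_loss(pos, pos, cf)     # prompt matches the factual sample
loss_misaligned = contrastive_loss(neg, pos, cf)  # prompt matches the negative sample
```

In this toy setup a prompt vector aligned with the factual feature receives a lower loss than one aligned with the negative sample, which is the behavior a contrastive objective over factual and counterfactual examples is meant to encourage. In an actual prompt tuning setting the prompt representation would be produced by a text encoder from learnable tokens and updated by gradient descent; here it is a fixed vector.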

updated: Wed Oct 19 2022 08:06:39 GMT+0000 (UTC)

published: Wed Oct 19 2022 08:06:39 GMT+0000 (UTC)

arXiv
