Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

Chengcheng Ma; Yang Liu; Jiankang Deng; LingXi Xie; Weiming Dong; Changsheng Xu

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks when given appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has recently been proposed to learn continuous prompts from task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from overfitting in two respects: (i) the test accuracy on base classes first improves and then degrades during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies understands or mitigates this overfitting problem effectively. In this paper, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable features in the early training stage and spurious features in the later stage, leading to the non-overfitting and overfitting phenomena, respectively. Given these observations, we propose Subspace Prompt Tuning (SubPT), which projects the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors throughout the entire training process and successfully eliminates the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts to novel categories beyond the training set, without requiring image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art approach CoCoOp. Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Code is available at https://tinyurl.com/mpe64f89.
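
The gradient projection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that early-stage gradients are stacked as rows of a matrix, that the "gradient flow eigenvectors" can be approximated by the leading right singular vectors of that matrix, and that the subspace rank is a user-chosen hyperparameter. The function names `early_stage_basis` and `project_gradient` and the parameter `rank` are hypothetical.

```python
import numpy as np


def early_stage_basis(early_grads, rank):
    """Orthonormal basis (d x rank) for the dominant early-stage gradient directions.

    Assumption: the right singular vectors of the stacked gradient matrix G
    (equivalently, the eigenvectors of G^T G) stand in for the early-stage
    gradient flow eigenvectors mentioned in the abstract.
    """
    G = np.asarray(early_grads, dtype=float)  # shape (T, d): T early steps
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    return vt[:rank].T  # columns are orthonormal basis vectors


def project_gradient(grad, basis):
    """Project a gradient onto the subspace spanned by the columns of `basis`."""
    return basis @ (basis.T @ grad)


# Toy example: early gradients only move along the first two coordinates.
early = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 2.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
U = early_stage_basis(early, rank=2)

late_grad = np.array([1.0, 2.0, 3.0, 4.0])
projected = project_gradient(late_grad, U)
# Components outside the early-stage subspace (coordinates 3 and 4) are removed.
```

In a training loop, a projection of this kind would be applied to each prompt gradient before the optimizer step, so that later updates stay within the directions observed early in training.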

updated: Fri Nov 04 2022 02:06:22 GMT+0000 (UTC)

published: Fri Nov 04 2022 02:06:22 GMT+0000 (UTC)

arXiv
