Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models

Zangwei Zheng; Mingyuan Ma; Kai Wang; Ziheng Qin; Xiangyu Yue; Yang You

Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training. Nevertheless, during continual training of the Contrastive Language-Image Pre-training (CLIP) model, we observe that the model's zero-shot transfer ability degrades significantly due to catastrophic forgetting. Existing CL methods can mitigate forgetting by replaying previous data. However, since the CLIP dataset is private, replay methods cannot access the pre-training dataset. In addition, replaying data from previously learned downstream tasks can improve performance on those tasks, but at the cost of zero-shot performance. To address this challenge, we propose a novel method, ZSCL, that prevents zero-shot transfer degradation in the continual learning of vision-language models in both feature space and parameter space. In feature space, a reference dataset is introduced for distillation between the current and initial models. The reference dataset should be semantically diverse, but it need not be labeled, seen in pre-training, or composed of matched image-text pairs. In parameter space, we prevent large parameter shifts by averaging weights during training. We propose a more challenging Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different methods, in which tasks come from various domains rather than being class-separated splits of a single dataset. Our method outperforms other methods in the traditional class-incremental learning setting and on MTIL by 9.7% in average score. Our code is available at https://github.com/Thunderbeee/ZSCL.
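
The two ingredients named in the abstract, distillation against the initial model in feature space and weight averaging in parameter space, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the choice of a cross-entropy form for the distillation term, the temperature parameter, and the equal-weight running mean are all assumptions made for illustration, and the abstract does not specify the exact loss or averaging schedule.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's.

    Assumption: the teacher is the frozen initial model and the logits are
    similarity scores computed on a sample from the reference dataset.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def update_running_average(avg_weights, new_weights, num_averaged):
    """Fold one new weight snapshot into a mean of num_averaged earlier ones.

    Assumption: every snapshot gets equal weight in the average.
    """
    return [
        (a * num_averaged + w) / (num_averaged + 1)
        for a, w in zip(avg_weights, new_weights)
    ]


if __name__ == "__main__":
    # A student that has drifted from the teacher incurs a higher loss
    # than one that still matches it.
    matched = distillation_loss([0.0, 0.0], [0.0, 0.0])
    drifted = distillation_loss([2.0, 0.0], [0.0, 0.0])
    print(matched < drifted)

    # Averaging keeps the stored weights between successive snapshots.
    print(update_running_average([1.0, 2.0], [3.0, 4.0], num_averaged=1))
```

In this sketch the distillation term is smallest when the student reproduces the teacher's distribution, which is the sense in which it discourages drift away from the initial model, while the running mean limits how far any single training step can move the stored weights.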

updated: Fri Aug 11 2023 15:56:32 GMT+0000 (UTC)

published: Sun Mar 12 2023 10:28:07 GMT+0000 (UTC)

arXiv
