Multimodal Parameter-Efficient Few-Shot Class Incremental Learning

Marco D'Alessandro; Alberto Alonso; Enrique Calabrés; Mikel Galar

Few-Shot Class Incremental Learning (FSCIL) is a challenging continual learning task in which only a few training examples are available in each of several learning sessions. To succeed in this task, it is necessary to avoid overfitting to new classes, which is caused by the biased distributions of the few-shot training sets. The general approach to this issue is to enhance the representational capability of a pre-defined backbone architecture by adding special modules for backward compatibility with older classes. However, this approach has not yet solved the dilemma of ensuring high classification accuracy over time while reducing the gap between the performance obtained on larger training sets and on smaller ones. In this work, we propose an alternative approach called Continual Parameter-Efficient CLIP (CPE-CLIP) to reduce the loss of information between learning sessions. Instead of adapting additional modules to address information loss, we leverage the vast knowledge acquired by CLIP in large-scale pre-training and its effectiveness in generalizing to new concepts. Our approach is multimodal and parameter-efficient, relying on learnable prompts for both the language and vision encoders to enable transfer learning across sessions. We also introduce prompt regularization to improve performance and prevent forgetting. Our experimental results demonstrate that CPE-CLIP significantly improves FSCIL performance compared to state-of-the-art proposals while also drastically reducing the number of learnable parameters and training costs.
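The abstract does not spell out how the learnable prompts or the prompt regularization are implemented. As a rough illustration of the general idea only, the sketch below prepends learnable prompt vectors to a frozen encoder's token embeddings and penalizes how far the prompts drift from their values at the end of the previous session. Everything here is an assumption made for illustration: the function names (`prepend_prompts`, `prompt_drift_penalty`, `regularized_step`), the L2 form of the penalty, and the plain gradient step are hypothetical and are not taken from CPE-CLIP.

```python
# Illustrative sketch only. The function names and the L2 drift penalty are
# assumptions; the abstract does not specify the actual CPE-CLIP regularizer.

def prepend_prompts(prompts, tokens):
    """Build the encoder input: learnable prompt vectors, then frozen token embeddings."""
    return [list(p) for p in prompts] + [list(t) for t in tokens]


def prompt_drift_penalty(prompts, previous_prompts, weight=0.1):
    """Weighted squared L2 distance between current and previous-session prompts."""
    total = 0.0
    for current, previous in zip(prompts, previous_prompts):
        total += sum((a - b) ** 2 for a, b in zip(current, previous))
    return weight * total


def regularized_step(prompts, task_grads, previous_prompts, lr=0.1, weight=0.1):
    """One gradient step on (task loss + drift penalty). Only the prompts change;
    the encoder weights are assumed frozen and do not appear here."""
    updated = []
    for current, grad, previous in zip(prompts, task_grads, previous_prompts):
        updated.append([
            a - lr * (g + 2.0 * weight * (a - b))  # task gradient + penalty gradient
            for a, g, b in zip(current, grad, previous)
        ])
    return updated
```

In a setup like this, only the prompt vectors would be trained in each session, which is one way a method can keep the number of learnable parameters small while the pre-trained encoders stay untouched.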

updated: Wed Mar 08 2023 17:34:15 GMT+0000 (UTC)

published: Wed Mar 08 2023 17:34:15 GMT+0000 (UTC)

arXiv
