Improving Vision Transformers for Incremental Learning

Pei Yu; Yinpeng Chen; Ying Jin; Zicheng Liu

This paper proposes a working recipe for using Vision Transformers (ViT) in class-incremental learning. Although the recipe only combines existing techniques, developing the combination is not trivial. First, naively replacing convolutional neural networks (CNNs) with ViT in incremental learning causes serious performance degradation. Second, we identify three issues with naively using ViT: (a) ViT converges very slowly when the number of classes is small, (b) ViT shows more bias towards new classes than CNN-based architectures, and (c) the conventional learning rate for ViT is too low to learn a good classifier layer. Finally, our solution, named ViTIL (ViT for Incremental Learning), achieves a new state of the art on both the CIFAR and ImageNet datasets for all three class-incremental learning setups by a clear margin. We believe this advances the understanding of transformers in the incremental learning community. Code will be publicly released.
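
For readers unfamiliar with the setting, the sketch below illustrates how a class-incremental protocol typically divides a label set into an initial group of classes followed by equally sized increments, with evaluation over all classes seen so far. It is a minimal illustration, not the authors' code: the function name `make_class_increments` and the example numbers (100 classes, 50 initial classes, 10 new classes per step) are assumptions chosen for illustration and are not taken from the abstract.

```python
import random


def make_class_increments(num_classes, base_classes, classes_per_step, seed=0):
    """Split class ids into an initial group plus equally sized increments.

    Hypothetical helper for illustration only; the split sizes are assumptions.
    """
    if not 0 < base_classes <= num_classes:
        raise ValueError("base_classes must be in the range 1..num_classes")
    remaining = num_classes - base_classes
    if classes_per_step <= 0 or remaining % classes_per_step != 0:
        raise ValueError("remaining classes must divide evenly into steps")

    # Shuffle the class order with a fixed seed so the split is reproducible.
    order = list(range(num_classes))
    random.Random(seed).shuffle(order)

    # The first step holds the initial classes; each later step adds new ones.
    steps = [order[:base_classes]]
    for start in range(base_classes, num_classes, classes_per_step):
        steps.append(order[start:start + classes_per_step])
    return steps


# Example: 100 classes, 50 learned first, then 10 new classes per step.
steps = make_class_increments(num_classes=100, base_classes=50, classes_per_step=10)

seen = []
for index, new_classes in enumerate(steps):
    seen.extend(new_classes)
    # After each step, the model would be evaluated on every class seen so far.
    print(f"step {index}: {len(new_classes)} new classes, {len(seen)} seen in total")
```

In such a protocol, the bias towards new classes mentioned in issue (b) would show up as lower accuracy on the classes from earlier steps than on the most recently added ones.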

updated: Fri Apr 15 2022 18:25:31 GMT+0000 (UTC)

published: Sun Dec 12 2021 00:12:33 GMT+0000 (UTC)

arXiv
