Meta-attention for ViT-backed Continual Learning

Mengqi Xue; Haofei Zhang; Jie Song; Mingli Song

Continual learning is a longstanding research topic because of its crucial role in handling continually arriving tasks. To date, the study of continual learning in computer vision has been largely restricted to convolutional neural networks (CNNs). Recently, however, the newly emerging vision transformers (ViTs) have increasingly come to dominate the field of computer vision, which leaves CNN-based continual learning methods lagging behind, since they can suffer severe performance degradation when applied directly to ViTs. In this paper, we study ViT-backed continual learning to pursue higher performance by building on recent advances in ViTs. Inspired by mask-based continual learning methods for CNNs, where a mask is learned per task to adapt the pre-trained CNN to the new task, we propose MEta-ATtention (MEAT), i.e., attention to self-attention, to adapt a pre-trained ViT to new tasks without sacrificing performance on already learned tasks. Unlike prior mask-based methods such as Piggyback, where every parameter is associated with a corresponding mask, MEAT leverages the characteristics of ViTs and masks only a portion of their parameters. This makes MEAT more efficient and effective, with less overhead and higher accuracy. Extensive experiments demonstrate that MEAT significantly outperforms its state-of-the-art CNN counterparts, with absolute accuracy gains of 4.0 to 6.0%. Our code has been released at https://github.com/zju-vipa/MEAT-TIL.
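
The general mask-based idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not reproduce MEAT's meta-attention; it only assumes the generic scheme in which shared pre-trained weights stay frozen and each task owns a small binary mask over a subset of them. All names, weights, and mask scores below are hypothetical.

```python
# Illustrative sketch only: names and numbers are hypothetical,
# not taken from the MEAT-TIL repository.

FROZEN_WEIGHTS = [0.5, -1.2, 0.8, 0.3]  # shared pre-trained parameters, never updated

# Only positions 1 and 3 carry masks, i.e. a portion of the parameters.
MASKABLE = [1, 3]

# One set of real-valued mask scores per task; in practice these would be learned.
TASK_SCORES = {
    "task_a": [0.7, -0.4],
    "task_b": [-0.2, 0.9],
}


def binarize(scores, threshold=0.0):
    """Turn real-valued mask scores into a hard 0/1 mask."""
    return [1 if s > threshold else 0 for s in scores]


def masked_weights(weights, mask, maskable):
    """Return a copy of the weights with the task mask applied at maskable positions."""
    out = list(weights)
    for idx, m in zip(maskable, mask):
        out[idx] = weights[idx] * m
    return out


def forward(weights, inputs):
    """Toy linear layer: dot product of weights and inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def predict(task, inputs):
    """Select the task's mask, apply it to the frozen weights, and run the layer."""
    mask = binarize(TASK_SCORES[task])
    return forward(masked_weights(FROZEN_WEIGHTS, mask, MASKABLE), inputs)
```

Because the shared weights are never modified, adding a mask for a new task cannot change the outputs of earlier tasks; the per-task cost is only the stored mask, which shrinks further when just a portion of the parameters is maskable.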

updated: Tue Mar 22 2022 12:58:39 GMT+0000 (UTC)

published: Tue Mar 22 2022 12:58:39 GMT+0000 (UTC)

arXiv
