Layerwise Optimization by Gradient Decomposition for Continual Learning

Shixiang Tang; Dapeng Chen; Jinguo Zhu; Shijie Yu; Wanli Ouyang

Deep neural networks achieve state-of-the-art and sometimes super-human performance across various domains. However, when tasks are learned sequentially, a network easily forgets the knowledge of previous tasks, a phenomenon known as "catastrophic forgetting". One effective way to keep the old tasks and the new task consistent is to modify the gradient used for the update. Previous methods enforce independent gradient constraints for different tasks. We instead observe that these gradients contain complex information, and propose to leverage inter-task information through gradient decomposition. In particular, the gradient of an old task is decomposed into a part shared by all old tasks and a part specific to that task. The gradient for the update should be close to the gradient of the new task, consistent with the gradients shared by all old tasks, and orthogonal to the space spanned by the gradients specific to the old tasks. In this way, our approach encourages the consolidation of common knowledge without impairing task-specific knowledge. Furthermore, the optimization is performed on the gradients of each layer separately rather than on the concatenation of all gradients, as in previous works. This effectively avoids the influence of gradient magnitude variation across layers. Extensive experiments validate the effectiveness of both gradient-decomposed optimization and layer-wise updates. Our proposed method achieves state-of-the-art results on various continual learning benchmarks.
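
The three conditions on the update gradient can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the shared part is the mean of the old-task gradients and that the task-specific parts are the residuals, and it replaces the paper's optimization with a simple closed-form projection. The names `decompose`, `project_out`, `layer_update`, and `layerwise_update` are hypothetical.

```python
import numpy as np

def decompose(old_grads):
    """Split old-task gradients (one row per task) into shared and specific parts.

    Assumption for this sketch: shared part = mean over tasks,
    task-specific parts = residuals from that mean.
    """
    G = np.asarray(old_grads, dtype=float)
    shared = G.mean(axis=0)
    specific = G - shared
    return shared, specific

def project_out(g, rows, tol=1e-10):
    """Remove from g its component in the row space of `rows`."""
    _, s, vt = np.linalg.svd(rows, full_matrices=False)
    Q = vt[s > tol]                    # orthonormal basis of the row space
    return g - Q.T @ (Q @ g)

def layer_update(g_new, old_grads):
    """Closest vector to g_new that is orthogonal to the task-specific
    gradients and has a non-negative inner product with the shared gradient."""
    shared, specific = decompose(old_grads)
    g = project_out(np.asarray(g_new, dtype=float), specific)
    s = project_out(shared, specific)  # shared direction inside the allowed subspace
    dot = g @ s
    if dot < 0 and s @ s > 0:
        g = g - (dot / (s @ s)) * s    # remove the conflicting component only
    return g

def layerwise_update(new_grads, old_grads):
    """Apply layer_update to each layer separately (dicts keyed by layer name)."""
    return {name: layer_update(new_grads[name], old_grads[name])
            for name in new_grads}

# Toy usage: one layer with four parameters and two old tasks.
old = {"fc": [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, -1.0, 0.0]]}
new = {"fc": [-2.0, 1.0, 3.0, 0.5]}
update = layerwise_update(new, old)
```

Because each layer is handled independently, a layer with large gradient norms cannot dominate the constraints of a layer with small ones, which is the motivation the abstract gives for layer-wise updates.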

updated: Mon May 17 2021 01:15:57 GMT+0000 (UTC)

published: Mon May 17 2021 01:15:57 GMT+0000 (UTC)

arXiv
