Peer Collaborative Learning for Online Knowledge Distillation

Guile Wu; Shaogang Gong

Traditional knowledge distillation uses a two-stage training strategy to transfer knowledge from a high-capacity teacher model to a compact student model, and therefore relies heavily on the pre-trained teacher. Recent online knowledge distillation methods alleviate this limitation through collaborative learning, mutual learning and online ensembling, following a one-stage end-to-end training scheme. However, collaborative learning and mutual learning fail to construct an online high-capacity teacher, whilst online ensembling ignores the collaboration among branches, and its logit summation impedes further optimisation of the ensemble teacher. In this work, we propose a novel Peer Collaborative Learning method for online knowledge distillation, which integrates online ensembling and network collaboration into a unified framework. Specifically, given a target network, we construct a multi-branch network for training, in which each branch is called a peer. We perform random augmentation multiple times on the inputs to the peers and assemble the feature representations output by the peers with an additional classifier, which serves as the peer ensemble teacher. This helps to transfer knowledge from a high-capacity teacher to the peers and, in turn, further optimises the ensemble teacher. Meanwhile, we employ the temporal mean model of each peer as the peer mean teacher to collaboratively transfer knowledge among peers, which helps each peer to learn richer knowledge and facilitates the optimisation of a more stable model with better generalisation. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet show that the proposed method significantly improves the generalisation of various backbone networks and outperforms state-of-the-art methods.
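The two teacher signals described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the loss or the averaging rule, so the temperature-softened KL divergence, the exponential moving average, the function names and the default values of `temperature` and `decay` are all assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=3.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: a standard distillation loss; the abstract does not
    specify which loss transfers knowledge from teacher to peer.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def update_mean_teacher(mean_weights, peer_weights, decay=0.999):
    """Temporal mean of a peer's weights, here an exponential moving average.

    Assumption: `decay` is a hypothetical hyperparameter, and weights are
    flattened into a plain list of floats for simplicity.
    """
    return [decay * m + (1.0 - decay) * w
            for m, w in zip(mean_weights, peer_weights)]


if __name__ == "__main__":
    ensemble_logits = [2.0, 0.5, -1.0]   # hypothetical peer ensemble teacher output
    peer_logits = [1.0, 1.0, 0.0]        # hypothetical output of one peer
    print(distillation_loss(ensemble_logits, peer_logits))

    mean_teacher = [0.0, 0.0]            # hypothetical flattened weights
    peer = [1.0, -2.0]
    print(update_mean_teacher(mean_teacher, peer, decay=0.9))
```

In a setup like this, each peer would be trained against both the ensemble teacher's logits and the logits of the other peers' mean teachers, while the mean teachers themselves receive no gradient and are only refreshed through the averaging step.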

updated: Wed Mar 03 2021 15:00:39 GMT+0000 (UTC)

published: Sun Jun 07 2020 13:21:52 GMT+0000 (UTC)

arXiv
