Rethinking Knowledge Distillation via Cross-Entropy

Zhendong Yang; Zhe Li; Yuan Gong; Tianke Zhang; Shanshan Lao; Chun Yuan; Yu Li

Knowledge Distillation (KD) has been developed extensively and has boosted performance on various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We decompose the KD loss to explore its relation to the CE loss. Surprisingly, we find it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss. However, we notice that the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the sums of the two probabilities differ, which makes optimization hard. To address this issue, we revise the formulation and propose a distributed loss. In addition, we use the teacher's target output as the soft target, proposing the soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth the student's target output and treat it as the soft target for training without a teacher, proposing a teacher-free new KD loss (tf-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96%. In training without a teacher, MobileNet, ResNet-18, and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than their baselines, respectively. The code is available at https://github.com/yzd-v/cls_KD.
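The decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' NKD implementation (see the linked repository for that); it only assumes the standard soft cross-entropy form of the KD loss, -sum_i T_i log(S_i), and splits it into a target term and a non-target term. The logits, the function names, and the renormalization helper are hypothetical and chosen for illustration. The sketch also shows that the teacher's and student's non-target probabilities sum to different values, and that dividing by each model's own non-target mass makes both sum to 1.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(teacher_probs, student_probs):
    """-sum_i T_i * log(S_i): the student-dependent part of KL(T || S)."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def split_by_target(teacher_probs, student_probs, target):
    """Split the soft cross-entropy into target and non-target terms."""
    target_term = -teacher_probs[target] * math.log(student_probs[target])
    non_target_term = -sum(
        t * math.log(s)
        for i, (t, s) in enumerate(zip(teacher_probs, student_probs))
        if i != target
    )
    return target_term, non_target_term

def normalize_non_target(probs, target):
    """Renormalize the non-target probabilities so they sum to 1."""
    rest = 1.0 - probs[target]
    return [p / rest for i, p in enumerate(probs) if i != target]

# Hypothetical logits for a 4-class problem; class 0 is the ground-truth label.
teacher_logits = [4.0, 1.0, 0.5, -1.0]
student_logits = [2.0, 1.5, 0.0, -0.5]
target = 0

T = softmax(teacher_logits)
S = softmax(student_logits)

target_term, non_target_term = split_by_target(T, S, target)

# Non-target probability mass differs between teacher and student.
teacher_rest = 1.0 - T[target]
student_rest = 1.0 - S[target]

# After renormalization both non-target distributions sum to 1.
T_hat = normalize_non_target(T, target)
S_hat = normalize_non_target(S, target)
```

The target term has the same shape as the CE loss, weighted by the teacher's target probability. The renormalization step is one simple way to put the two non-target distributions on the same scale; the exact form and weighting used in NKD should be taken from the paper and its code.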

updated: Mon Aug 22 2022 08:32:08 GMT+0000 (UTC)

published: Mon Aug 22 2022 08:32:08 GMT+0000 (UTC)

arXiv
