From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Zhendong Yang; Ailing Zeng; Zhe Li; Tianke Zhang; Chun Yuan; Yu Li

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to obtain the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss, and by introducing customized soft labels for both the target class (the image's category) and the non-target classes, a method named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find that its non-target term forces the student's non-target logits to match the teacher's, but the sums of the two sets of non-target logits differ, which prevents them from being identical. NKD normalizes the non-target logits to equalize their sums. It can be used for both KD and self-KD to make better use of the soft labels in the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the student's target logit to serve as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, yielding new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our code is available at https://github.com/yzd-v/cls_KD.
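The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). It assumes the normalization rescales the non-target softmax probabilities to sum to 1, uses a plain cross-entropy as the non-target term with no temperature or loss weights, and takes a hypothetical per-class score vector in place of the ranking derived from intermediate features. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_non_target(probs, target):
    """Drop the target class and rescale the remaining probabilities to sum to 1."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]


def non_target_loss(student_logits, teacher_logits, target):
    """Cross-entropy between the normalized non-target distributions.

    Assumption: no temperature and no weighting factor, unlike a full loss.
    """
    s = normalize_non_target(softmax(student_logits), target)
    t = normalize_non_target(softmax(teacher_logits), target)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))


def zipf_non_target_labels(scores, target):
    """Soft non-target labels proportional to 1/rank (Zipf's law).

    `scores` is a hypothetical per-class score vector; rank 1 goes to the
    highest-scoring non-target class. The target class gets label 0 here.
    """
    idx = [i for i in range(len(scores)) if i != target]
    order = sorted(idx, key=lambda i: scores[i], reverse=True)
    weights = {i: 1.0 / (rank + 1) for rank, i in enumerate(order)}
    total = sum(weights.values())
    return [0.0 if i == target else weights[i] / total for i in range(len(scores))]


# Same non-target logits, different target logit (class 0 is the target).
student = [2.0, 1.0, 0.5, -1.0]
teacher = [4.0, 1.0, 0.5, -1.0]
student_rest = 1.0 - softmax(student)[0]  # raw non-target mass of the student
teacher_rest = 1.0 - softmax(teacher)[0]  # raw non-target mass of the teacher
student_norm = normalize_non_target(softmax(student), 0)
teacher_norm = normalize_non_target(softmax(teacher), 0)
labels = zipf_non_target_labels([0.1, 0.9, 0.5, 0.3], target=0)
```

In this toy case the raw non-target probability mass of the student and the teacher differs because their target logits differ, so the raw non-target probabilities cannot match. After normalization the two non-target distributions coincide, which is the mismatch the abstract says NKD removes.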

updated: Mon Jul 17 2023 12:22:21 GMT+0000 (UTC)

published: Thu Mar 23 2023 02:59:36 GMT+0000 (UTC)

arXiv
