Avatar Knowledge Distillation: Self-ensemble Teacher Paradigm with Uncertainty

Yuan Zhang; Weihua Chen; Yichen Lu; Tao Huang; Xiuyu Sun; Jian Cao

Knowledge distillation is an effective paradigm for boosting the performance of pocket-size models, and when multiple teacher models are available, the student can push past its previous upper limit. However, training diverse teacher models solely for one-off distillation is not economical. In this paper, we introduce a new concept dubbed Avatars for distillation, which are inference ensemble models derived from the teacher. Concretely, (1) for each iteration of distillation training, various Avatars are generated by a perturbation transformation. We validate that Avatars have a higher upper limit of working capacity and teaching ability, helping the student model learn diverse and receptive knowledge perspectives from the teacher model. (2) During distillation, we propose an uncertainty-aware factor, derived from the variance of statistical differences between the vanilla teacher and the Avatars, to adaptively adjust the Avatars' contribution to knowledge transfer. Avatar Knowledge Distillation (AKD) is fundamentally different from existing methods and refines them with the innovative view of unequal training. Comprehensive experiments demonstrate the effectiveness of our Avatars mechanism, which improves state-of-the-art distillation methods for dense prediction without extra computational cost. AKD brings gains of up to 0.7 AP on COCO 2017 for object detection and 1.83 mIoU on Cityscapes for semantic segmentation.
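
The abstract does not spell out the perturbation transformation or the exact form of the uncertainty-aware factor, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a dropout-style perturbation of a flat teacher feature vector, takes the variance of the element-wise difference between each Avatar and the vanilla teacher as that Avatar's uncertainty, and maps it to a weight with a normalized `exp(-u)`. All function names (`make_avatar`, `uncertainty_weights`, `avatar_distillation_loss`) and the mean-squared-error loss are hypothetical choices made for illustration.

```python
import math
import random


def make_avatar(teacher_feat, drop_prob, rng):
    """Hypothetical perturbation: zero random entries and rescale the rest (dropout-style)."""
    keep = 1.0 - drop_prob
    return [x / keep if rng.random() < keep else 0.0 for x in teacher_feat]


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def uncertainty_weights(teacher_feat, avatars):
    """Assumed weighting: a lower variance of (avatar - teacher) yields a larger weight."""
    uncertainties = [
        variance([a - t for a, t in zip(avatar, teacher_feat)]) for avatar in avatars
    ]
    scores = [math.exp(-u) for u in uncertainties]
    total = sum(scores)
    return [s / total for s in scores]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def avatar_distillation_loss(student_feat, teacher_feat, avatars):
    """Teacher term plus an uncertainty-weighted sum of Avatar terms (illustrative only)."""
    weights = uncertainty_weights(teacher_feat, avatars)
    teacher_term = mse(student_feat, teacher_feat)
    avatar_term = sum(w * mse(student_feat, av) for w, av in zip(weights, avatars))
    return teacher_term + avatar_term


# Toy usage: fresh Avatars would be drawn like this at every training iteration.
rng = random.Random(0)
teacher = [rng.gauss(0.0, 1.0) for _ in range(16)]
student = [t + rng.gauss(0.0, 0.1) for t in teacher]
avatars = [make_avatar(teacher, drop_prob=0.1, rng=rng) for _ in range(3)]
weights = uncertainty_weights(teacher, avatars)
loss = avatar_distillation_loss(student, teacher, avatars)
```

In this sketch, Avatars that drift further from the vanilla teacher receive smaller weights, which is one plausible reading of adjusting their contribution adaptively. Because the Avatars are derived from a single trained teacher, no additional teacher models need to be trained.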

updated: Mon Jul 31 2023 14:43:33 GMT+0000 (UTC)

published: Thu May 04 2023 10:43:11 GMT+0000 (UTC)

arXiv
