Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation

Tianli Zhang; Mengqi Xue; Jiangtao Zhang; Haofei Zhang; Yu Wang; Lechao Cheng; Jie Song; Mingli Song

Most existing online knowledge distillation (OKD) techniques require sophisticated modules to produce diverse knowledge that improves students' generalization ability. In this paper, we instead exploit the multi-model setting itself, rather than specially designed modules, to achieve distillation with strong generalization performance. Model generalization is generally reflected in the flatness of the loss landscape. Since averaging the parameters of multiple models can find flatter minima, we extend this process to sampled convex combinations of the multiple student models in OKD. Specifically, by linearly weighting the students' parameters in each training batch, we construct a Hybrid-Weight Model (HWM) that represents the parameters surrounding the involved students. The supervision loss of the HWM estimates the curvature of the loss landscape over the whole region around the students, providing an explicit measure of generalization. We therefore integrate the HWM's loss into the students' training and propose a novel OKD framework via parameter hybridization (OKDPH) to promote flatter minima and obtain robust solutions. Because parameter redundancy could lead to the collapse of the HWM, we further introduce a fusion operation that keeps the students highly similar. Compared with state-of-the-art (SOTA) OKD methods and SOTA methods for seeking flat minima, OKDPH achieves higher performance with fewer parameters, giving OKD lightweight and robust characteristics. Our code is publicly available at https://github.com/tianlizhang/OKDPH.
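
The construction of the Hybrid-Weight Model described above, a sampled convex combination of the students' parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_convex_weights` and `hybridize`, the toy parameter dictionaries, and the choice of a Dirichlet draw for the combination weights are all assumptions made for illustration, since the abstract only states that the combinations are sampled and convex.

```python
import random


def sample_convex_weights(num_students, alpha=1.0, rng=random):
    """Sample positive weights that sum to 1.

    Assumption: a Dirichlet(alpha, ..., alpha) draw, obtained by
    normalizing independent Gamma variates. The abstract does not
    name the sampling distribution.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_students)]
    total = sum(draws)
    return [d / total for d in draws]


def hybridize(student_params, weights):
    """Return the weighted average of the students' parameters, name by name.

    student_params: list of dicts mapping parameter name -> list of floats.
    All students must share the same names and shapes (hypothetical format).
    """
    hybrid = {}
    for name in student_params[0]:
        # Group the i-th entry of this parameter across all students.
        columns = zip(*(params[name] for params in student_params))
        hybrid[name] = [
            sum(w * v for w, v in zip(weights, column)) for column in columns
        ]
    return hybrid


# Two toy "students" with identical architecture (same names and shapes).
students = [
    {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]},
    {"fc.weight": [3.0, 6.0], "fc.bias": [1.0]},
]

# One fresh draw per training batch yields a different point in the
# region spanned by the students' parameters.
weights = sample_convex_weights(len(students))
hwm = hybridize(students, weights)
```

In an actual training loop, the hybrid parameters would be loaded into a model of the same architecture, and its supervision loss would be added to the students' objectives; that step is omitted here because the abstract does not give its exact form.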

updated: Sun Mar 26 2023 09:40:55 GMT+0000 (UTC)

published: Sun Mar 26 2023 09:40:55 GMT+0000 (UTC)

arXiv
