Distilling Knowledge via Intermediate Classifiers

Aryan Asadian; Amirali Salehi-Abari

The crux of knowledge distillation is to effectively train a resource-limited student model under the guidance of a larger pre-trained teacher model. However, when the teacher and student differ greatly in model complexity (i.e., there is a capacity gap), knowledge distillation loses its strength in transferring knowledge from the teacher to the student and therefore trains a weaker student. To mitigate the impact of the capacity gap, we introduce knowledge distillation via intermediate heads. By extending the intermediate layers of the teacher (at various depths) with classifier heads, we cheaply acquire a cohort of heterogeneous pre-trained teachers. The intermediate classifier heads can be trained jointly and efficiently while the backbone of the pre-trained teacher is kept frozen. The cohort of teachers (including the original teacher) co-teaches the student simultaneously. Our experiments on various teacher-student pairs and datasets show that the proposed approach outperforms the canonical knowledge distillation approach and its extensions.
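The co-teaching idea can be illustrated with a small sketch. The abstract does not give the loss, so everything below is an assumption: the student is trained with the usual temperature-softened distillation loss, and the KL terms from the cohort (the intermediate heads plus the original teacher) are simply averaged. The function names, the `temperature` and `alpha` parameters, and the equal weighting of teachers are hypothetical choices for illustration, not the authors' formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(logits, label):
    """Cross-entropy of the (untempered) prediction against a hard label."""
    return -math.log(softmax(logits)[label])


def cohort_distillation_loss(student_logits, cohort_logits, label,
                             temperature=4.0, alpha=0.5):
    """Hard-label loss plus the mean distillation loss over a teacher cohort.

    cohort_logits holds one logit vector per teacher: each intermediate
    classifier head and the original teacher's final output.
    Assumption: all teachers are weighted equally.
    """
    if not cohort_logits:
        raise ValueError("cohort_logits must contain at least one teacher")
    student_soft = softmax(student_logits, temperature)
    kd_terms = [
        kl_divergence(softmax(teacher, temperature), student_soft)
        for teacher in cohort_logits
    ]
    # The T^2 factor is the conventional rescaling for softened targets.
    kd = temperature ** 2 * sum(kd_terms) / len(kd_terms)
    return alpha * cross_entropy(student_logits, label) + (1 - alpha) * kd


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    # Two hypothetical intermediate heads followed by the final teacher output.
    cohort = [[1.0, 0.8, 0.1], [1.8, 0.4, -0.5], [3.0, 0.2, -2.0]]
    print(cohort_distillation_loss(student, cohort, label=0))
```

In a full training setup the cohort logits would come from classifier heads attached to a frozen teacher backbone, with only the heads trained beforehand; this sketch covers only the per-example loss once those logits are available.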

updated: Mon May 31 2021 13:20:57 GMT+0000 (UTC)

published: Sun Feb 28 2021 12:52:52 GMT+0000 (UTC)

arXiv
