BERT Learns to Teach: Knowledge Distillation with Meta Learning

Wangchunshu Zhou; Canwen Xu; Julian McAuley

We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods, in which the teacher model is fixed during training. We show that, within a meta-learning framework, the teacher network can learn to transfer knowledge to the student network more effectively (i.e., learn to teach) by using feedback from the performance of the distilled student network. Moreover, we introduce a pilot update mechanism that improves the alignment between the inner-learner and the meta-learner in meta-learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements over traditional KD algorithms and is less sensitive to student capacity and hyperparameter choices, facilitating the use of KD on different tasks and models.
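The abstract gives no equations, so the following is only a toy sketch of how a "learning to teach" loop with a pilot update might be arranged. It is not the authors' implementation. It assumes a scalar teacher `a*x` and a scalar student `w*x`, a squared-error distillation loss in place of the usual soft-label loss, and a single held-out "quiz" example for feedback. All names (`pilot_step`, `metadistil_step`, `train`) and the learning rates are hypothetical.

```python
def pilot_step(w, a, x, lr_s):
    """One student gradient step on the toy distillation loss (w*x - a*x)**2."""
    grad_w = 2.0 * (w - a) * x * x
    return w - lr_s * grad_w


def metadistil_step(w, a, train_x, quiz_x, quiz_y, lr_s=0.1, lr_t=0.1):
    """One combined teacher/student update (toy, scalar, hand-derived gradients)."""
    # 1) Pilot update: a trial student step that is used only for feedback.
    w_pilot = pilot_step(w, a, train_x, lr_s)

    # 2) Teacher update: differentiate the quiz loss (w_pilot*quiz_x - quiz_y)**2
    #    with respect to the teacher parameter a, through the pilot step.
    d_quiz_d_wpilot = 2.0 * (w_pilot * quiz_x - quiz_y) * quiz_x
    d_wpilot_d_a = lr_s * 2.0 * train_x * train_x
    a_new = a - lr_t * d_quiz_d_wpilot * d_wpilot_d_a

    # 3) Real student update: discard the pilot result and step from the
    #    original w, now distilling from the updated teacher.
    w_new = pilot_step(w, a_new, train_x, lr_s)
    return w_new, a_new


def train(steps, w=0.0, a=2.0, train_x=1.0, quiz_x=1.0, quiz_y=3.0):
    """Run the toy loop; the quiz target corresponds to y = 3*x (an assumption)."""
    for _ in range(steps):
        w, a = metadistil_step(w, a, train_x, quiz_x, quiz_y)
    return w, a


if __name__ == "__main__":
    print(train(500))
```

In this toy setting the teacher starts out imperfect (`a = 2.0` while the quiz data follow `y = 3*x`), and the feedback from the pilot student pulls both the teacher and the student toward 3. A real system would use neural networks and automatic differentiation rather than hand-derived gradients.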

updated: Thu Mar 03 2022 00:02:54 GMT+0000 (UTC)

published: Tue Jun 08 2021 17:59:03 GMT+0000 (UTC)

arXiv
