Densely Guided Knowledge Distillation using Multiple Teacher Assistants

Wonchul Son; Jaemin Na; Junyong Choi; Wonjun Hwang

With the success of deep neural networks, knowledge distillation, which guides the learning of a small student network from a large teacher network, is being actively studied for model compression and transfer learning. However, few studies have addressed the poor learning of the student network when the student and teacher model sizes differ significantly. In this paper, we propose densely guided knowledge distillation using multiple teacher assistants, which gradually decrease in model size to efficiently bridge the large gap between the teacher and student networks. To stimulate more efficient learning of the student network, each teacher assistant iteratively guides every smaller teacher assistant. Specifically, when teaching a smaller teacher assistant at the next step, the existing larger teacher assistants from the previous step are used along with the teacher network. Moreover, we design stochastic teaching in which, for each mini-batch, the teacher or some teacher assistants are randomly dropped. This acts as a regularizer that improves the efficiency of teaching the student network. Thus, the student can always learn salient distilled knowledge from multiple sources. We verified the effectiveness of the proposed method on classification tasks using CIFAR-10, CIFAR-100, and ImageNet. We also achieved significant performance improvements with various backbone architectures such as ResNet, WideResNet, and VGG.
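The abstract does not state the loss function, so the following is only a minimal sketch of how dense guidance and stochastic teaching might fit together, not the authors' implementation. It assumes the common temperature-softened KL distillation loss mixed with cross-entropy. The names `dense_schedule`, `select_guides`, and `dense_kd_loss`, the averaging over the kept guides, the rule that at least one guide is always kept, and the defaults for `temperature`, `lam`, and `drop_prob` are all illustrative assumptions.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def cross_entropy(logits, label):
    """Cross-entropy between the logits and a hard integer label."""
    return -log_softmax(logits)[label]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def dense_schedule(models):
    """Map each model to all larger models that guide it.

    `models` is ordered from largest (teacher) to smallest (student).
    """
    return {m: models[:i] for i, m in enumerate(models) if i > 0}


def select_guides(num_guides, drop_prob, rng):
    """Stochastic teaching: drop each guide independently per mini-batch.

    Assumption: if every guide is dropped, one is kept at random so the
    learner always receives some distillation signal.
    """
    kept = [i for i in range(num_guides) if rng.random() >= drop_prob]
    return kept or [rng.randrange(num_guides)]


def dense_kd_loss(student_logits, guide_logits, label,
                  temperature=4.0, lam=0.5, drop_prob=0.5, rng=None):
    """Cross-entropy plus KD terms averaged over the guides that were kept."""
    rng = rng or random.Random()
    kept = select_guides(len(guide_logits), drop_prob, rng)
    log_q = log_softmax(student_logits, temperature)
    kd = sum(
        kl_divergence(log_softmax(guide_logits[i], temperature), log_q)
        for i in kept
    ) / len(kept)
    # T^2 rescaling is the usual convention for softened KD losses (assumed).
    kd *= temperature ** 2
    return (1.0 - lam) * cross_entropy(student_logits, label) + lam * kd


if __name__ == "__main__":
    # The student is guided by the teacher and by every teacher assistant.
    print(dense_schedule(["teacher", "TA1", "TA2", "student"]))
    student = [2.0, 0.5, -1.0]
    guides = [[3.0, 0.0, -2.0], [2.5, 0.2, -1.5], [2.2, 0.4, -1.2]]
    print(dense_kd_loss(student, guides, label=0, rng=random.Random(0)))
```

In this sketch the same `dense_kd_loss` would be applied at every step of the schedule: each teacher assistant is trained against the guides listed for it, and the student is trained last against the teacher and all assistants.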

updated: Mon Aug 09 2021 05:48:48 GMT+0000 (UTC)

published: Fri Sep 18 2020 13:12:52 GMT+0000 (UTC)

arXiv
