DistPro: Searching A Fast Knowledge Distillation Process via Meta Optimization

Xueqing Deng; Dawei Sun; Shawn Newsam; Peng Wang

Recent knowledge distillation (KD) studies show that different manually designed schemes significantly affect the learned results. Yet automatically searching for an optimal distillation scheme has not been well explored in KD. In this paper, we propose DistPro, a novel framework that searches for an optimal KD process via differentiable meta-learning. Specifically, given a pair of student and teacher networks, DistPro first sets up a rich set of KD connections from the transmitting layers of the teacher to the receiving layers of the student. Meanwhile, various transforms are proposed for comparing feature maps along each distillation pathway. Each combination of a connection and a transform choice (a pathway) is then associated with a stochastic weighting process that indicates its importance at every step of the distillation. In the searching stage, this process can be learned effectively through our proposed bi-level meta-optimization strategy. In the distillation stage, DistPro adopts the learned processes for knowledge distillation, which significantly improves student accuracy, especially when faster training is required. Lastly, we find that the learned processes generalize between similar tasks and networks. In our experiments, DistPro produces state-of-the-art (SoTA) accuracy under varying numbers of learning epochs on the popular datasets CIFAR100 and ImageNet, which demonstrates the effectiveness of our framework.
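As a rough illustration of the pathway idea described in the abstract, the sketch below enumerates every (teacher layer, student layer, transform) triple and combines per-pathway losses using a set of importance weights for a single training step. It is not the authors' implementation: the layer and transform names are hypothetical, the softmax normalization of the weights is an assumption, and the per-pathway losses are placeholder numbers rather than real feature-map distances. The bi-level search that would actually learn the weights is not shown.

```python
import math
from itertools import product

# Hypothetical names: the abstract does not name specific layers or transforms.
TEACHER_LAYERS = ["t1", "t2"]
STUDENT_LAYERS = ["s1", "s2"]
TRANSFORMS = ["identity", "channel_mean"]


def enumerate_pathways(teacher_layers, student_layers, transforms):
    """Return every (teacher layer, student layer, transform) triple.

    Each triple stands for one candidate distillation pathway.
    """
    return list(product(teacher_layers, student_layers, transforms))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_kd_loss(pathway_losses, logits):
    """Combine per-pathway losses with importance weights for one step.

    Assumption: weights come from a softmax over one logit per pathway.
    In a step-dependent process, a different logit vector would be
    supplied at each training step.
    """
    weights = softmax(logits)
    return sum(w * loss for w, loss in zip(weights, pathway_losses))


pathways = enumerate_pathways(TEACHER_LAYERS, STUDENT_LAYERS, TRANSFORMS)

# Placeholder losses, one per pathway; a real system would compute a
# distance between transformed teacher and student feature maps.
example_losses = [float(i) for i in range(len(pathways))]

# Equal logits give every pathway the same importance.
uniform_logits = [0.0] * len(pathways)
total_loss = weighted_kd_loss(example_losses, uniform_logits)
```

With equal logits every pathway contributes equally, so the combined loss is simply the mean of the per-pathway losses. Raising one pathway's logit shifts weight toward it, which is the kind of knob a search stage could adjust.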

updated: Tue Apr 12 2022 06:22:24 GMT+0000 (UTC)

published: Tue Apr 12 2022 06:22:24 GMT+0000 (UTC)

arXiv

