Embedded Knowledge Distillation in Depth-level Dynamic Neural Network

Shuchang Lyu; Ting-Bing Xu; Guangliang Cheng

In real applications, devices with different computational resources need networks of different depths (e.g., ResNet-18/34/50) with high accuracy. Existing strategies usually either design multiple networks (nets) and train them independently, or use compression techniques (e.g., low-rank decomposition, pruning, and teacher-to-student) to evolve a trained large model into a small net. These methods suffer from either the low accuracy of the small nets or complicated training processes caused by the dependence on an accompanying large model. In this article, we propose an elegant Depth-level Dynamic Neural Network (DDNN) that integrates sub-nets of different depths with similar architectures. Instead of training individual nets with different depth configurations, we train only one DDNN, which dynamically switches between sub-nets of different depths at runtime using one set of shared weight parameters. To improve the generalization of the sub-nets, we design the Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN to transfer semantic knowledge from the teacher (full) net to multiple sub-nets. Specifically, the Kullback-Leibler divergence is introduced to constrain the consistency of posterior class probabilities between the full-net and each sub-net, and self-attention on same-resolution features at different depths is used to drive richer feature representations in the sub-nets. Thus, we obtain multiple high-accuracy sub-nets simultaneously in a DDNN via online knowledge distillation in each training iteration, without extra computation cost. Extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that sub-nets in a DDNN with EKD training achieve better performance than depth-level pruning or individual training while preserving the original performance of the full-net.
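The posterior-consistency term described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the `kd_weight` parameter, the direction of the divergence (full-net as the reference distribution), and the combination with a cross-entropy term are all assumptions made for illustration, and the self-attention feature term is omitted.

```python
import math


def softmax(logits):
    """Convert raw logits to class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label] + eps)


def distillation_loss(full_logits, sub_logits, label, kd_weight=1.0):
    """Hypothetical sub-net loss: supervised term plus a KL consistency term.

    Assumption: the full-net posterior is treated as the reference (teacher)
    distribution, and kd_weight is an illustrative balancing factor.
    """
    p_full = softmax(full_logits)
    p_sub = softmax(sub_logits)
    return cross_entropy(p_sub, label) + kd_weight * kl_divergence(p_full, p_sub)


if __name__ == "__main__":
    full = [2.0, 0.5, -1.0]  # made-up logits from the full-net
    sub = [1.2, 0.9, -0.3]   # made-up logits from a shallower sub-net
    print(distillation_loss(full, sub, label=0))
```

In this sketch the KL term vanishes when the sub-net reproduces the full-net posterior exactly, so the loss then reduces to the ordinary supervised term.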

updated: Mon Mar 01 2021 06:35:31 GMT+0000 (UTC)

published: Mon Mar 01 2021 06:35:31 GMT+0000 (UTC)

arXiv
