Representative Teacher Keys for Knowledge Distillation Model Compression Based on Attention Mechanism for Image Classification

Jun-Teng Yang; Sheng-Che Kao; Scott C.-H. Huang

With advances in AI chips (e.g., GPUs, TPUs, and NPUs) and the rapid development of the Internet of Things (IoT), many robust deep neural networks (DNNs) now contain millions or even hundreds of millions of parameters. Such large models may not be suitable for direct deployment on low-computation, low-capacity units (e.g., edge devices). Knowledge distillation (KD) has recently been recognized as a powerful model compression method that effectively reduces the number of model parameters. The central concept of KD is to extract useful information from the feature maps of a large model (the teacher model) and use it as a reference for training a much smaller model (the student model). Although many KD methods have been proposed to exploit information from the feature maps of intermediate layers in the teacher model, most do not consider the similarity between the feature maps of the teacher and student models. As a result, the student model may learn useless information. Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK) that not only considers the similarity of feature maps but also filters out useless information to improve the performance of the target student model. In the experiments, we validate the proposed method with several backbone networks (e.g., ResNet and WideResNet) and datasets (e.g., CIFAR10, CIFAR100, SVHN, and CINIC10). The results show that RTK effectively improves the classification accuracy of the state-of-the-art attention-based KD method.
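To make the idea of similarity-aware feature distillation concrete, here is a minimal sketch in plain Python. It is not the RTK formulation, which the abstract does not specify; it only illustrates the general attention-style pattern the abstract alludes to. The function name `attention_weighted_distill_loss`, the use of pooled feature vectors, the scaled dot-product similarity, and the mean-squared-error distance are all assumptions made for illustration.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def attention_weighted_distill_loss(student_feat, teacher_feats):
    """Hypothetical similarity-weighted feature distillation loss.

    student_feat:  pooled feature vector from one student layer.
    teacher_feats: pooled feature vectors, one per teacher layer,
                   each with the same length as student_feat (assumed).
    """
    scale = math.sqrt(len(student_feat))
    # Similarity between the student feature and each teacher feature.
    scores = [dot(student_feat, t) / scale for t in teacher_feats]
    # Attention weights: dissimilar teacher features get less weight.
    weights = softmax(scores)
    # Weighted distance the student would be trained to minimize.
    loss = sum(w * mse(student_feat, t) for w, t in zip(weights, teacher_feats))
    return loss, weights


# Toy inputs (made up for illustration only).
student = [1.0, 0.0, 1.0, 0.0]
teachers = [
    [1.0, 0.0, 1.0, 0.0],  # identical to the student feature
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the student feature
    [0.9, 0.1, 1.1, 0.0],  # close to the student feature
]
loss, weights = attention_weighted_distill_loss(student, teachers)
```

In this toy setup the orthogonal teacher feature receives the smallest weight, so it contributes least to the loss. A real implementation would operate on tensors inside a deep learning framework and would typically project student and teacher features to a common dimension first.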

updated: Thu Oct 20 2022 05:35:23 GMT+0000 (UTC)

published: Sun Jun 26 2022 05:08:50 GMT+0000 (UTC)

arXiv
