Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones

Cheng Cui; Ruoyu Guo; Yuning Du; Dongliang He; Fu Li; Zewu Wu; Qiwen Liu; Shilei Wen; Jizhou Huang; Xiaoguang Hu; Dianhai Yu; Errui Ding; Yanjun Ma

Recent research has concentrated on revealing how pre-trained models make a difference in neural network performance. Self-supervised and semi-supervised learning techniques have been extensively explored by the community and have shown great potential for obtaining powerful pre-trained models. However, these models incur huge training costs (i.e., hundreds of millions of images or training iterations). In this paper, we propose to improve existing baseline networks via knowledge distillation from large, powerful, off-the-shelf pre-trained models. Unlike existing knowledge distillation frameworks, which require the student model to be consistent with both the soft labels generated by the teacher model and the hard labels annotated by humans, our solution performs distillation by only driving the student model's predictions to be consistent with those of the teacher model. Therefore, our distillation setting does not need manually labeled data and can be trained with extra unlabeled data to fully exploit the capability of the teacher model for better learning. We empirically find that this simple distillation setting is extremely effective. For example, the top-1 accuracy on the ImageNet-1k validation set of MobileNetV3-large and ResNet50-D is significantly improved from 75.2% to 79% and from 79.1% to 83%, respectively. We have also thoroughly analyzed the dominant factors that affect distillation performance and how they make a difference. A wide range of downstream computer vision tasks, including transfer learning, object detection, and semantic segmentation, can benefit significantly from the distilled pre-trained models. All our experiments are implemented in PaddlePaddle; the code and a series of improved pre-trained models with the ssld suffix are available in PaddleClas.
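The soft-label-only setting described in the abstract can be illustrated with a minimal sketch. The abstract does not state which loss is used, so the cross-entropy between teacher and student softmax outputs below is an assumption, and the function names (`log_softmax`, `soft_label_loss`, `batch_distill_loss`) are hypothetical. The point it shows is that no human-annotated label enters the objective, which is why unlabeled images can be used for training.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def soft_label_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft labels.

    Assumed loss form (not taken from the abstract). Note that there is
    no hard-label term: only the teacher's prediction supervises the student.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def batch_distill_loss(student_batch, teacher_batch):
    """Mean soft-label loss over a batch of (possibly unlabeled) samples."""
    losses = [soft_label_loss(s, t) for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(batch_distill_loss(student, teacher))
```

With this loss, a student whose logits match the teacher's reaches the minimum value (the entropy of the teacher's distribution), and any other student prediction scores higher. In a real training loop the logits would come from the frozen teacher network and the trainable student network evaluated on the same image.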

updated: Wed Mar 10 2021 09:32:44 GMT+0000 (UTC)

published: Wed Mar 10 2021 09:32:44 GMT+0000 (UTC)

arXiv
