Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network

Bin Dong; Jikai Hou; Yiping Lu; Zhihua Zhang

Distillation is a method for transferring knowledge from one model to another, and it often achieves higher accuracy at the same capacity. In this paper, we aim to provide a theoretical understanding of what mainly helps distillation. Our answer is "early stopping". Assuming that the teacher network is overparameterized, we argue that the teacher network is essentially harvesting dark knowledge from the data via early stopping. This can be justified by a new concept, Anisotropic Information Retrieval (AIR), which means that the neural network tends to fit the informative information first and the non-informative information (including noise) later. Motivated by recent developments in the theoretical analysis of overparameterized neural networks, we characterize AIR by the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new understanding of distillation. With that, we further utilize distillation to refine noisy labels. We propose a self-distillation algorithm that sequentially distills knowledge from the network in the previous training epoch to avoid memorizing the wrong labels. We also demonstrate, both theoretically and empirically, that self-distillation can benefit from more than just early stopping. Theoretically, we prove convergence of the proposed algorithm to the ground truth labels for randomly initialized overparameterized neural networks in terms of $\ell_2$ distance, whereas the previous result was on convergence in $0$-$1$ loss. The theoretical result ensures that the learned neural network enjoys a margin on the training data, which leads to better generalization. Empirically, we achieve better testing accuracy and entirely avoid early stopping, which makes the algorithm more user-friendly.
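The link between AIR and kernel eigenspaces can be illustrated with a small numerical sketch. It is not the paper's algorithm or analysis: it assumes a fixed positive semidefinite kernel matrix as a stand-in for the NTK, squared loss, and plain gradient descent on the predictions. Under those assumptions the residual along an eigenvector with eigenvalue $\lambda_i$ shrinks by a factor $(1 - \eta\lambda_i)$ per step, so large-eigenvalue directions are fit first and small-eigenvalue directions much later. The toy data, the function name `gd_residual_in_eigenbasis`, and all parameter values are hypothetical.

```python
import numpy as np


def gd_residual_in_eigenbasis(kernel, labels, lr, steps):
    """Gradient descent on predictions under a FIXED kernel (assumption).

    Update rule: u <- u - lr * K (u - y), starting from u = 0.
    Returns the eigenvalues of K and the initial and final residuals
    (u - y) expressed in the eigenbasis of K.
    """
    eigvals, eigvecs = np.linalg.eigh(kernel)  # eigenvalues in ascending order
    u = np.zeros_like(labels)
    initial = eigvecs.T @ (u - labels)
    for _ in range(steps):
        u = u - lr * (kernel @ (u - labels))
    final = eigvecs.T @ (u - labels)
    return eigvals, initial, final


# Hypothetical toy data: 20 samples, rank-5 Gram matrix as the stand-in kernel.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))
kernel = features @ features.T / 5.0
labels = rng.normal(size=20)

# Step size chosen so every factor 1 - lr * lambda lies between about 0.5 and 1.
lr = 0.5 / np.linalg.eigvalsh(kernel).max()
steps = 50
eigvals, initial, final = gd_residual_in_eigenbasis(kernel, labels, lr, steps)

# Fraction of the residual that survives along each eigen-direction.
surviving = np.abs(final) / np.abs(initial)
print("smallest eigenvalue direction:", surviving[0])
print("largest eigenvalue direction: ", surviving[-1])
```

In this linearized picture, stopping early keeps what has been fit along the large-eigenvalue directions while leaving the small-eigenvalue directions, where the abstract suggests noise tends to sit, largely unfit.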

updated: Wed Oct 02 2019 23:53:39 GMT+0000 (UTC)

published: Wed Oct 02 2019 23:53:39 GMT+0000 (UTC)

arXiv
