Initialization and Regularization of Factorized Neural Layers

Mikhail Khodak; Neil Tenenholtz; Lester Mackey; Nicolò Fusi

Factorized layers (operations parameterized by products of two or more matrices) occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to those of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight decay and batch normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.
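The abstract names the two schemes without defining them, so the following is only a minimal NumPy sketch under stated assumptions. It assumes that spectral initialization means taking the truncated SVD of a conventionally initialized full matrix and splitting the singular values evenly between the two factors, and that Frobenius decay means penalizing the squared Frobenius norm of the product of the factors rather than of each factor separately. The function names, the He-style scale of the full matrix, and the layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np


def spectral_init(full_weight, rank):
    """Split a full matrix W (m x n) into factors U (m x r) and V (n x r).

    Assumed scheme: U @ V.T equals the rank-r truncated SVD of W, with the
    square root of each singular value assigned to both factors.
    """
    left, singular_values, right_t = np.linalg.svd(full_weight, full_matrices=False)
    root = np.sqrt(singular_values[:rank])
    U = left[:, :rank] * root        # scale each kept left singular vector
    V = right_t[:rank, :].T * root   # scale each kept right singular vector
    return U, V


def frobenius_decay(U, V):
    """Assumed penalty on the product: 0.5 * ||U V^T||_F^2."""
    return 0.5 * np.sum((U @ V.T) ** 2)


def weight_decay(U, V):
    """Ordinary weight decay on the factors: 0.5 * (||U||_F^2 + ||V||_F^2)."""
    return 0.5 * (np.sum(U ** 2) + np.sum(V ** 2))


# Hypothetical layer sizes and a He-style full initialization (assumptions).
rng = np.random.default_rng(0)
m, n, r = 64, 32, 8
W = rng.normal(0.0, np.sqrt(2.0 / n), size=(m, n))

U, V = spectral_init(W, r)
sigma = np.linalg.svd(W, compute_uv=False)  # singular values, descending

print("approximation error:", np.linalg.norm(W - U @ V.T))
print("Frobenius decay:    ", frobenius_decay(U, V))
print("weight decay:       ", weight_decay(U, V))
```

Under these assumptions the two penalties differ at initialization: the Frobenius penalty equals half the sum of the squared retained singular values, while ordinary weight decay on the factors equals the sum of the retained singular values. This illustrates why regularizing the product is closer to applying weight decay to the non-decomposed layer, which is the guiding insight the abstract describes.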

updated: Tue Oct 04 2022 19:00:39 GMT+0000 (UTC)

published: Mon May 03 2021 17:28:07 GMT+0000 (UTC)

arXiv
