Knowledge Distillation for Efficient Sequences of Training Runs

Xingyu Liu; Alex Leonardi; Lu Yu; Chris Gilmer-Hill; Matthew Leavitt; Jonathan Frankle

In many practical scenarios, such as hyperparameter search or continual retraining with new data, related training runs are performed many times in sequence. Current practice is to train each of these models independently from scratch. We study the problem of exploiting the computation invested in previous runs to reduce the cost of future runs using knowledge distillation (KD). We find that augmenting future runs with KD from previous runs dramatically reduces the time necessary to train these models, even after accounting for the overhead of KD. We improve on these results with two strategies that reduce the overhead of KD by 80-90% with minimal effect on accuracy, yielding large Pareto improvements in overall cost. We conclude that KD is a promising avenue for reducing the cost of the expensive preparatory work that precedes training final models in practice.
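
To make "augmenting a run with KD from a previous run" concrete, the following is a minimal sketch of a commonly used distillation objective: a weighted sum of the usual hard-label cross-entropy and a temperature-softened KL term toward the previous run's model (the teacher). The abstract does not specify the exact loss, so the function name `kd_loss`, the mixing weight `alpha`, and the `temperature` value are illustrative assumptions rather than the authors' configuration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Hypothetical KD objective for one example (assumed form, not from the paper).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence from the teacher's softened distribution to the student's.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label])

    # Soft-label term: KL(teacher || student), both softened by the temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_soft, student_soft)
        if t > 0.0
    )

    # The temperature**2 factor is the conventional rescaling that keeps the
    # soft term's gradient magnitude comparable across temperatures.
    return alpha * cross_entropy + (1.0 - alpha) * temperature ** 2 * kl


# Example: the teacher is a model from an earlier run in the sequence.
loss = kd_loss([0.2, 1.5, -0.3], [0.1, 2.0, -1.0], label=1)
```

The extra cost in this sketch is the teacher forward pass needed to obtain `teacher_logits` at each step, which is the kind of KD overhead the abstract says can be reduced by 80-90%.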

updated: Sat Mar 11 2023 19:03:42 GMT+0000 (UTC)

published: Sat Mar 11 2023 19:03:42 GMT+0000 (UTC)

arXiv
