Mesa: A Memory-saving Training Framework for Transformers

Zizheng Pan; Peng Chen; Haoyu He; Jing Liu; Jianfei Cai; Bohan Zhuang

Interest in designing high-performance Transformers has exploded. Although Transformers have delivered significant performance improvements, training them is extremely memory intensive, especially for long sequences, because all intermediate activations needed for gradient computation during backpropagation must be stored. To this end, we present Mesa, a memory-saving, resource-efficient training framework for Transformers. Specifically, Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training. The low-precision activations are then dequantized during backpropagation to compute gradients. In addition, to address the heterogeneous activation distributions in multi-head self-attention layers, we propose a head-wise activation quantization strategy, which quantizes activations based on the statistics of each head to minimize the approximation error. To further boost training efficiency, we learn the quantization parameters using running estimates. More importantly, re-investing the saved memory in a larger batch size or a larger model may further improve performance under constrained computational resources. Extensive experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training while achieving comparable or even better performance. Code is available at https://github.com/zhuang-group/Mesa
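
The head-wise quantization idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the Mesa implementation: the function names (`quantize_head`, `dequantize_head`, `quantize_headwise`, `ema_update`), the 8-bit width, the min/max range statistics, round-to-nearest, and the exponential-moving-average form of the running estimate are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical container: (integer codes, scale, offset) for one attention head.
Packed = Tuple[List[int], float, float]


def quantize_head(values: List[float], bits: int = 8) -> Packed:
    """Uniformly quantize one head's activations using that head's own min/max.

    Assumption: asymmetric min/max quantization with round-to-nearest.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant head, where hi == lo would give a zero scale.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_head(codes: List[int], scale: float, offset: float) -> List[float]:
    """Recover approximate activations, as would be done in the backward pass."""
    return [c * scale + offset for c in codes]


def quantize_headwise(activations: List[List[float]], bits: int = 8) -> List[Packed]:
    """Quantize each head separately so a wide-range head does not
    dominate the quantization grid of a narrow-range head."""
    return [quantize_head(head, bits) for head in activations]


def ema_update(running: float, observed: float, momentum: float = 0.9) -> float:
    """Running estimate of a quantization parameter.

    Assumption: an exponential moving average with a hypothetical momentum.
    """
    return momentum * running + (1.0 - momentum) * observed


# Two heads with very different activation ranges.
heads = [[-1.0, 0.0, 1.0], [0.0, 50.0, 100.0]]
packed = quantize_headwise(heads)
restored = [dequantize_head(*p) for p in packed]
```

With per-head statistics, the reconstruction error of each head is bounded by half of that head's own step size, whereas a single shared range would force the narrow head onto the coarse grid of the wide one.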

updated: Mon Nov 22 2021 11:23:01 GMT+0000 (UTC)

published: Mon Nov 22 2021 11:23:01 GMT+0000 (UTC)

arXiv
