Efficient Strong Scaling Through Burst Parallel Training

Seo Jin Park; Joshua Fried; Sunghyun Kim; Mohammad Alizadeh; Adam Belay

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming essential for achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2-2.3x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
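The batch-size arithmetic behind "strong scaling" can be sketched in a few lines. This is only an illustration of the definition given in the abstract, not DeepPool code: the function names and the global batch size of 1024 are assumptions chosen for the example. It contrasts strong scaling (global batch fixed, per-GPU batch shrinks) with weak scaling (per-GPU batch fixed, global batch grows).

```python
def strong_scaling_per_gpu_batch(global_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size when the global batch size is held constant."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch % num_gpus != 0:
        raise ValueError("global_batch must divide evenly across GPUs")
    return global_batch // num_gpus


def weak_scaling_global_batch(per_gpu_batch: int, num_gpus: int) -> int:
    """Global batch size when the per-GPU batch size is held constant."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return per_gpu_batch * num_gpus


if __name__ == "__main__":
    GLOBAL_BATCH = 1024  # hypothetical value, not taken from the paper
    for gpus in (8, 64, 512):
        per_gpu = strong_scaling_per_gpu_batch(GLOBAL_BATCH, gpus)
        print(f"{gpus:4d} GPUs -> {per_gpu:4d} samples per GPU")
```

With these assumed numbers, going from 8 to 512 GPUs shrinks each GPU's share from 128 samples to 2, which suggests why keeping every GPU busy becomes harder as the cluster grows.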

updated: Mon May 23 2022 20:51:22 GMT+0000 (UTC)

published: Sun Dec 19 2021 05:18:39 GMT+0000 (UTC)

arXiv
