Accelerating Distributed ML Training via Selective Synchronization

Sahil Tyagi; Martin Swany

In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently, and in bulk-synchronous parallel (BSP) training the workers aggregate their local updates at every step. However, BSP does not scale out linearly because of the high communication cost of aggregation. To mitigate this overhead, alternatives such as Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce the synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present SelSync, a practical, low-overhead method for DNN training that dynamically decides at each step whether to incur or avoid communication: it either calls the aggregation op or applies the local updates, depending on their significance. We propose various optimizations as part of SelSync to improve convergence in the context of semi-synchronous training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14×.
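The abstract describes the per-step choice between aggregating and applying local updates only at a high level. The following is a minimal single-process Python sketch of that idea under stated assumptions, not the authors' implementation. Several parts are assumptions: the significance test (relative change of the squared gradient norm against an exponentially weighted moving average, compared with a threshold `delta`), the rule that one significant worker makes every worker synchronize, and the averaging of both models and gradients on synchronized steps. `SignificanceTracker`, `selective_step`, and `mean` are hypothetical names. `mean` stands in for a real allreduce.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def sq_norm(v: Vector) -> float:
    """Squared L2 norm of a flat gradient vector."""
    return sum(x * x for x in v)


class SignificanceTracker:
    """Hypothetical per-worker tracker of how much the local gradient changed.

    Assumed criterion: relative change of ||g||^2 versus an EWMA of past values.
    """

    def __init__(self, alpha: float = 0.9) -> None:
        self.alpha = alpha                  # EWMA smoothing factor (assumed)
        self.ewma: Optional[float] = None   # running average of ||g||^2

    def relative_change(self, grad: Vector) -> float:
        cur = sq_norm(grad)
        if self.ewma is None:
            self.ewma = cur
            return math.inf                 # no history yet: treat as significant
        change = abs(cur - self.ewma) / max(self.ewma, 1e-12)
        self.ewma = self.alpha * self.ewma + (1 - self.alpha) * cur
        return change


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean; a stand-in for an allreduce across workers."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def selective_step(params: List[Vector], grads: List[Vector],
                   trackers: List[SignificanceTracker],
                   delta: float, lr: float) -> Tuple[List[Vector], bool]:
    """One SGD step for all simulated workers.

    Synchronize if any worker's update is significant (score > delta);
    otherwise each worker applies only its own local gradient.
    """
    scores = [t.relative_change(g) for t, g in zip(trackers, grads)]
    sync = any(s > delta for s in scores)  # assumed: agreed via a tiny flag collective
    if sync:
        # Aggregate: average models and gradients so all replicas agree again.
        p, g = mean(params), mean(grads)
        shared = [pi - lr * gi for pi, gi in zip(p, g)]
        return [list(shared) for _ in params], True
    # Skip communication: plain local SGD update on every worker.
    local = [[pi - lr * gi for pi, gi in zip(w, gw)]
             for w, gw in zip(params, grads)]
    return local, False


# Usage: two simulated workers, three steps of hand-picked gradients.
params = [[0.0, 0.0] for _ in range(2)]
trackers = [SignificanceTracker() for _ in range(2)]
steps = [
    [[1.0, 0.0], [0.0, 1.0]],   # step 1: no history -> synchronize
    [[1.0, 0.0], [0.0, 1.0]],   # step 2: norms unchanged -> local updates
    [[3.0, 0.0], [0.0, 1.0]],   # step 3: worker 0's norm jumps -> synchronize
]
flags = []
for grads in steps:
    params, synced = selective_step(params, grads, trackers, delta=0.5, lr=0.1)
    flags.append(synced)
print(flags)  # [True, False, True]
```

Letting any single significant worker trigger synchronization keeps the decision identical on all workers, which matters because a collective such as allreduce must be called by every participant. In a real deployment that agreement would itself be a cheap collective over one flag, far smaller than the gradient aggregation it gates.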

updated: Sun Jul 16 2023 05:28:59 GMT+0000 (UTC)

published: Sun Jul 16 2023 05:28:59 GMT+0000 (UTC)

arXiv