DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

Yihao Chen; Xianbiao Qi; Jianan Wang; Lei Zhang

We propose DisCo-CLIP, a distributed, memory-efficient CLIP training approach that reduces the memory consumption of the contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one that calculates the intra-GPU gradients and one that computes the inter-GPU gradients. Under this decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected from the other GPUs via all_reduce instead of being recomputed on every GPU. In this way, we reduce the GPU memory consumption of the contrastive loss computation from $\mathcal{O}(B^2)$ to $\mathcal{O}(B^2/N)$, where $B$ and $N$ are the batch size and the number of GPUs used for training. This distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, so it sacrifices no computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP enables contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, whereas the original CLIP solution requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code will be released at https://github.com/IDEA-Research/DisCo-CLIP
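
The decomposition described above can be illustrated with a small single-process simulation. The following is a minimal NumPy sketch, not the released implementation: it assumes the standard symmetric CLIP loss with a temperature `tau`, simulates the `n_gpus` devices with a loop, and stands in for all_reduce with a plain sum. All function and variable names (`full_loss_and_grads`, `disco_loss_and_grads`, `softmax_rows`) are hypothetical. Each simulated GPU only forms two (B/N) x B similarity blocks instead of the full B x B matrix, and the summed gradients match the non-distributed ones.

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable softmax over each row.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def full_loss_and_grads(img, txt, tau):
    # Reference: non-distributed loss using the full B x B similarity matrix.
    B = img.shape[0]
    idx = np.arange(B)
    S = img @ txt.T / tau
    p_i2t = softmax_rows(S)        # image -> text probabilities
    p_t2i = softmax_rows(S.T)      # text -> image probabilities
    loss = -(np.log(p_i2t[idx, idx]).sum()
             + np.log(p_t2i[idx, idx]).sum()) / (2 * B)
    g_i2t = (p_i2t - np.eye(B)) / (2 * B)   # gradient w.r.t. S
    g_t2i = (p_t2i - np.eye(B)) / (2 * B)   # gradient w.r.t. S^T
    d_img = (g_i2t @ txt + g_t2i.T @ txt) / tau
    d_txt = (g_i2t.T @ img + g_t2i @ img) / tau
    return loss, d_img, d_txt

def disco_loss_and_grads(img, txt, tau, n_gpus):
    # Simulated decomposition: "GPU" n only forms (B/N) x B blocks.
    # Assumes B is divisible by n_gpus and features are already all-gathered.
    B = img.shape[0]
    b = B // n_gpus
    local = np.arange(b)
    loss = 0.0
    d_img = np.zeros_like(img)
    d_txt = np.zeros_like(txt)
    for n in range(n_gpus):
        rows = slice(n * b, (n + 1) * b)
        cols = n * b + local                     # positions of the positives
        target = np.zeros((b, B))
        target[local, cols] = 1.0
        p_a = softmax_rows(img[rows] @ txt.T / tau)   # local images vs all texts
        p_c = softmax_rows(txt[rows] @ img.T / tau)   # local texts vs all images
        loss += -(np.log(p_a[local, cols]).sum()
                  + np.log(p_c[local, cols]).sum()) / (2 * B)
        g_a = (p_a - target) / (2 * B)
        g_c = (p_c - target) / (2 * B)
        # Gradient of this GPU's loss terms w.r.t. its own features.
        d_img[rows] += g_a @ txt / tau
        d_txt[rows] += g_c @ img / tau
        # Gradient of this GPU's loss terms w.r.t. every gathered feature;
        # the "+=" across loop iterations plays the role of all_reduce.
        d_img += g_c.T @ txt[rows] / tau
        d_txt += g_a.T @ img[rows] / tau
    return loss, d_img, d_txt

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 4))
txt = rng.normal(size=(8, 4))
ref = full_loss_and_grads(img, txt, tau=0.5)
dis = disco_loss_and_grads(img, txt, tau=0.5, n_gpus=4)
print("max |d_img diff|:", np.abs(ref[1] - dis[1]).max())
print("max |d_txt diff|:", np.abs(ref[2] - dis[2]).max())
```

In an actual multi-GPU run the loop body would execute on each device in parallel, so the largest temporary held per device is of size (B/N) x B rather than B x B, which is where the $\mathcal{O}(B^2/N)$ figure comes from.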

updated: Mon Apr 17 2023 17:58:21 GMT+0000 (UTC)

published: Mon Apr 17 2023 17:58:21 GMT+0000 (UTC)

arXiv
