Distribution Normalization: An "Effortless" Test-Time Augmentation for Contrastively Learned Visual-language Models

Yifei Zhou; Juntao Ren; Fengyu Li; Ramin Zabih; Ser-Nam Lim

Advances in visual-language contrastive learning have made it possible to carry out many downstream applications efficiently and accurately by simply taking the dot product between image and text representations. CLIP, one of the most representative recently proposed approaches, has quickly gained widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes both positive and negative samples into account, which helps it learn a much more robust representation space. This paper, however, reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should ideally be aligned with it. The question is how to recover any semblance of negative-sample information during inference. We propose Distribution Normalization (DN), in which we approximate the mean representation of a batch of test samples and use that mean to stand in for the negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be applied effortlessly during inference. Extensive experiments on a wide variety of downstream tasks show a clear advantage of DN over the dot product.
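The idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact scoring formula, so the code below assumes one plausible reading: estimate a mean embedding per modality from a batch of test samples, subtract a scaled copy of it from each embedding, and then take the dot product as usual. The function names, the random stand-in embeddings, and the scaling factor `lam` (with its default of 0.5) are all illustrative assumptions, not details taken from this page.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean norm, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def dot_product_scores(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Baseline: plain dot product between every image and every text embedding."""
    return image_emb @ text_emb.T


def mean_shifted_scores(
    image_emb: np.ndarray,
    text_emb: np.ndarray,
    image_mean: np.ndarray,
    text_mean: np.ndarray,
    lam: float = 0.5,  # hypothetical scaling factor, assumed for illustration
) -> np.ndarray:
    """Subtract a scaled mean embedding from each modality, then take the dot product."""
    return (image_emb - lam * image_mean) @ (text_emb - lam * text_mean).T


# Random stand-ins for encoder outputs: 8 images and 5 captions in a 16-d space.
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(8, 16)))
text_emb = l2_normalize(rng.normal(size=(5, 16)))

# Approximate each modality's mean representation from the test batch itself.
image_mean = image_emb.mean(axis=0)
text_mean = text_emb.mean(axis=0)

baseline = dot_product_scores(image_emb, text_emb)
shifted = mean_shifted_scores(image_emb, text_emb, image_mean, text_mean)
```

Under this reading, the encoders stay untouched and only the scoring step changes, which matches the abstract's claim that no retraining or fine-tuning is needed. Setting `lam=0` recovers the plain dot product.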

updated: Wed Feb 22 2023 01:14:30 GMT+0000 (UTC)

published: Wed Feb 22 2023 01:14:30 GMT+0000 (UTC)

arXiv

