Understanding Self-Supervised Learning Dynamics without Contrastive Pairs

Yuandong Tian; Xinlei Chen; Surya Ganguli

While contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing the distance between views from different data points (negative pairs), recent non-contrastive SSL methods (e.g., BYOL and SimSiam) show remarkable performance without negative pairs, using an extra learnable predictor and a stop-gradient operation. A fundamental question arises: why do these methods not collapse into trivial representations? We answer this question via a simple theoretical study and propose a novel approach, DirectPred, that directly sets the linear predictor based on the statistics of its inputs, without gradient training. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm and outperforms a linear predictor by 2.5% in 300-epoch training (and by 5% in 60-epoch training). DirectPred is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, such as predictor networks, stop-gradients, exponential moving averages, and weight decay, all come into play. Our simple theory recapitulates the results of real-world ablation studies on both STL-10 and ImageNet. Code is released at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
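The abstract says only that DirectPred sets the linear predictor from the statistics of its inputs, without gradient training. The NumPy sketch below illustrates one way this could look, and is not the released implementation: it keeps a moving average of the correlation matrix of the predictor inputs, eigen-decomposes it, and builds a symmetric predictor that shares its eigenvectors. The function names, the square-root eigenvalue scaling, and the `rho` and `eps` parameters are assumptions made for illustration.

```python
import numpy as np


def update_correlation(F_ema, feats, rho=0.3):
    """Moving-average estimate of the (uncentered) correlation matrix.

    feats: array of shape (batch, dim), the inputs to the predictor.
    rho:   hypothetical smoothing factor (an assumption of this sketch).
    """
    F_batch = feats.T @ feats / feats.shape[0]
    return rho * F_ema + (1.0 - rho) * F_batch


def predictor_from_statistics(F, eps=0.1):
    """Build a symmetric linear predictor aligned with the eigenspace of F.

    The square-root scaling and the eps offset are illustrative assumptions,
    not details stated in the abstract.
    """
    s, U = np.linalg.eigh(F)       # F is symmetric, so eigh applies
    s = np.clip(s, 0.0, None)      # guard against tiny negative eigenvalues
    s_max = s.max()
    if s_max <= 0.0:               # degenerate input: fall back to eps * I
        return eps * np.eye(F.shape[0])
    p = np.sqrt(s / s_max) + eps   # predictor eigenvalues, in [eps, 1 + eps]
    return (U * p) @ U.T           # U diag(p) U^T


# Toy usage: synthetic features with unequal variance per dimension.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 8)) * np.linspace(0.1, 2.0, 8)
F = update_correlation(np.zeros((8, 8)), feats, rho=0.0)
W_p = predictor_from_statistics(F, eps=0.1)
```

In a training loop, the two functions would be called on each batch of online-network outputs and `W_p` would replace the predictor weights, so no gradient ever flows into the predictor.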

updated: Fri Oct 08 2021 02:41:50 GMT+0000 (UTC)

published: Fri Feb 12 2021 22:57:28 GMT+0000 (UTC)

arXiv
