Self-Supervised Learning by Estimating Twin Class Distributions

Feng Wang; Tao Kong; Rufeng Zhang; Huaping Liu; Hang Li

We present TWIST, a simple and theoretically explainable self-supervised representation learning method that classifies large-scale unlabeled datasets end to end. We employ a siamese network terminated by a softmax operation to produce twin class distributions for two augmented views of an image. Without supervision, we enforce the class distributions of different augmentations to be consistent. However, simply minimizing the divergence between augmentations leads to collapsed solutions, i.e., the network outputs the same class probability distribution for all images, in which case no information about the input image is left. To solve this problem, we propose to maximize the mutual information between the input and the class predictions. Specifically, we minimize the entropy of the distribution for each sample to make the class prediction for each sample assertive, and we maximize the entropy of the mean distribution to make the predictions of different samples diverse. In this way, TWIST naturally avoids collapsed solutions without specific designs such as an asymmetric network, a stop-gradient operation, or a momentum encoder. As a result, TWIST outperforms state-of-the-art methods on a wide range of tasks. In particular, TWIST performs surprisingly well on semi-supervised learning, achieving 61.2% top-1 accuracy with 1% of ImageNet labels using a ResNet-50 backbone, surpassing the previous best result by an absolute improvement of 6.2%. Code and pre-trained models are available at: https://github.com/bytedance/TWIST
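The three terms described in the abstract (consistency between the twin distributions, low per-sample entropy, high entropy of the mean distribution) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the symmetric KL form of the consistency term, and the equal weighting of the three terms are assumptions made for illustration only; the released code at the repository above is the authoritative reference.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    # Shannon entropy along the last axis (the class axis).
    return -(p * np.log(p + eps)).sum(axis=-1)

def kl(p, q, eps=1e-12):
    # KL(p || q), computed per sample along the class axis.
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

def twin_loss(logits_a, logits_b):
    """Illustrative loss over two augmented views (hypothetical helper).

    logits_a, logits_b: arrays of shape (batch, classes).
    Equal weighting of the three terms is an assumption of this sketch.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    # Consistency: the twin class distributions should agree.
    consistency = 0.5 * (kl(p, q) + kl(q, p)).mean()
    # Sharpness: low entropy per sample makes each prediction assertive.
    sharpness = 0.5 * (entropy(p).mean() + entropy(q).mean())
    # Diversity: high entropy of the batch-mean distribution spreads
    # predictions over classes, which penalizes collapse.
    diversity = 0.5 * (entropy(p.mean(axis=0)) + entropy(q.mean(axis=0)))
    return consistency + sharpness - diversity
```

Under these assumptions, a collapsed batch (every image mapped to the same distribution) yields a loss of about zero, because the sharpness and diversity terms cancel, whereas confident predictions spread evenly over the classes yield a loss close to the negative log of the number of classes. Minimizing the consistency term alone cannot distinguish the two cases, which is the failure mode the abstract describes.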

updated: Mon Dec 06 2021 06:52:45 GMT+0000 (UTC)

published: Thu Oct 14 2021 14:39:39 GMT+0000 (UTC)

arXiv
