A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

Shlok Mishra; Joshua Robinson; Huiwen Chang; David Jacobs; Aaron Sarna; Aaron Maschinot; Dilip Krishnan

We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on ImageNet, but is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance on ImageNet of our ViT-L model is 86.1%, compared to 85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than CAN for ViT-L models.
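
As a rough illustration of how the three objectives named above might be combined, the following is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`random_mask`, `info_nce`, `can_style_loss`), the one-directional InfoNCE form of the contrastive term, the temperature of 0.1, and the loss weights `w_rec` and `w_noise` are all assumptions made for illustration; the abstract specifies only the three components and the 50% random masking of both views. The encoder and decoder are omitted, so their outputs are passed in as plain lists.

```python
import math
import random


def random_mask(num_patches, mask_ratio=0.5, rng=random):
    """Split patch indices into (kept, masked), chosen uniformly at random."""
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    return sorted(indices[num_masked:]), sorted(indices[:num_masked])


def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def info_nce(z1, z2, temperature=0.1):
    """Contrastive term: row i of z1 should match row i of z2, not other rows.

    Assumption: a one-directional InfoNCE loss over cosine similarities.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    z1 = [normalize(v) for v in z1]
    z2 = [normalize(v) for v in z2]
    total = 0.0
    for i, anchor in enumerate(z1):
        logits = [
            sum(a * b for a, b in zip(anchor, other)) / temperature
            for other in z2
        ]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += log_denominator - logits[i]
    return total / len(z1)


def can_style_loss(z1, z2, recon, target, noise_pred, noise,
                   w_rec=1.0, w_noise=1.0):
    """Weighted sum of contrastive, reconstruction, and noise-prediction terms.

    Assumption: the weights w_rec and w_noise are hypothetical hyperparameters.
    """
    contrastive = info_nce(z1, z2)
    reconstruction = mse(recon, target)
    noise_prediction = mse(noise_pred, noise)
    return contrastive + w_rec * reconstruction + w_noise * noise_prediction
```

In a training loop built on this sketch, `random_mask` would be called once per view with `mask_ratio=0.5`, so the encoder processes only half of the patches in each view, which is where the efficiency gain over unmasked contrastive training would come from.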

updated: Sun Oct 30 2022 16:21:22 GMT+0000 (UTC)

published: Sun Oct 30 2022 16:21:22 GMT+0000 (UTC)

arXiv
