Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget

Johannes Lehner; Benedikt Alkin; Andreas Fürst; Elisabeth Rumetshofer; Lukas Miklautz; Sepp Hochreiter

Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, to adapt to downstream tasks, they require a sufficient amount of labeled data, since their rich features capture not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification in the absence of large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that applies Nearest Neighbor Contrastive Learning (NNCLR) to a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. Applied to large and huge Vision Transformer (ViT) models, MAE-CT matches or exceeds previous self-supervised methods trained on ImageNet in linear probing, k-NN, and low-shot classification accuracy, as well as in unsupervised clustering accuracy. Notably, similar results can be achieved without additional image augmentations. While ID methods generally rely on hand-crafted augmentations to avoid shortcut learning, we find that nearest neighbor lookup is sufficient and that this data-driven augmentation effect improves with model size. MAE-CT is compute-efficient. For instance, starting from an MAE pre-trained ViT-L/16, MAE-CT increases the ImageNet 1% low-shot accuracy from 67.7% to 72.6%, linear probing accuracy from 76.0% to 80.2%, and k-NN accuracy from 60.6% to 79.1% in just five hours using eight A100 GPUs.
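To make the nearest neighbor lookup mentioned in the abstract more concrete, the following is a minimal NumPy sketch of a nearest-neighbor contrastive loss in the general style of NNCLR. It is an illustration under stated assumptions, not the authors' implementation: the function names (`nn_lookup`, `nn_contrastive_loss`), the `queue` of stored embeddings, the temperature value, and the random toy data are all hypothetical, and details such as the encoder, projection heads, and queue updates are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def nn_lookup(z, queue):
    """For each row of z, return the most cosine-similar row of queue."""
    sims = l2_normalize(z) @ l2_normalize(queue).T  # shape (batch, queue_size)
    return queue[np.argmax(sims, axis=1)]


def nn_contrastive_loss(z_a, z_b, queue, temperature=0.2):
    """InfoNCE loss where each anchor z_a[i] is swapped for its nearest neighbor.

    z_a, z_b: embeddings of two views of the same batch of images.
    queue: previously stored embeddings (hypothetical support set).
    temperature: illustrative value, not taken from the source.
    """
    neighbors = l2_normalize(nn_lookup(z_a, queue))
    z_b = l2_normalize(z_b)
    # Row i holds similarities of neighbor i to every z_b; the positive is on the diagonal.
    logits = neighbors @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
queue = rng.normal(size=(256, 32))
z_a = rng.normal(size=(8, 32))
z_b = z_a + 0.05 * rng.normal(size=(8, 32))
loss = nn_contrastive_loss(z_a, z_b, queue)
```

In this sketch the positive for each sample is a retrieved neighbor rather than the sample's own embedding, so the variation between positives comes from the data itself. This is one way to read the abstract's remark that nearest neighbor lookup can act as a data-driven augmentation.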

updated: Thu Apr 20 2023 17:51:09 GMT+0000 (UTC)

published: Thu Apr 20 2023 17:51:09 GMT+0000 (UTC)

arXiv
