MimCo: Masked Image Modeling Pre-training with Contrastive Teacher

Qiang Zhou; Chaohui Yu; Hao Luo; Zhibin Wang; Hao Li

Masked image modeling (MIM), which requires the target model to recover the masked part of the input image, has recently received much attention in self-supervised learning (SSL). Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the learned representations are less separable, especially when compared with those from contrastive learning pre-training. This leads us to ask whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to use different data augmentations and training strategies, combining these two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: patch-level and image-level reconstruction losses. Extensive transfer experiments on downstream tasks demonstrate the superior performance of the MimCo pre-training framework. Taking ViT-S as an example, when using the pre-trained MoCov3-ViT-S as the teacher model, MimCo needs only 100 epochs of pre-training to achieve 82.53% top-1 fine-tuning accuracy on ImageNet-1K, which outperforms state-of-the-art self-supervised learning counterparts.
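To make the two learning targets concrete, the following is a minimal sketch of how a patch-level and an image-level reconstruction loss against a frozen teacher might be combined. The abstract does not give the form of either loss, so everything here is an assumption for illustration: mean squared error stands in for both losses, mean pooling stands in for the image-level feature, and the names `mimco_style_loss`, `w_patch`, and `w_image` are hypothetical rather than taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_pool(patch_feats):
    """Average per-patch features into one image-level feature (an assumption)."""
    n = len(patch_feats)
    return [sum(col) / n for col in zip(*patch_feats)]


def mimco_style_loss(student_patches, teacher_patches, mask,
                     w_patch=1.0, w_image=1.0):
    """Hypothetical two-target loss: patch-level plus image-level terms.

    student_patches: features the student predicts from the masked image.
    teacher_patches: features a frozen, pre-trained contrastive teacher
                     produces for the same patches.
    mask:            booleans, True where the student's input was masked.
    MSE is a stand-in; the abstract does not state the actual loss.
    """
    masked = [i for i, m in enumerate(mask) if m]
    if not masked:
        raise ValueError("at least one patch must be masked")
    # Patch-level term: match the teacher on the masked positions only.
    patch_loss = sum(mse(student_patches[i], teacher_patches[i])
                     for i in masked) / len(masked)
    # Image-level term: match the pooled feature of the whole image.
    image_loss = mse(mean_pool(student_patches), mean_pool(teacher_patches))
    total = w_patch * patch_loss + w_image * image_loss
    return total, patch_loss, image_loss


# Toy example: three patches with two-dimensional features, first patch masked.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
total, patch_loss, image_loss = mimco_style_loss(student, teacher,
                                                 [True, False, False])
```

In a real two-stage setup the teacher's weights would stay fixed during the second stage, so only the student receives gradients from both terms; the sketch omits training entirely and only shows how the two targets could be weighted and summed.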

updated: Thu Apr 20 2023 07:41:05 GMT+0000 (UTC)

published: Wed Sep 07 2022 10:59:05 GMT+0000 (UTC)

arXiv
