MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

Jihao Liu; Xin Huang; Jinliang Zheng; Yu Liu; Hongsheng Li

In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim to reconstruct the original image tokens from the corrupted image. However, we find that, because of the large masking ratio (e.g., 60% in SimMIM), using the [MASK] symbol greatly slows down training and causes pretraining-finetuning inconsistency. MAE, on the other hand, introduces no [MASK] tokens in its encoder but is not applicable to hierarchical Vision Transformers. To solve this issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with the visible tokens of another image, i.e., we create a mixed image. We then conduct dual reconstruction to recover the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales the model up to 600M parameters. Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K after pretraining for 600 epochs. In addition, its transfer performance on 6 other datasets shows that MixMAE has a better FLOPs/performance tradeoff than previous popular MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
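
The mixing step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a toy version in plain Python that operates on lists of placeholder tokens. The function names `mix_tokens` and `reconstruction_targets`, the `mask_ratio` and `seed` parameters, and the choice to define each image's reconstruction targets as the positions where it is not visible are all assumptions made for illustration.

```python
import random


def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, seed=0):
    """Replace the masked tokens of image A with the visible tokens of image B.

    Hypothetical helper for illustration only.
    Returns the mixed token list and a binary mask
    (1 = position holds a token from A, 0 = position holds a token from B).
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both images must have the same number of tokens")
    n = len(tokens_a)
    num_masked = int(round(n * mask_ratio))
    rng = random.Random(seed)
    # Positions of A that are masked out and filled with B's tokens.
    masked = set(rng.sample(range(n), num_masked))
    mask = [0 if i in masked else 1 for i in range(n)]
    mixed = [a if m == 1 else b for a, b, m in zip(tokens_a, tokens_b, mask)]
    return mixed, mask


def reconstruction_targets(tokens_a, tokens_b, mask):
    """List (position, token) pairs each image must recover.

    Assumption: an image is reconstructed at the positions where it is
    NOT visible in the mixed input, so the two target sets are complementary.
    """
    target_a = [(i, a) for i, (a, m) in enumerate(zip(tokens_a, mask)) if m == 0]
    target_b = [(i, b) for i, (b, m) in enumerate(zip(tokens_b, mask)) if m == 1]
    return target_a, target_b


if __name__ == "__main__":
    tokens_a = [f"a{i}" for i in range(8)]
    tokens_b = [f"b{i}" for i in range(8)]
    mixed, mask = mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, seed=0)
    target_a, target_b = reconstruction_targets(tokens_a, tokens_b, mask)
    print("mixed:", mixed)
    print("mask: ", mask)
    print("targets for A:", target_a)
    print("targets for B:", target_b)
```

In this sketch every position of the mixed input carries a real image token, so no [MASK] placeholder enters the encoder, and the two target sets together cover every position exactly once.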

updated: Fri Mar 31 2023 09:26:28 GMT+0000 (UTC)

published: Thu May 26 2022 04:00:42 GMT+0000 (UTC)

arXiv
