SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

Gang Li; Heliang Zheng; Daqing Liu; Bing Su; Changwen Zheng

SemMAE：マスクされたオートエンコーダを学習するためのセマンティックガイド付きマスキング

最近、マスクされた言語モデリングに追いつくために、マスクされた画像モデリングにおいて大きな進歩が見られた。ただし、NLPの単語とは異なり、画像の意味分解がないため、マスクされた自動エンコード（MAE）は視覚と言語で異なります。この論文では、単語の潜在的な視覚的類似物、すなわち意味論的部分を探求し、意味論的ガイド付きマスキング戦略を提案することにより、意味論的情報をMAEのトレーニングプロセスに統合します。広く採用されているランダムマスキングと比較して、私たちのマスキング戦略は、ネットワークを徐々に導き、さまざまな情報、つまり、パーツ内のパターンからパーツ間の関係までを学習することができます。特に、これは2つのステップで実現します。 1）セマンティックパーツ学習：ViTベースのエンコーダーのマルチヘッドアテンションを活用および改良することにより、セマンティックパーツを取得するための自己監視型パーツ学習方法を設計します。 2）セマンティックガイド付きMAE（SemMAE）トレーニング：各部分のパッチの一部のマスキングから画像内の（全体の）部分の一部のマスキングまで、さまざまなマスキング戦略を設計します。さまざまな視覚タスクに関する広範な実験は、SemMAEが意味情報を統合することによってより良い画像表現を学習できることを示しています。特に、SemMAEはImageNet-1kで84.5％の微調整精度を達成し、バニラMAEを1.4％上回っています。セマンティックセグメンテーションときめ細かい認識タスクでは、SemMAEも大幅な改善をもたらし、最先端のパフォーマンスを実現します。

Recently, significant progress has been made in masked image modeling to catch up to masked language modeling. However, unlike words in NLP, the lack of semantic decomposition of images still makes masked autoencoding (MAE) different between vision and language. In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and we integrate semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy. Compared to widely adopted random masking, our masking strategy can gradually guide the network to learn various information, i.e., from intra-part patterns to inter-part relations. In particular, we achieve this in two steps. 1) Semantic part learning: we design a self-supervised part learning method to obtain semantic parts by leveraging and refining the multi-head attention of a ViT-based encoder. 2) Semantic-guided MAE (SemMAE) training: we design a masking strategy that varies from masking a portion of patches in each part to masking a portion of (whole) parts in an image. Extensive experiments on various vision tasks show that SemMAE can learn better image representation by integrating semantic information. In particular, SemMAE achieves 84.5% fine-tuning accuracy on ImageNet-1k, which outperforms the vanilla MAE by 1.4%. In the semantic segmentation and fine-grained recognition tasks, SemMAE also brings significant improvements and yields the state-of-the-art performance.

updated: Tue Jun 21 2022 09:08:32 GMT+0000 (UTC)

published: Tue Jun 21 2022 09:08:32 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト