Improving Masked Autoencoders by Learning Where to Mask

Haijian Chen; Wendong Zhang; Yunbo Wang; Xiaokang Yang

Masked image modeling is a promising self-supervised learning method for visual data. It typically masks image patches at random, which largely ignores how information density varies across patches. This raises a question: is there a better masking strategy than random sampling, and if so, how can it be learned? We study this problem empirically and first find that introducing object-centric priors in mask sampling can significantly improve the learned representations. Inspired by this observation, we present AutoMAE, a fully differentiable framework that uses Gumbel-Softmax to interlink an adversarially trained mask generator and a mask-guided image modeling process. In this way, our approach can adaptively find patches with higher information density for different images, and further strike a balance between the information gain obtained from image reconstruction and its practical training difficulty. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
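
The abstract does not give AutoMAE's implementation, so the following is only a minimal sketch of the general Gumbel-Softmax idea it names: per-patch scores are perturbed with Gumbel noise and passed through a temperature-scaled softmax, giving a relaxed sample from which a hard mask can be read off. The function names (`gumbel_softmax`, `sample_mask`), the 0.75 mask ratio, the temperature, and the top-k step that turns soft weights into a binary mask are all illustrative assumptions, not the authors' method. It is written in plain Python for readability; a trainable version would need an autodiff library.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return softmax((logits + g) / tau) with g ~ Gumbel(0, 1)."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # clamp away from 0 to avoid log(0)
        gumbel = -math.log(-math.log(u))  # inverse-transform Gumbel sample
        perturbed.append((logit + gumbel) / tau)
    peak = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mask(logits, mask_ratio=0.75, tau=1.0, seed=None):
    """Hypothetical helper: soft patch weights plus a hard 0/1 mask.

    Assumption: the patches with the largest relaxed weights are masked.
    The mask ratio and the top-k rule are illustrative choices only.
    """
    rng = random.Random(seed)
    probs = gumbel_softmax(logits, tau=tau, rng=rng)
    num_masked = round(mask_ratio * len(logits))
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    masked = set(ranked[:num_masked])
    hard = [1 if i in masked else 0 for i in range(len(logits))]
    return probs, hard


# Example: 16 patches, with higher (made-up) scores for the middle patches.
scores = [2.0 if 4 <= i < 12 else 0.0 for i in range(16)]
soft_weights, hard_mask = sample_mask(scores, mask_ratio=0.75, tau=1.0, seed=0)
```

Lower values of `tau` push the soft weights toward a one-hot vector, while higher values spread them more evenly; patches with higher scores are more likely, but not guaranteed, to be selected on any given draw.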

updated: Sun Mar 12 2023 05:28:55 GMT+0000 (UTC)

published: Sun Mar 12 2023 05:28:55 GMT+0000 (UTC)

arXiv
