Understanding Masked Autoencoders via Hierarchical Latent Variable Models

Lingjing Kong; Martin Q. Ma; Guangyi Chen; Eric P. Xing; Yuejie Chi; Louis-Philippe Morency; Kun Zhang



The masked autoencoder (MAE), a simple and effective self-supervised learning framework based on the reconstruction of masked image regions, has recently achieved prominent success in a variety of vision tasks. Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking. In this work, we formally characterize and justify existing empirical insights and provide theoretical guarantees for MAE. We formulate the underlying data-generating process as a hierarchical latent variable model and show that, under reasonable assumptions, MAE provably identifies a set of latent variables in the hierarchical model, explaining why MAE can extract high-level information from pixels. Further, we show how key hyperparameters in MAE (the masking ratio and the patch size) determine which true latent variables are recovered, thereby influencing the level of semantic information in the representation. Specifically, extremely large or small masking ratios inevitably lead to low-level representations. Our theory offers coherent explanations of existing empirical observations and provides insights into potential empirical improvements and fundamental limitations of the masking-reconstruction paradigm. We conduct extensive experiments to validate our theoretical insights.
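
To make the two hyperparameters named in the abstract concrete, the following is a minimal sketch of how a masking ratio and a patch size together decide which patches of an image are hidden from the encoder. It is an illustration only, not the authors' implementation: the function name `random_patch_mask`, the uniform random sampling of patches, and the example values (a 224-pixel square image, 16-pixel patches, a 0.75 masking ratio) are all assumptions that do not come from the abstract, and the encoder and decoder themselves are omitted.

```python
import random


def random_patch_mask(image_size, patch_size, masking_ratio, seed=0):
    """Split a square image into patches and pick a random subset to mask.

    Hypothetical helper for illustration; returns the total number of
    patches, the sorted masked patch indices, and the sorted visible ones.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if not 0.0 <= masking_ratio <= 1.0:
        raise ValueError("masking_ratio must lie in [0, 1]")

    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    num_masked = int(round(masking_ratio * num_patches))

    # Assumption: patches are masked uniformly at random.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return num_patches, masked, visible


# Illustrative values only (not taken from the abstract).
total, masked, visible = random_patch_mask(224, 16, 0.75)
print(total, len(masked), len(visible))  # prints: 196 147 49
```

In this sketch, a larger patch size reduces the number of patches, and the masking ratio fixes what fraction of them the model must reconstruct; these are the two quantities the abstract says determine which latent variables are recovered.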

updated: Thu Jun 08 2023 03:00:10 GMT+0000 (UTC)

published: Thu Jun 08 2023 03:00:10 GMT+0000 (UTC)

arXiv

