PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Yuan Liu; Songyang Zhang; Jiacheng Chen; Kai Chen; Dahua Lin

Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pre-trained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, examining the input image patches and the reconstruction target and highlighting two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network's focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., those using raw images as the reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework. Code and models are available at https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim.
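The first strategy, removing high-frequency components from the reconstruction target, can be illustrated with a minimal NumPy sketch. This is not the released implementation: the abstract does not specify the filter, so the ideal circular low-pass mask, the function name `low_pass_target`, and the `radius` cutoff are all assumptions chosen for illustration.

```python
import numpy as np


def low_pass_target(image: np.ndarray, radius: int) -> np.ndarray:
    """Return `image` with frequencies beyond `radius` removed.

    image:  float array of shape (H, W) or (H, W, C).
    radius: hypothetical cutoff, in frequency bins from the spectrum
            centre (an assumption; the abstract gives no value).
    """
    h, w = image.shape[:2]

    # Distance of every spectrum bin from the centre, which is where
    # fftshift places the zero-frequency (DC) component.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = dist <= radius
    if image.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over channels

    # Forward FFT over the spatial axes, centre the spectrum, mask it,
    # then undo the shift and transform back to pixel space.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    filtered = np.fft.ifft2(
        np.fft.ifftshift(spectrum * mask, axes=(0, 1)), axes=(0, 1)
    )
    return filtered.real
```

In a pixel-based MIM setup, a function like this would be applied only to the reconstruction target, so the loss compares the decoder's prediction with the filtered image while the encoder input is left unchanged.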

updated: Fri Mar 24 2023 05:37:41 GMT+0000 (UTC)

published: Sat Mar 04 2023 13:38:51 GMT+0000 (UTC)

arXiv
