A Unified View of Masked Image Modeling

Zhiliang Peng; Li Dong; Hangbo Bao; Qixiang Ye; Furu Wei

Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we revisit existing methods and propose a unified view of masked image modeling. Under this unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioned on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% mIoU for semantic segmentation on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim.
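The objective described in the abstract (predicting normalized teacher features at masked positions only) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the released code: the function names `normalize`, `masked_distill_loss`, and `random_mask` are hypothetical, the normalization is a per-patch standardization, and the distance is a plain mean squared error. The abstract does not specify any of these choices.

```python
import math
import random


def normalize(vec, eps=1e-6):
    """Standardize one feature vector to zero mean and unit variance.

    Assumption: per-patch standardization stands in for the
    "normalized" teacher features mentioned in the abstract.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def masked_distill_loss(student_preds, teacher_feats, mask):
    """Mean squared error between student predictions and normalized
    teacher features, averaged over masked patch positions only.

    Assumption: MSE is used here for simplicity; the actual loss
    may differ.
    """
    total, count = 0.0, 0
    for pred, feat, is_masked in zip(student_preds, teacher_feats, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        target = normalize(feat)
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        count += 1
    return total / count if count else 0.0


def random_mask(num_patches, ratio, seed=0):
    """Pick int(num_patches * ratio) patch positions to mask (hypothetical helper)."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), int(num_patches * ratio)))
    return [i in masked for i in range(num_patches)]
```

In this sketch the student would see the corrupted image (masked patches replaced or dropped) while the teacher sees the full image; only the loss computation is shown, since the abstract does not describe either network in detail.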

updated: Wed Oct 19 2022 14:59:18 GMT+0000 (UTC)

published: Wed Oct 19 2022 14:59:18 GMT+0000 (UTC)

arXiv
