DeepMIM: Deep Supervision for Masked Image Modeling

Sucheng Ren; Fangyun Wei; Samuel Albanie; Zheng Zhang; Han Hu

Deep supervision, which applies additional supervision to the intermediate features of a neural network, was widely used for image classification in the early deep learning era because it significantly reduces training difficulty and eases optimization, for example by avoiding vanishing gradients, compared with vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision in image classification was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM), which pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme. Experimentally, we find that deep supervision drives the shallower layers to learn more meaningful representations, accelerates model convergence, and expands attention diversity. Our approach, called DeepMIM, significantly boosts the representation capability of each layer. In addition, DeepMIM is compatible with many MIM models across a range of reconstruction targets. For instance, using ViT-B, DeepMIM on MAE achieves 84.2 top-1 accuracy on ImageNet, outperforming MAE by +0.6. By combining DeepMIM with a stronger tokenizer, CLIP, our model achieves state-of-the-art performance on various downstream tasks, including image classification (85.6 top-1 accuracy on ImageNet-1K, outperforming MAE-CLIP by +0.8), object detection (52.8 AP^box on COCO), and semantic segmentation (53.1 mIoU on ADE20K). Code and models are available at https://github.com/OliverRensu/DeepMIM.
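The idea of supervising intermediate layers in a mask-and-predict setup can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss form, which layers are supervised, or how each layer's prediction is produced, so the mean-squared-error objective, the equal layer weights, and the names `masked_mse` and `deep_supervised_loss` are all assumptions made for illustration.

```python
# Illustrative sketch only: loss form, weights, and function names are
# assumptions, not taken from the DeepMIM paper or repository.

def masked_mse(pred, target, mask):
    """Mean squared error computed over masked patches only.

    pred, target: lists of patches, each patch a list of floats.
    mask: list of 0/1 flags; 1 means the patch was masked and must be
    reconstructed, 0 means it was visible and is ignored by the loss.
    """
    total, count = 0.0, 0
    for p_patch, t_patch, m in zip(pred, target, mask):
        if m:
            total += sum((p - t) ** 2 for p, t in zip(p_patch, t_patch))
            count += len(t_patch)
    return total / count if count else 0.0


def deep_supervised_loss(layer_preds, target, mask, weights=None):
    """Weighted sum of reconstruction losses, one per supervised layer.

    layer_preds holds one reconstruction per supervised layer (for
    example, an intermediate layer and the final layer). Equal weights
    are assumed when none are given.
    """
    if weights is None:
        weights = [1.0] * len(layer_preds)
    return sum(w * masked_mse(pred, target, mask)
               for w, pred in zip(weights, layer_preds))


# Toy data: three patches of two values each; patches 0 and 2 are masked.
target = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
mask = [1, 0, 1]
shallow_pred = [[0.0, 0.0], [5.0, 5.0], [2.0, 2.0]]  # wrong on patch 0
final_pred = [[1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]    # exact on masked patches

loss = deep_supervised_loss([shallow_pred, final_pred], target, mask)
print(loss)  # 0.5: only the shallow layer's error on patch 0 contributes
```

In this toy case the final prediction is already perfect on the masked patches, so supervising only the last layer would give zero loss. The extra term still penalizes the shallow layer's error, which is the sense in which deep supervision provides a training signal to earlier layers.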

updated: Wed Mar 15 2023 17:59:55 GMT+0000 (UTC)

published: Wed Mar 15 2023 17:59:55 GMT+0000 (UTC)

arXiv
