Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Yuxin Fang; Shusheng Yang; Shijie Wang; Yixiao Ge; Ying Shan; Xinggang Wang

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% to 50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from a single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained large-kernel patchify stem, and its intermediate features can naturally serve as the higher-resolution inputs of a feature pyramid network without further upsampling or other manipulations. The pre-trained ViT is then regarded only as the third stage of our detector's backbone rather than as the whole feature extractor, which results in a ConvNet-ViT hybrid feature extractor. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8× faster. Code and pre-trained models are available at https://github.com/hustvl/MIMDet.
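The two observations can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes backbone strides of 4, 8, and 16 (the abstract does not state them; 16 is only the usual ViT patch size), and the helper names `feature_map_sizes` and `sample_visible_tokens` are hypothetical. It shows how many stride-16 tokens an image yields and how a uniformly random 25% subset of them could be selected for the ViT encoder.

```python
import random


def feature_map_sizes(height, width, strides=(4, 8, 16)):
    """Spatial size (rows, cols) of each backbone stage.

    The strides are an assumption for illustration: two convolutional
    stem stages at strides 4 and 8, and ViT tokens at stride 16.
    """
    return {s: (height // s, width // s) for s in strides}


def sample_visible_tokens(num_tokens, keep_ratio, seed=None):
    """Return sorted indices of tokens kept by uniform random sampling."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sampling without replacement, so every kept index is unique.
    return sorted(rng.sample(range(num_tokens), num_keep))


# Example: a 1024 x 1024 input image.
sizes = feature_map_sizes(1024, 1024)
rows, cols = sizes[16]
num_tokens = rows * cols  # tokens that would enter the ViT stage
visible = sample_visible_tokens(num_tokens, keep_ratio=0.25, seed=0)

print(sizes)         # per-stride feature map sizes
print(len(visible))  # number of tokens the encoder would process
```

Under these assumptions the stride-4 and stride-8 maps come directly from the convolutional stem, so no upsampling of ViT features is needed to fill the higher-resolution pyramid levels, and the encoder sees only a quarter of the stride-16 tokens.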

updated: Thu May 19 2022 03:41:11 GMT+0000 (UTC)

published: Wed Apr 06 2022 17:59:04 GMT+0000 (UTC)

arXiv
