ConvMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao; Teli Ma; Hongsheng Li; Jifeng Dai; Yu Qiao

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to supervise the multi-scale features of the encoder more directly in order to strengthen them. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
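
The abstract names block-wise masking and masked convolution without giving details, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes the mask is drawn on the coarsest token grid and copied to finer grids so that every coarse token covers one block of fine positions, and that a masked convolution zeroes masked inputs so that visible outputs never depend on masked content. The function names, the 4x4 grid, the 0.75 mask ratio, and the 3x3 box filter standing in for a learned convolution are all illustrative assumptions.

```python
import random


def coarse_mask(grid, mask_ratio, seed=0):
    """Return a grid x grid list of 0/1 flags; 1 marks a masked coarse token."""
    rng = random.Random(seed)
    n = grid * grid
    n_masked = int(n * mask_ratio)
    flat = [1] * n_masked + [0] * (n - n_masked)
    rng.shuffle(flat)
    return [flat[r * grid:(r + 1) * grid] for r in range(grid)]


def upsample_mask(mask, factor):
    """Copy every coarse cell into a factor x factor block on a finer grid."""
    fine = []
    for row in mask:
        fine_row = [v for v in row for _ in range(factor)]
        fine.extend(list(fine_row) for _ in range(factor))
    return fine


def masked_conv3x3(feat, mask):
    """Hypothetical stand-in for a masked convolution (3x3 box filter).

    Masked inputs are zeroed before filtering and masked outputs stay zero,
    so values at masked positions cannot influence visible positions.
    """
    h, w = len(feat), len(feat[0])
    visible = [[feat[r][c] * (1 - mask[r][c]) for c in range(w)] for r in range(h)]
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                continue  # masked position: leave the output at zero
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += visible[rr][cc]
            out[r][c] = total / 9.0
    return out


# Assumed toy setting: 4x4 coarse grid, 8x8 fine grid, 75% of tokens masked.
mask_coarse = coarse_mask(grid=4, mask_ratio=0.75)
mask_fine = upsample_mask(mask_coarse, factor=2)
features = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
filtered = masked_conv3x3(features, mask_fine)
```

Because the fine mask is an exact copy of the coarse one, a token that is visible at the coarsest scale is built only from visible fine positions, which is the property the masked convolution is meant to protect in this sketch.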

updated: Sun May 08 2022 15:12:19 GMT+0000 (UTC)

published: Sun May 08 2022 15:12:19 GMT+0000 (UTC)

arXiv
