Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

Jun Chen; Ming Hu; Boyang Li; Mohamed Elhoseiny

Self-supervised learning for computer vision has made tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among self-supervised methods, generative approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7×7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384×384 images, it reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 AP^box on object detection and 0.5 AP^mask on instance segmentation. LoMaR is especially computation-efficient when pretraining on high-resolution images: for example, on 448×448 images it is 3.1× faster than MAE while achieving 0.2% higher classification accuracy. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available at https://github.com/junchen14/LoMaR.
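To make the idea of reconstructing within a small window concrete, the following is a minimal sketch of how one 7×7 window of patches could be sampled from a patch grid and partially masked. It is not taken from the authors' repository. The function name `sample_local_mask`, the 14×14 grid (which would correspond to a 224×224 image with 16×16 patches), and the mask ratio of 0.8 are all assumptions chosen for illustration; the abstract specifies only the 7×7 window size.

```python
import numpy as np

def sample_local_mask(grid_size=14, window=7, mask_ratio=0.8, rng=None):
    """Pick one window x window block of patches and mask a fraction of it.

    Hypothetical helper for illustration only. Returns (window_idx, masked):
    window_idx holds flat patch indices into the grid_size x grid_size grid,
    masked is a boolean array of the same length (True = patch is hidden
    and would have to be reconstructed).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Choose the top-left corner so the window stays inside the grid.
    top = rng.integers(0, grid_size - window + 1)
    left = rng.integers(0, grid_size - window + 1)

    # Flat indices of every patch inside the window, in row-major order.
    rows = np.arange(top, top + window)
    cols = np.arange(left, left + window)
    window_idx = (rows[:, None] * grid_size + cols[None, :]).ravel()

    # Hide an assumed fraction of the window's patches, chosen uniformly.
    n = window * window
    n_masked = int(round(mask_ratio * n))
    masked = np.zeros(n, dtype=bool)
    masked[rng.choice(n, size=n_masked, replace=False)] = True
    return window_idx, masked

idx, masked = sample_local_mask(rng=np.random.default_rng(0))
print(len(idx), "patches in the window,", int(masked.sum()), "of them masked")
```

Under these assumptions the model would see 49 patches per sampled window rather than all 196 patches of the grid, which is the source of the efficiency argument in the abstract; the exact sampling and masking strategy used by LoMaR should be checked against the paper and code.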

updated: Mon Jun 20 2022 13:28:04 GMT+0000 (UTC)

published: Wed Jun 01 2022 22:46:34 GMT+0000 (UTC)

arXiv
