Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Siyuan Li; Di Wu; Fang Wu; Zelin Zang; Stan Z. Li

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies claim that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A^2MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A^2MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
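
The mask-then-reconstruct idea described in the abstract can be illustrated with a minimal NumPy sketch. This is a generic illustration of masked image modeling, not the A^2MIM method itself: the abstract does not specify the masking strategy or the loss. The function names (`random_patch_mask`, `masked_reconstruction_loss`), the patch size, the mask ratio, the zero fill value, and the L1 loss are all assumptions chosen for the example.

```python
import numpy as np


def random_patch_mask(height, width, patch, ratio, rng):
    """Return a boolean (height, width) mask; True marks masked pixels.

    Hypothetical helper: masks a fixed fraction of non-overlapping patches.
    """
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    # Pick distinct patch indices to mask.
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)
    # Expand each patch-level decision to patch x patch pixels.
    return grid.repeat(patch, axis=0).repeat(patch, axis=1)


def masked_reconstruction_loss(image, prediction, mask):
    """Mean absolute error computed over masked pixels only (assumed loss)."""
    return float(np.abs(image - prediction)[mask].mean())


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real input image
mask = random_patch_mask(32, 32, patch=8, ratio=0.5, rng=rng)

# Assumed fill: masked pixels are replaced with zeros before encoding.
masked_input = np.where(mask[..., None], 0.0, image)

# A trivial "model" that returns its input; a real backbone
# (Transformer or CNN) would predict the missing content instead.
loss = masked_reconstruction_loss(image, masked_input, mask)
```

Because the loss is restricted to masked pixels, the backbone is rewarded only for inferring hidden content from the visible context, which is the pretext task the abstract refers to.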

updated: Fri Jun 02 2023 10:21:16 GMT+0000 (UTC)

published: Fri May 27 2022 12:42:02 GMT+0000 (UTC)

arXiv
