Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Siyuan Li; Di Wu; Fang Wu; Zelin Zang; Baigui Sun; Hao Li; Xuansong Xie; Stan Z. Li

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers (ViTs). Its underlying idea is simple: a portion of the input image is randomly masked out and then reconstructed via the pretext task. However, the working principle behind MIM is not well explained, and previous studies maintain that MIM primarily works for the Transformer family but is incompatible with CNNs. In this paper, we first study interactions among patches to understand what knowledge is learned and how it is acquired via the MIM task. We observe that MIM essentially teaches the model to learn better middle-order interactions among patches and to extract more generalized features. Based on this observation, we propose an Architecture-Agnostic Masked Image Modeling framework (A^2MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A^2MIM learns better representations without explicit design and gives the backbone model a stronger capability to transfer to various downstream tasks, for both Transformers and CNNs.
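
The masking step described in the abstract (randomly hide part of the image, then score the reconstruction on the hidden part) can be illustrated with a minimal NumPy sketch. This is a generic MIM illustration, not the A^2MIM method itself: the function names, the 16-pixel patch size, the 0.6 mask ratio, the zero fill value, and the L1 loss are all assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np


def random_patch_mask(image, patch_size=16, mask_ratio=0.6, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    image: array of shape (H, W, C); H and W must be multiples of patch_size.
    Returns (masked_image, mask), where mask has shape
    (H // patch_size, W // patch_size) and True marks a masked patch.
    All defaults are illustrative assumptions, not values from the paper.
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    mask = flat.reshape(grid_h, grid_w)

    # Expand the patch-level mask to pixel resolution and apply it.
    pixel_mask = np.repeat(np.repeat(mask, patch_size, axis=0), patch_size, axis=1)
    masked = image.copy()
    masked[pixel_mask] = 0.0  # assumed fill value for hidden patches
    return masked, mask


def masked_l1_loss(prediction, target, mask, patch_size=16):
    """Mean absolute error over masked pixels only (assumed loss choice).

    Expects at least one masked patch; otherwise the mean is undefined.
    """
    pixel_mask = np.repeat(np.repeat(mask, patch_size, axis=0), patch_size, axis=1)
    return float(np.abs(prediction - target)[pixel_mask].mean())
```

In a pre-training loop, a backbone (Transformer or CNN) would receive `masked` as input, and its output would be passed to `masked_l1_loss` together with the original image and `mask`, so that only the hidden patches contribute to the objective.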

updated: Thu Sep 29 2022 20:41:42 GMT+0000 (UTC)

published: Fri May 27 2022 12:42:02 GMT+0000 (UTC)

arXiv
