Rethinking Hierarchies in Pre-trained Plain Vision Transformer

Yufei Xu; Jing Zhang; Qiming Zhang; Dacheng Tao

Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective. However, hierarchical ViTs require carefully customized algorithms, e.g., GreenMIM, and cannot use the simple vanilla MAE designed for the plain ViT. More importantly, because hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of plain ViTs, they must be pre-trained themselves, which incurs a massive computational cost on top of the algorithmic complexity. In this paper, we address this problem by disentangling the hierarchical architecture design from the self-supervised pre-training: we transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of the linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, the resulting model outperforms the plain ViT baseline on classification, detection, and segmentation tasks across the ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks. We hope this preliminary study draws more attention from the community to developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.
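
The resolution schedule described in the abstract can be illustrated with a little arithmetic. The sketch below is not the authors' implementation: it assumes a 224x224 input and three pooling layers that each halve the side length (one reading of "from 1/4 to 1/32 sequentially"), and the function names `stage_resolutions` and `average_pool_2x2` are hypothetical helpers introduced only for this illustration.

```python
def stage_resolutions(image_size=224, embed_stride=4, num_pools=3, pool_factor=2):
    """Return (stride, grid side, token count) for each stage.

    Assumptions (not stated in the abstract): a square 224x224 input and
    three pooling layers that each downsample the grid side by 2.
    """
    stages = []
    stride = embed_stride
    for _ in range(num_pools + 1):
        side = image_size // stride
        stages.append((stride, side, side * side))
        stride *= pool_factor
    return stages


def average_pool_2x2(grid):
    """Average non-overlapping 2x2 windows of a 2D list (even height and width)."""
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


if __name__ == "__main__":
    # Hierarchical variant: embedding stride 4, then three 2x poolings.
    for stride, side, tokens in stage_resolutions():
        print(f"1/{stride}: {side}x{side} grid, {tokens} tokens")
    # Plain ViT for comparison: a single stride-16 embedding.
    print("plain ViT (stride 16):", (224 // 16) ** 2, "tokens")
```

Under these assumptions the first stage processes a 56x56 token grid, compared with the plain ViT's single 14x14 grid, and only the last stage reaches the 7x7 grid at 1/32 resolution. Both the embedding-stride change and simple average pooling introduce no new learned weights in this toy version, which is in the spirit of the abstract's goal of reusing off-the-shelf plain ViT checkpoints.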

updated: Thu Nov 03 2022 13:19:23 GMT+0000 (UTC)

published: Thu Nov 03 2022 13:19:23 GMT+0000 (UTC)

arXiv
