Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Xin Yan; Zuchao Li; Lefei Zhang; Bo Du; Dacheng Tao

Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using the Vision Transformer (ViT). Previous works are either pixel-based or token-based, using original pixels or discrete visual tokens from parametric tokenizer models, respectively. Our proposed approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of a tokenizer model. The centroids represent both patch pixels and index tokens and have the property of local invariance. The non-parametric centroid tokenizer takes only seconds to create and is faster for token inference. Specifically, we adopt patch masking and centroid replacement strategies to construct corrupted inputs, and two stacked encoder blocks to predict corrupted patch tokens and reconstruct original patch pixels. Experiments show that a ViT-B model with only 300 epochs achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% on ADE20K semantic segmentation. Our approach achieves results competitive with BEiTv2 without distillation training from other models and outperforms other methods such as MAE.
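The centroid tokenizer and the corrupted-input construction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`extract_patches`, `fit_centroids`, `tokenize`, `corrupt`), the patch size, the number of centroids, the mask and replacement ratios, the use of zeros as a stand-in for a learnable mask embedding, and the choice to replace a patch with its own nearest centroid are all assumptions made for illustration.

```python
import numpy as np

def extract_patches(images, patch=4):
    """Split (N, H, W, C) images into flattened non-overlapping patches."""
    n, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images[:, :gh * patch, :gw * patch, :]
    x = x.reshape(n, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (N, gh, gw, patch, patch, C)
    return x.reshape(n, gh * gw, patch * patch * c)

def assign(data, centroids):
    """Index of the nearest centroid (squared Euclidean) for each row."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_centroids(patches, k=8, iters=10, seed=0):
    """Plain Lloyd's k-means over patch pixels; returns (k, D) centroids."""
    rng = np.random.default_rng(seed)
    data = patches.reshape(-1, patches.shape[-1]).astype(np.float64)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = assign(data, centroids)
        for j in range(k):
            members = data[ids == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def tokenize(patches, centroids):
    """Map each patch to the index of its nearest centroid (its token)."""
    n, p, d = patches.shape
    return assign(patches.reshape(-1, d), centroids).reshape(n, p)

def corrupt(patches, centroids, mask_ratio=0.4, replace_ratio=0.1, seed=0):
    """Build corrupted inputs: mask some patches, swap others for centroids.

    The ratios are hypothetical; zeros stand in for a learnable mask embedding.
    """
    rng = np.random.default_rng(seed)
    n, p, _ = patches.shape
    tokens = tokenize(patches, centroids)
    r = rng.random((n, p))
    masked = r < mask_ratio
    replaced = (r >= mask_ratio) & (r < mask_ratio + replace_ratio)
    out = patches.astype(np.float64).copy()
    out[masked] = 0.0
    out[replaced] = centroids[tokens[replaced]]
    return out, tokens, masked, replaced

if __name__ == "__main__":
    imgs = np.random.default_rng(1).random((4, 8, 8, 3))
    patches = extract_patches(imgs, patch=4)
    centroids = fit_centroids(patches, k=4)
    corrupted, tokens, masked, replaced = corrupt(patches, centroids)
    print(patches.shape, centroids.shape, tokens.shape)
```

In this sketch the tokenizer has no trainable parameters: fitting is a few k-means iterations, and tokenizing is a nearest-centroid lookup. The token indices would serve as prediction targets and the original patch pixels as reconstruction targets for an encoder, which is omitted here.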

updated: Wed Mar 08 2023 15:34:57 GMT+0000 (UTC)

published: Wed Mar 08 2023 15:34:57 GMT+0000 (UTC)

arXiv
