Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders

Heng Pan; Chenyang Liu; Wenxiao Wang; Li Yuan; Hongfa Wang; Zhifeng Li; Wei Liu

We present Img2Vec, an Image-to-Vector pipeline for masked image modeling (MIM) with deep features. To study which type of deep features is appropriate as a learning target for MIM, we propose a simple MIM framework that uses a series of well-trained self-supervised models to convert an image into a feature vector, which serves as the learning target; the feature extractor is also known as the teacher model. Surprisingly, we find empirically that an MIM model benefits more from image features generated by some lighter models (e.g., ResNet-50, 26M parameters) than from those generated by a cumbersome teacher such as a Transformer-based model (e.g., ViT-Large, 307M parameters). To analyze this phenomenon, we devise a novel attribute, token diversity, to characterize the features generated by different models. Token diversity measures the feature dissimilarity among different tokens. Through extensive experiments and visualizations, we hypothesize that, beyond the accepted view that a larger model can improve MIM, high token diversity in the teacher model is also crucial. Based on this, Img2Vec adopts a teacher model with high token diversity to generate image features. Img2Vec pre-trained with ViT-B on unlabeled ImageNet data yields 85.1% top-1 accuracy after fine-tuning. Moreover, we scale Img2Vec up to larger models, ViT-L and ViT-H, and obtain 86.7% and 87.5% accuracy, respectively. It also achieves state-of-the-art results on other downstream tasks, e.g., 51.8% mAP on COCO and 50.7% mIoU on ADE20K. Img2Vec is a simple yet effective framework tailored to deep-feature MIM learning, and it achieves strong overall performance on representative vision tasks.
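The abstract describes token diversity only as the feature dissimilarity among different tokens and does not give a formula. As a rough illustration, the sketch below assumes one plausible reading: the mean pairwise cosine dissimilarity (one minus cosine similarity) over all distinct token pairs of a single image. The function name `token_diversity`, the `(num_tokens, feature_dim)` input layout, and the choice of cosine dissimilarity are assumptions made for this example, not the authors' definition.

```python
import numpy as np


def token_diversity(tokens: np.ndarray, eps: float = 1e-8) -> float:
    """Mean pairwise cosine dissimilarity among token features.

    ASSUMPTION: this metric is an illustrative stand-in; the abstract does
    not specify how token diversity is computed.

    tokens: array of shape (num_tokens, feature_dim), e.g. the patch or
            spatial features a teacher model produces for one image.
    """
    n = tokens.shape[0]
    if n < 2:
        raise ValueError("need at least two tokens to compare")

    # L2-normalize each token so dot products become cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, eps)

    # Full (n, n) cosine-similarity matrix.
    sim = unit @ unit.T

    # Average over off-diagonal entries only (distinct token pairs).
    mean_sim = (sim.sum() - np.trace(sim)) / (n * (n - 1))

    # Higher value = tokens are less alike = more diverse.
    return float(1.0 - mean_sim)
```

Under this assumed definition, a teacher whose tokens are all identical scores 0 and a teacher whose tokens are mutually orthogonal scores 1, so features from different teachers could be compared on a common scale.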

updated: Tue Apr 25 2023 03:01:37 GMT+0000 (UTC)

published: Tue Apr 25 2023 03:01:37 GMT+0000 (UTC)

arXiv
