The effectiveness of MAE pre-pretraining for billion-scale pretraining

Mannat Singh; Quentin Duval; Kalyan Vasudev Alwala; Haoqi Fan; Vaibhav Aggarwal; Aaron Adcock; Armand Joulin; Piotr Dollár; Christoph Feichtenhofer; Ross Girshick; Rohit Girdhar; Ishan Misra

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
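As background on the MAE technique named in the abstract: a masked autoencoder hides a random subset of image patches and trains the model to reconstruct them from the patches that remain visible. The following is a minimal sketch of that random masking step only, not the authors' implementation. The function name `random_patch_mask`, the 75% mask ratio, and the 224x224 image with 16x16 patches are illustrative assumptions that are not stated in the abstract.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Hypothetical helper for illustration; mask_ratio=0.75 is an assumed
    default, not a value taken from the abstract.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # uniform random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # patches hidden from the encoder
    visible = sorted(indices[num_masked:])   # patches the encoder sees
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In the pipeline the abstract describes, a model trained this way would then serve as the initialization for the (weakly) supervised pretraining stage, rather than starting that stage from random weights.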

updated: Thu Mar 23 2023 17:56:12 GMT+0000 (UTC)

published: Thu Mar 23 2023 17:56:12 GMT+0000 (UTC)

arXiv
