EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang; Wen Wang; Binhui Xie; Quan Sun; Ledell Wu; Xinggang Wang; Tiejun Huang; Xinlong Wang; Yue Cao

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
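
The pretext task described in the abstract (predicting image-text aligned features at masked-out patch positions, given the visible patches) can be illustrated with a small sketch. It is not the released implementation: the uniform random masking, the 40% mask ratio, the negative cosine similarity objective, and the names `random_patch_mask` and `masked_feature_loss` are all assumptions made for illustration, and random arrays stand in for the teacher features and the student ViT output.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector; True marks a patch hidden from the student.

    Uniform random masking is an assumption; the abstract does not say
    how masked patches are chosen.
    """
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_feature_loss(student_pred, teacher_feat, mask, eps=1e-8):
    """Mean negative cosine similarity, computed over masked patches only.

    The choice of objective is an assumption for this sketch.
    """
    s = student_pred[mask]
    t = teacher_feat[mask]
    s = s / (np.linalg.norm(s, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return float(-(s * t).sum(axis=-1).mean())


rng = np.random.default_rng(0)
num_patches, dim = 196, 64  # hypothetical: a 14x14 patch grid, 64-d features

# Stand-in for per-patch image-text aligned target features.
teacher_feat = rng.normal(size=(num_patches, dim))
mask = random_patch_mask(num_patches, mask_ratio=0.4, rng=rng)

# Stand-in for the student ViT's per-patch predictions.
student_pred = rng.normal(size=(num_patches, dim))

loss_random = masked_feature_loss(student_pred, teacher_feat, mask)
loss_perfect = masked_feature_loss(teacher_feat, teacher_feat, mask)
```

Because the loss is computed only where `mask` is true, predictions at visible positions do not affect it. A student that reproduces the targets exactly reaches the minimum of about -1, while an untrained one sits near 0.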

updated: Mon Dec 05 2022 13:53:51 GMT+0000 (UTC)

published: Mon Nov 14 2022 18:59:52 GMT+0000 (UTC)

arXiv
