VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Limin Wang; Bingkun Huang; Zhiyu Zhao; Zhan Tong; Yinan He; Yi Wang; Yali Wang; Yu Qiao

Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-level models on video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics datasets (90.0% on K400 and 89.9% on K600) and the Something-Something datasets (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners.
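The dual masking idea in the abstract can be illustrated with a minimal sketch. It only shows how token indices might be split into an encoder-visible subset and a separate subset handled by the decoder. Everything specific here is an assumption for illustration and is not taken from the abstract: the function name `dual_mask`, the uniform random sampling (used as a simple stand-in for whatever masking pattern the authors use), the token grid size, and the two masking ratios.

```python
import numpy as np

def dual_mask(num_tokens, enc_mask_ratio, dec_mask_ratio, rng):
    """Hypothetical helper: split token indices into two disjoint subsets.

    enc_visible: tokens the encoder operates on.
    dec_targets: a subset of the remaining tokens that the decoder processes.
    Uniform random sampling is an assumption made for this sketch.
    """
    perm = rng.permutation(num_tokens)

    # The encoder keeps only a small fraction of all tokens.
    num_enc_visible = int(num_tokens * (1.0 - enc_mask_ratio))
    enc_visible = np.sort(perm[:num_enc_visible])
    enc_hidden = perm[num_enc_visible:]

    # The decoder handles only part of the tokens hidden from the encoder,
    # instead of reconstructing all of them.
    num_dec_targets = int(len(enc_hidden) * (1.0 - dec_mask_ratio))
    dec_targets = np.sort(enc_hidden[:num_dec_targets])
    return enc_visible, dec_targets

rng = np.random.default_rng(0)
num_tokens = 8 * 14 * 14  # assumed grid: 8 temporal x 14 x 14 spatial tokens
enc_visible, dec_targets = dual_mask(
    num_tokens, enc_mask_ratio=0.9, dec_mask_ratio=0.5, rng=rng
)
print(len(enc_visible), len(dec_targets), num_tokens)
```

Under these assumed ratios, both the encoder input and the set of decoder targets are much smaller than the full token sequence, which is the source of the cost reduction the abstract describes.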

updated: Wed Mar 29 2023 14:28:41 GMT+0000 (UTC)

published: Wed Mar 29 2023 14:28:41 GMT+0000 (UTC)

arXiv
