VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Zhan Tong; Yibing Song; Jue Wang; Limin Wang

Pre-training video transformers on extra-large-scale datasets is generally required to achieve top performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking and reconstruction. These simple designs turn out to be effective for overcoming the information leakage caused by temporal correlation during video reconstruction. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. Temporally redundant video content permits a higher masking ratio than images do. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction, which enforces high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between the pre-training and target datasets is an important issue in SSVP. Notably, our VideoMAE with the vanilla ViT backbone achieves 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data. Code will be released at https://github.com/MCG-NJU/VideoMAE.
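
The tube masking idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "tube masking" means drawing one random spatial mask over the patch grid and repeating it in every frame, so that a masked patch cannot simply be copied from the same location in a neighbouring frame. The function name `tube_mask`, its parameters, and the 14x14 patch grid are hypothetical choices for illustration and are not taken from the released code.

```python
import random


def tube_mask(num_frames, height, width, mask_ratio=0.9, seed=None):
    """Return a nested list [frame][row][col] of booleans; True means masked.

    Illustrative sketch only: one spatial pattern is sampled and shared by
    all frames, so each masked position forms a "tube" through time.
    """
    rng = random.Random(seed)
    num_positions = height * width
    num_masked = round(mask_ratio * num_positions)

    # Sample which spatial positions are hidden (without replacement).
    masked = set(rng.sample(range(num_positions), num_masked))

    # Build the spatial mask for a single frame.
    frame_mask = [
        [(r * width + c) in masked for c in range(width)]
        for r in range(height)
    ]

    # Repeat the same spatial pattern for every frame.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]


# Hypothetical setting: 8 temporal slices over a 14x14 patch grid, 90% masked.
mask = tube_mask(num_frames=8, height=14, width=14, mask_ratio=0.9, seed=0)
masked_per_frame = sum(v for row in mask[0] for v in row)
```

With a 90% ratio on a 14x14 grid, 176 of the 196 positions are hidden in every frame, and the pattern is identical across frames. Sampling an independent mask per frame would instead leave most positions visible in at least one frame, which is the kind of temporal information leakage the abstract refers to.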

updated: Wed Mar 23 2022 17:55:10 GMT+0000 (UTC)

published: Wed Mar 23 2022 17:55:10 GMT+0000 (UTC)

arXiv
