VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Zhan Tong; Yibing Song; Jue Wang; Limin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve top performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, which encourages the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. Temporally redundant video content permits a higher masking ratio than images do. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between the pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 85.8% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
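The tube masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It only shows the core idea under stated assumptions: one random spatial mask is drawn at the chosen ratio and then repeated along the time axis, so a masked patch position stays masked across the whole clip. The function name `tube_mask`, the 14x14 patch grid, and the 8 temporal slots are hypothetical choices made for illustration.

```python
import random


def tube_mask(num_slots, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a [num_slots][grid_h][grid_w] nested list of booleans.

    True means the patch is masked. The same spatial pattern is reused
    for every temporal slot, which is what makes the mask a "tube".
    All parameter names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_spatial = grid_h * grid_w
    num_masked = round(mask_ratio * num_spatial)

    # Draw the masked spatial positions once, without replacement.
    masked = set(rng.sample(range(num_spatial), num_masked))

    # Build the 2D spatial mask from the flat indices.
    spatial = [
        [(r * grid_w + c) in masked for c in range(grid_w)]
        for r in range(grid_h)
    ]

    # Copy the same spatial mask into every temporal slot.
    return [[row[:] for row in spatial] for _ in range(num_slots)]


if __name__ == "__main__":
    mask = tube_mask(num_slots=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
    visible_per_slot = sum(not m for row in mask[0] for m in row)
    print("visible patches per temporal slot:", visible_per_slot)
```

Because the spatial pattern is shared across time, a masked patch cannot be recovered by copying the same location from a neighbouring frame, which is one plausible reading of why the abstract describes reconstruction as a more challenging task at a 90% to 95% ratio.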

updated: Thu Jul 07 2022 14:38:38 GMT+0000 (UTC)

published: Wed Mar 23 2022 17:55:10 GMT+0000 (UTC)

arXiv
