Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Rui Wang; Dongdong Chen; Zuxuan Wu; Yinpeng Chen; Xiyang Dai; Mengchen Liu; Lu Yuan; Yu-Gang Jiang

Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch by reconstructing low-level features such as raw pixel RGB values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: first, we pretrain an image (or video) model by recovering low-level features of masked patches; then, we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally heavy video tasks, while image teachers transfer stronger spatial representations for spatially heavy video tasks. Visualization analysis also indicates that different teachers produce different learned patterns for students. Motivated by this observation, we design a spatial-temporal co-teaching method for MVD to leverage the advantages of different teachers. Specifically, we distill student models from both video teachers and image teachers by masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled with a single teacher on a multitude of video datasets. Our MVD with vanilla ViT achieves state-of-the-art performance compared with previous supervised or self-supervised methods on several challenging video downstream tasks. For example, with the ViT-Large model, our MVD achieves 86.4% and 75.9% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 1.6%, respectively. Code will be available at https://github.com/ruiwang2021/mvd.
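
The co-teaching objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the loss function, the mask ratio, or how the two teachers are weighted, so the mean squared error, the `mask_ratio` argument, and the `w_img` / `w_vid` weights below are all assumptions, and every function name is hypothetical. Features are plain nested lists (one vector per token) to keep the example self-contained.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Pick which token positions are hidden from the student (assumed random masking)."""
    num_masked = round(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def masked_feature_loss(pred, target, masked_idx):
    """Mean squared error between student predictions and teacher features,
    computed only at masked positions (MSE is an assumption, not from the abstract)."""
    if not masked_idx:
        raise ValueError("at least one token must be masked")
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


def co_teaching_loss(pred_img, pred_vid, img_target, vid_target,
                     masked_idx, w_img=1.0, w_vid=1.0):
    """Combine the two distillation terms: one against the image teacher's
    features and one against the video teacher's features.
    The equal default weights are an assumption."""
    loss_img = masked_feature_loss(pred_img, img_target, masked_idx)
    loss_vid = masked_feature_loss(pred_vid, vid_target, masked_idx)
    return w_img * loss_img + w_vid * loss_vid


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, dim = 8, 4
    masked_idx = random_mask(num_tokens, mask_ratio=0.5, rng=rng)

    # Stand-ins for stage-1 teacher outputs and student predictions.
    def toy_features():
        return [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]

    img_target, vid_target = toy_features(), toy_features()
    pred_img, pred_vid = toy_features(), toy_features()

    loss = co_teaching_loss(pred_img, pred_vid, img_target, vid_target, masked_idx)
    print(f"masked positions: {masked_idx}, co-teaching loss: {loss:.4f}")
```

In this sketch the student is assumed to produce one prediction per teacher (`pred_img` and `pred_vid`), so that each set of teacher features has its own regression target; a single-teacher variant corresponds to setting one of the weights to zero.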

updated: Thu Dec 08 2022 18:59:59 GMT+0000 (UTC)

published: Thu Dec 08 2022 18:59:59 GMT+0000 (UTC)

arXiv
