MGMAE: Motion Guided Masking for Video Masked Autoencoding

Bingkun Huang; Zhiyu Zhao; Guozhen Zhang; Yu Qiao; Limin Wang

Masked autoencoding has shown excellent performance in self-supervised video representation learning. Because of temporal redundancy in video, VideoMAE adopts a high masking ratio and a customized masking strategy. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume. Based on this masking volume, we can track the unmasked tokens over time and sample a set of temporally consistent cubes from videos. These temporally aligned unmasked tokens further relieve the temporal information leakage issue and encourage MGMAE to learn more useful structural information. We implement MGMAE with an efficient online optical flow estimator and a backward masking map warping strategy. We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of MGMAE over the original VideoMAE. In addition, we provide a visualization analysis to illustrate that MGMAE can sample temporally consistent cubes in a motion-adaptive manner for more effective video pre-training.
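
The idea of warping a masking map backward along optical flow and then picking visible tokens per frame can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`warp_mask_backward`, `build_mask_volume`, `visible_token_indices`), the nearest-neighbor sampling, the border clipping, and the assumption that the backward flow is already given (rather than estimated online) are all choices made here for illustration only.

```python
import numpy as np

def warp_mask_backward(mask, flow):
    """Warp a masking map from frame t to frame t+1 (hypothetical helper).

    mask: (H, W) array for frame t, 1 = masked, 0 = visible.
    flow: (H, W, 2) backward flow, assumed given; flow[y, x] = (dx, dy)
          points from pixel (x, y) in frame t+1 to its source in frame t.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor lookup of the source position, clipped to the frame.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return mask[src_y, src_x]

def build_mask_volume(init_mask, backward_flows):
    """Propagate an initial masking map through time, one flow per step."""
    masks = [init_mask]
    for flow in backward_flows:
        masks.append(warp_mask_backward(masks[-1], flow))
    return np.stack(masks)  # shape (T, H, W)

def visible_token_indices(mask_volume, patch, num_visible):
    """Per frame, pick the num_visible tokens with the lowest mask value."""
    t, h, w = mask_volume.shape
    gh, gw = h // patch, w // patch
    # Average-pool each frame's pixel-level map into a (gh, gw) token grid.
    cropped = mask_volume[:, :gh * patch, :gw * patch]
    tokens = cropped.reshape(t, gh, patch, gw, patch).mean(axis=(2, 4))
    order = np.argsort(tokens.reshape(t, gh * gw), axis=1, kind="stable")
    return np.sort(order[:, :num_visible], axis=1)

# Toy example: one visible 2x2 block whose content moves 2 pixels right.
mask = np.ones((4, 8))
mask[0:2, 2:4] = 0.0
flow = np.zeros((4, 8, 2))
flow[..., 0] = -2.0  # each pixel in frame t+1 came from 2 pixels to the left
volume = build_mask_volume(mask, [flow])
visible = visible_token_indices(volume, patch=2, num_visible=1)
```

In this toy setup the visible token shifts by one position in the token grid between the two frames, so the same moving content stays unmasked instead of a fixed spatial location.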

updated: Mon Aug 21 2023 15:39:41 GMT+0000 (UTC)

published: Mon Aug 21 2023 15:39:41 GMT+0000 (UTC)

arXiv
