VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Hao Tan; Jie Lei; Thomas Wolf; Mohit Bansal

Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We pre-train our model on uncurated videos and show that our pre-trained model can reach state-of-the-art results on several video understanding datasets (e.g., SSV2, Diving48). Lastly, we provide detailed analyses on model scalability and pre-training method design. Code is released at https://github.com/airsplay/vimpac.
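
The block-wise masking idea described above can be illustrated with a minimal sketch. It assumes the video has already been discretized into a T x H x W grid of tokens; the function name `blockwise_mask`, the `mask_ratio` and `max_block` parameters, and the block-sampling procedure are all hypothetical choices made for illustration, not the authors' implementation (see the released code for that).

```python
import random


def blockwise_mask(T, H, W, mask_ratio=0.5, max_block=(2, 3, 3), seed=0):
    """Return a T x H x W nested list of booleans; True marks a masked token.

    Assumption: contiguous spatio-temporal blocks are sampled at random until
    at least `mask_ratio` of all tokens are masked. The block-size limits and
    the sampling scheme are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    mask = [[[False] * W for _ in range(H)] for _ in range(T)]
    target = int(mask_ratio * T * H * W)
    masked = 0
    while masked < target:
        # Sample a block size along time, height, and width.
        bt = rng.randint(1, min(max_block[0], T))
        bh = rng.randint(1, min(max_block[1], H))
        bw = rng.randint(1, min(max_block[2], W))
        # Sample the block's corner so the block stays inside the grid.
        t0 = rng.randint(0, T - bt)
        h0 = rng.randint(0, H - bh)
        w0 = rng.randint(0, W - bw)
        # Mask every token in the block, counting only newly masked ones.
        for t in range(t0, t0 + bt):
            for h in range(h0, h0 + bh):
                for w in range(w0, w0 + bw):
                    if not mask[t][h][w]:
                        mask[t][h][w] = True
                        masked += 1
    return mask


if __name__ == "__main__":
    m = blockwise_mask(T=4, H=8, W=8)
    # Print the first frame: '#' is masked, '.' is visible.
    for row in m[0]:
        print("".join("#" if x else "." for x in row))
```

Because whole neighboring blocks are hidden together, a masked token cannot be recovered simply by copying an adjacent visible token, which is the failure mode the abstract attributes to uniform masking of individual tokens.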

updated: Mon Jun 21 2021 16:48:19 GMT+0000 (UTC)

published: Mon Jun 21 2021 16:48:19 GMT+0000 (UTC)

arXiv
