An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Tsu-Jui Fu; Linjie Li; Zhe Gan; Kevin Lin; William Yang Wang; Lijuan Wang; Zicheng Liu

Masked visual modeling (MVM) has recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, the pre-extracted video features in previous studies cannot be refined through MVM during pre-training, leading to unsatisfactory downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), which mitigates the disconnection between fixed video representations and MVM training. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training. Empirically, we show that VIOLET pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, spanning video question answering, video captioning, and text-to-video retrieval.
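
To make the idea of a reconstructive MVM objective concrete, the following is a minimal sketch of the simplest target named above, raw pixel values. It is not the authors' implementation: the patch sizes, the mask ratio, the use of mean absolute error, and the helper names `mask_patches` and `mvm_pixel_loss` are all assumptions chosen for illustration. The one property it is meant to show is that the loss is computed only over the patches that were hidden from the model.

```python
import random


def mask_patches(num_patches, mask_ratio, rng):
    """Choose which patch indices to hide (hypothetical helper)."""
    num_masked = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def mvm_pixel_loss(target_patches, predicted_patches, masked_indices):
    """Mean absolute error over masked patches only (assumed loss form)."""
    total, count = 0.0, 0
    for i in masked_indices:
        for t, p in zip(target_patches[i], predicted_patches[i]):
            total += abs(t - p)
            count += 1
    return total / count


# Toy data: 8 patches of 4 pixel values each (sizes are assumptions).
rng = random.Random(0)
patches = [[rng.random() for _ in range(4)] for _ in range(8)]
masked = mask_patches(len(patches), mask_ratio=0.25, rng=rng)

# Stand-in for a model: predicts 0.5 for every pixel of every patch.
predictions = [[0.5] * 4 for _ in patches]
loss = mvm_pixel_loss(patches, predictions, masked)
```

Swapping the pixel target for another one from the abstract (for example depth maps or optical flow) would, under this sketch, only change what `target_patches` holds; the masking and the masked-only loss would stay the same.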

updated: Sun Sep 04 2022 06:30:32 GMT+0000 (UTC)

published: Sun Sep 04 2022 06:30:32 GMT+0000 (UTC)

arXiv
