Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Hongwei Xue; Tiankai Hang; Yanhong Zeng; Yuchong Sun; Bei Liu; Huan Yang; Jianlong Fu; Baining Guo

We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) it is the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) it is the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-trained model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases in R@1 of 40.4% on the zero-shot MSR-VTT text-to-video retrieval task and 55.4% on the high-resolution LSMDC dataset. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual editing and super-resolution tasks.
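
The retrieval results in the abstract are reported as relative increases in R@1 (recall at rank 1). As a minimal sketch of how such numbers are usually computed, and not the authors' evaluation code, the Python below assumes a text-to-video similarity matrix in which row i is a text query and column i is its ground-truth video. The function names `recall_at_k` and `relative_increase` and all numeric values are hypothetical, chosen only for illustration.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose ground-truth video (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the ground-truth video = number of videos scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def relative_increase(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Toy 3x3 text-to-video similarity matrix (made-up numbers, not from the paper).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)  # queries 0 and 2 retrieve their own video first
```

Note that a relative increase is measured against the baseline score, so it is not the same as a gain in absolute percentage points: under this definition, moving from a hypothetical R@1 of 10.0 to 15.0 is a 50% relative increase.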

updated: Fri Jul 08 2022 08:46:43 GMT+0000 (UTC)

published: Fri Nov 19 2021 17:36:01 GMT+0000 (UTC)

arXiv
