InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang; Yinan He; Yizhuo Li; Kunchang Li; Jiashuo Yu; Xin Ma; Xinyuan Chen; Yaohui Wang; Ping Luo; Ziwei Liu; Yali Wang; Limin Wang; Yu Qiao

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously building a high-quality video-text dataset with large language models (LLMs), and a demonstration of its efficacy in learning video-language representations at scale. Specifically, we use a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks such as recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, and for advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
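To put the quoted totals in perspective, the following is a rough back-of-the-envelope sketch of the per-clip and per-video averages they imply. It uses only the rounded figures from the abstract, and it assumes (this is not stated in the abstract) that the 234M clips together cover the full 760K hours without overlap, so the results are approximations rather than reported statistics. The function and constant names are illustrative only.

```python
# Rounded figures quoted in the abstract ("over 7 million", "nearly 760K hours").
NUM_VIDEOS = 7_000_000
TOTAL_HOURS = 760_000
NUM_CLIPS = 234_000_000
TOTAL_WORDS = 4_100_000_000


def dataset_averages():
    """Return rough averages implied by the quoted totals.

    Assumption (not from the abstract): clips cover the whole video
    duration with no overlap, so total clip time equals total video time.
    """
    return {
        "seconds_per_clip": TOTAL_HOURS * 3600 / NUM_CLIPS,
        "words_per_clip": TOTAL_WORDS / NUM_CLIPS,
        "clips_per_video": NUM_CLIPS / NUM_VIDEOS,
        "minutes_per_video": TOTAL_HOURS * 60 / NUM_VIDEOS,
    }


if __name__ == "__main__":
    for name, value in dataset_averages().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, a clip averages roughly 11.7 seconds with a description of about 17.5 words, and each source video contributes about 33 clips.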

updated: Thu Jul 13 2023 17:58:32 GMT+0000 (UTC)

published: Thu Jul 13 2023 17:58:32 GMT+0000 (UTC)

arXiv

