mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Haiyang Xu; Qinghao Ye; Ming Yan; Yaya Shi; Jiabo Ye; Yuanhong Xu; Chenliang Li; Bin Bi; Qi Qian; Wei Wang; Guohai Xu; Ji Zhang; Songfang Huang; Fei Huang; Jingren Zhou

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks, respectively, with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
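
The composition idea in the abstract (modality-specific modules feeding a shared universal module, with modules selected per task) can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`text_module`, `image_module`, `video_module`, `universal_module`, `compose`, `encode`) are hypothetical, and simple arithmetic on lists stands in for what would be neural network components in the real model.

```python
from typing import Callable, Dict, List

Vector = List[float]
Module = Callable[[Vector], Vector]


# Hypothetical modality-specific modules. Each one is kept separate so
# that the processing of one modality does not interfere with another.
def text_module(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


def image_module(x: Vector) -> Vector:
    return [v * 2.0 for v in x]


def video_module(x: Vector) -> Vector:
    return [v * 2.0 + 0.5 for v in x]


# Hypothetical universal module shared by every modality. Here it only
# centers the features so all modalities land in a comparable space.
def universal_module(x: Vector) -> Vector:
    mean = sum(x) / len(x)
    return [v - mean for v in x]


MODALITY_MODULES: Dict[str, Module] = {
    "text": text_module,
    "image": image_module,
    "video": video_module,
}


def compose(modality: str) -> Module:
    """Build a pipeline: modality-specific module, then the shared one."""
    specific = MODALITY_MODULES[modality]

    def pipeline(x: Vector) -> Vector:
        return universal_module(specific(x))

    return pipeline


def encode(inputs: Dict[str, Vector]) -> Dict[str, Vector]:
    """Select only the modules needed for the modalities a task provides."""
    return {modality: compose(modality)(x) for modality, x in inputs.items()}


if __name__ == "__main__":
    # A video-text task uses the video and text modules; the image
    # module is never touched.
    print(encode({"video": [1.0, 3.0], "text": [1.0, 3.0]}))
```

In this sketch a task simply passes in the modalities it has, and only the matching modules are composed with the shared one. How the actual model wires its modules together is described in the paper, not here.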

updated: Wed Feb 01 2023 12:40:03 GMT+0000 (UTC)

published: Wed Feb 01 2023 12:40:03 GMT+0000 (UTC)

arXiv

