i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Ziyi Yang; Mahmoud Khademi; Yichong Xu; Reid Pryzant; Yuwei Fang; Chenguang Zhu; Dongdong Chen; Yao Qian; Mei Gao; Yi-Ling Chen; Robert Gmyr; Naoyuki Kanda; Noel Codella; Bin Xiao; Yu Shi; Lu Yuan; Takuya Yoshioka; Michael Zeng; Xuedong Huang



The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence. However, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders and combines their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. Language tokens are then generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
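The data flow described in the abstract (per-modality encoders, a fusion step into a shared space, then autoregressive decoding) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`make_toy_encoder`, `fuse`, `generate`, `VOCAB`, `DIM`) is hypothetical, the "encoders" are seeded random projections, and the "fusion" is a simple mean-mixing stand-in for the paper's modality-fusing encoder. It only shows how any subset of modalities might feed one decoder loop.

```python
import random

DIM = 8  # size of the toy shared representation space (assumption)
VOCAB = ["<eos>", "a", "dog", "barks", "loudly"]  # hypothetical toy vocabulary


def make_toy_encoder(name):
    """Stand-in for a pretrained single-modality encoder (hypothetical)."""
    def encode(items):
        out = []
        for item in items:
            # Deterministic pseudo-features derived from modality name and item.
            rng = random.Random(f"{name}:{item}")
            out.append([rng.uniform(-1.0, 1.0) for _ in range(DIM)])
        return out
    return encode


ENCODERS = {m: make_toy_encoder(m) for m in ("vision", "language", "speech")}


def fuse(features_by_modality):
    """Toy fusion: concatenate all sequences, then mix each vector with the global mean."""
    sequence = [v for feats in features_by_modality.values() for v in feats]
    if not sequence:
        raise ValueError("at least one non-empty modality is required")
    mean = [sum(col) / len(sequence) for col in zip(*sequence)]
    return [[0.5 * x + 0.5 * m for x, m in zip(v, mean)] for v in sequence]


def token_embedding(token):
    """Deterministic toy embedding for a vocabulary token."""
    rng = random.Random(f"token:{token}")
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def generate(inputs, max_len=5):
    """Greedy autoregressive decoding conditioned on whichever modalities are given."""
    features = {m: ENCODERS[m](items) for m, items in inputs.items()}
    fused = fuse(features)
    context = [sum(col) / len(fused) for col in zip(*fused)]
    tokens = []
    for _ in range(max_len):
        # Each step conditions on the fused context and on all tokens generated so far.
        if tokens:
            embs = [token_embedding(t) for t in tokens]
            history = [sum(col) / len(embs) for col in zip(*embs)]
        else:
            history = [0.0] * DIM
        state = [c + h for c, h in zip(context, history)]
        scores = {t: sum(s * e for s, e in zip(state, token_embedding(t))) for t in VOCAB}
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens
```

For example, `generate({"vision": ["frame0", "frame1"], "speech": ["chunk0"]})` and `generate({"language": ["hello"]})` both run through the same decoder loop; only the set of encoders feeding the fusion step changes. The generated tokens are meaningless here because nothing is trained.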

updated: Sun May 21 2023 01:25:44 GMT+0000 (UTC)

published: Sun May 21 2023 01:25:44 GMT+0000 (UTC)

arXiv
