Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang; Li Dong; Wenhui Wang; Yaru Hao; Saksham Singhal; Shuming Ma; Tengchao Lv; Lei Cui; Owais Khan Mohammed; Barun Patra; Qiang Liu; Kriti Aggarwal; Zewen Chi; Johan Bjorck; Vishrav Chaudhary; Subhojit Som; Xia Song; Furu Wei

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.

updated: Wed Mar 01 2023 11:04:51 GMT+0000 (UTC)

published: Mon Feb 27 2023 18:55:27 GMT+0000 (UTC)

arXiv
