Grounding Language Models to Images for Multimodal Inputs and Outputs

Jing Yu Koh; Ruslan Salakhutdinov; Daniel Fried

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
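The frozen-model-plus-linear-layers idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the names `W_in`, `W_out`, `W_img`, the averaging stand-in for the frozen language model, and the use of cosine similarity for retrieval are all assumptions made for illustration. Only the overall shape (a frozen model, a trainable linear layer on the input side, and a trainable linear layer on the output side used to retrieve images) comes from the abstract.

```python
import math
import random

# Hypothetical sizes: visual feature, LM embedding, and retrieval dimensions.
VIS_DIM, LM_DIM, RET_DIM = 8, 6, 4


def random_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Bias-free linear layer: multiply the weight matrix by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


rng = random.Random(0)

# The only parameters that would be trained (randomly initialised here).
W_in = random_matrix(LM_DIM, VIS_DIM, rng)    # visual features -> LM input space
W_out = random_matrix(RET_DIM, LM_DIM, rng)   # LM hidden state -> retrieval space
W_img = random_matrix(RET_DIM, VIS_DIM, rng)  # assumed: visual features -> retrieval space


def frozen_lm(input_embeddings):
    """Stand-in for the frozen LM (assumption): average the inputs.

    A real system would call a pretrained transformer whose weights are
    never updated; this toy function simply has no trainable weights.
    """
    n = len(input_embeddings)
    return [sum(column) / n for column in zip(*input_embeddings)]


def embed_image(visual_features):
    """Input side: map an image into the LM's embedding space."""
    return linear(W_in, visual_features)


def retrieve_image(hidden_state, candidate_features):
    """Output side: score candidate images against the LM hidden state."""
    query = linear(W_out, hidden_state)
    scores = [cosine(query, linear(W_img, f)) for f in candidate_features]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Interleave text embeddings with one projected image, then retrieve.
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(LM_DIM)] for _ in range(3)]
image_features = [rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)]
sequence = text_embeddings[:2] + [embed_image(image_features)] + text_embeddings[2:]
hidden = frozen_lm(sequence)

candidates = [[rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)] for _ in range(5)]
best_index, scores = retrieve_image(hidden, candidates)
print(best_index, [round(s, 3) for s in scores])
```

Because the weights are random, the retrieved index is meaningless here. The sketch is only meant to show where the trainable layers sit relative to the frozen model, and that an image can appear at any position in the input sequence once it has been projected into the same space as the text embeddings.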

updated: Tue Jun 13 2023 21:54:58 GMT+0000 (UTC)

published: Tue Jan 31 2023 18:33:44 GMT+0000 (UTC)

arXiv
