Generating Images with Multimodal Language Models

Jing Yu Koh; Daniel Fried; Ruslan Salakhutdinov

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text, and it outperforms non-LLM-based generation models across several text-to-image tasks that measure context dependence.
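The two components named in the abstract, a mapping network from LLM hidden representations into the visual embedding space and a decision module that chooses between retrieval and generation, can be illustrated with a toy sketch. This is not the authors' implementation: the single linear projection, the logistic decision rule, the dimensions, the random weights, and every function name below are assumptions chosen only to make the data flow concrete.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def map_to_visual_space(hidden, mapper_weights):
    """Project an LLM hidden state into a (hypothetical) visual embedding space.

    Assumption: a single linear layer; the abstract does not specify the
    architecture of the mapping network.
    """
    return [sum(w * x for w, x in zip(row, hidden)) for row in mapper_weights]


def decide_retrieve_or_generate(hidden, decision_weights, threshold=0.5):
    """Toy decision module conditioned on the LLM hidden state.

    Assumption: logistic regression over the hidden state, returning the
    chosen action and the estimated probability of "generate".
    """
    logit = sum(w * x for w, x in zip(decision_weights, hidden))
    p_generate = 1.0 / (1.0 + math.exp(-logit))
    action = "generate" if p_generate >= threshold else "retrieve"
    return action, p_generate


# Illustrative dimensions only; real models are far larger.
LLM_DIM, VISUAL_DIM = 8, 4
rng = random.Random(0)

hidden_state = [rng.gauss(0, 1) for _ in range(LLM_DIM)]   # stand-in for a frozen LLM output
mapper = make_linear(LLM_DIM, VISUAL_DIM, rng)              # stand-in for trained mapper weights
decision = [rng.uniform(-1, 1) for _ in range(LLM_DIM)]     # stand-in for trained decision weights

visual_embedding = map_to_visual_space(hidden_state, mapper)
action, p_generate = decide_retrieve_or_generate(hidden_state, decision)
print(len(visual_embedding), action)
```

In this sketch the LLM itself is never updated; only the mapper and decision weights would be trained, which mirrors the frozen-LLM setup described above.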

updated: Tue Jun 13 2023 22:13:51 GMT+0000 (UTC)

published: Fri May 26 2023 19:22:03 GMT+0000 (UTC)

arXiv
