Retrieval-Augmented Multimodal Language Modeling

Michihiro Yasunaga; Armen Aghajanyan; Weijia Shi; Rich James; Jure Leskovec; Percy Liang; Mike Lewis; Luke Zettlemoyer; Wen-tau Yih

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
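The retrieve-then-generate flow described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that memory documents and the query have already been embedded into a shared vector space (the role a pretrained CLIP encoder plays in the abstract), that retrieval ranks documents by cosine similarity, and that the retrieved documents are simply placed in front of the prompt as context for the generator. The toy vectors, the `<sep>` separator, and the function names `retrieve` and `build_generator_input` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Each memory entry pairs a document (a placeholder string here, standing in
# for an image-text pair) with a toy embedding. Real embeddings would come
# from a pretrained encoder such as CLIP; these values are made up.
MemoryEntry = Tuple[str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: Sequence[float], memory: List[MemoryEntry], k: int) -> List[str]:
    """Return the k memory documents most similar to the query embedding."""
    ranked = sorted(memory, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_generator_input(retrieved: List[str], prompt: str) -> str:
    """Place retrieved documents before the prompt (assumed layout)."""
    return " <sep> ".join(list(retrieved) + [prompt])


MEMORY: List[MemoryEntry] = [
    ("eiffel tower photo + caption", [0.9, 0.1, 0.0]),
    ("golden gate bridge photo + caption", [0.1, 0.9, 0.0]),
    ("eiffel tower at night + caption", [0.8, 0.2, 0.1]),
]

query = [1.0, 0.0, 0.0]  # toy embedding of the prompt
top = retrieve(query, MEMORY, k=2)
context = build_generator_input(top, "an image of the Eiffel Tower")
print(context)
```

In this toy setup the two Eiffel Tower entries rank above the bridge entry, so the generator would see them as context. The modularity the abstract points to shows up here as the fact that `MEMORY` can be extended without retraining anything.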

updated: Tue Jun 06 2023 00:28:34 GMT+0000 (UTC)

published: Tue Nov 22 2022 20:26:44 GMT+0000 (UTC)

arXiv
