Do DALL-E and Flamingo Understand Each Other?

Hang Li; Jindong Gu; Rajat Koner; Sahand Sharifzadeh; Volker Tresp

Multimodal research on the comprehension and creation of both images and text has made significant strides. This progress is exemplified by large-scale image captioning models, such as Flamingo, and by text-to-image generative models, of which DALL-E is a prominent example. An interesting question in this domain is whether Flamingo and DALL-E understand each other. To study this question, we propose a reconstruction task in which Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image. We argue that these models understand each other if the generated image is similar to the given image. Specifically, we study the relationship between the quality of the image reconstruction and that of the text generation. We find that an optimal description of an image is one that gives rise to a generated image similar to the original one. This finding motivates us to propose a unified framework to fine-tune the text-to-image and image-to-text models. Concretely, the reconstruction part forms a regularization loss to guide the tuning of the models. Extensive experiments on multiple datasets with different image captioning and image generation models validate our findings and demonstrate the effectiveness of our proposed unified framework. As DALL-E and Flamingo are not publicly available, we use Stable Diffusion and BLIP in the remaining work. Project website: https://dalleflamingo.github.io.
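The reconstruction task in the abstract (image to caption to image, followed by a similarity check) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `caption_fn`, `generate_fn`, and `embed_fn` are hypothetical placeholders for a captioning model (for example BLIP), a text-to-image model (for example Stable Diffusion), and an image encoder, and the use of cosine similarity between image embeddings is an assumption, since the abstract does not say how similarity is measured.

```python
import math
from typing import Any, Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_score(
    image: Any,
    caption_fn: Callable[[Any], str],      # hypothetical image-to-text model
    generate_fn: Callable[[str], Any],     # hypothetical text-to-image model
    embed_fn: Callable[[Any], Vector],     # hypothetical image encoder
) -> float:
    """Caption the image, regenerate an image from the caption, and compare."""
    caption = caption_fn(image)
    reconstructed = generate_fn(caption)
    return cosine_similarity(embed_fn(image), embed_fn(reconstructed))


def best_caption(
    image: Any,
    candidates: Sequence[str],
    generate_fn: Callable[[str], Any],
    embed_fn: Callable[[Any], Vector],
) -> str:
    """Pick the candidate caption whose generated image is closest to the original."""
    target = embed_fn(image)
    return max(
        candidates,
        key=lambda c: cosine_similarity(target, embed_fn(generate_fn(c))),
    )
```

`best_caption` mirrors the abstract's finding that the optimal description is the one whose generated image most resembles the original. In practice the three callables would wrap real models, and the same score could serve as the signal that a reconstruction-based regularization term is built from.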

updated: Fri Aug 18 2023 18:44:51 GMT+0000 (UTC)

published: Fri Dec 23 2022 10:46:56 GMT+0000 (UTC)

arXiv
