Illiterate DALL-E Learns to Compose

Gautam Singh; Fei Deng; Sungjin Ahn

Although DALL-E has shown an impressive capacity for composition-based systematic generalization in image generation, it requires a dataset of text-image pairs, and the compositionality is provided by the text. In contrast, object-centric representation models such as Slot Attention learn composable representations without text prompts. However, unlike DALL-E, their ability to generalize systematically in zero-shot generation is significantly limited. In this paper, we propose a simple but novel slot-based autoencoding architecture, called SLATE, that combines the best of both worlds: it learns object-centric representations that allow systematic generalization in zero-shot image generation without text. As such, the model can also be seen as an illiterate DALL-E. Rather than the pixel-mixture decoders used by existing object-centric representation models, we propose using an Image GPT decoder conditioned on the slots to capture complex interactions among the slots and pixels. In experiments, we show that this simple, easy-to-implement architecture, which requires no text prompt, significantly improves in-distribution and out-of-distribution (zero-shot) image generation and yields slot-attention structure that is qualitatively comparable to or better than that of models based on mixture decoders.
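
For readers unfamiliar with the pixel-mixture decoders that the abstract contrasts SLATE against, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes each slot separately predicts a colour and a mask logit for every pixel, and that the final image is the mask-weighted average of the per-slot colours. The names `mixture_decode`, `slot_rgb`, and `slot_mask_logits` are hypothetical and chosen only for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_decode(slot_rgb, slot_mask_logits):
    """Composite per-slot predictions into one image (illustrative only).

    slot_rgb[k][p] is the (r, g, b) colour that slot k predicts for pixel p.
    slot_mask_logits[k][p] is slot k's unnormalised claim on pixel p.
    """
    num_slots = len(slot_rgb)
    num_pixels = len(slot_rgb[0])
    image = []
    for p in range(num_pixels):
        # Normalise the slots' claims on this pixel so they sum to 1.
        weights = softmax([slot_mask_logits[k][p] for k in range(num_slots)])
        # The pixel is a weighted average of the slots' colour predictions.
        pixel = tuple(
            sum(weights[k] * slot_rgb[k][p][c] for k in range(num_slots))
            for c in range(3)
        )
        image.append(pixel)
    return image


# Toy input: two slots, two pixels.
slot_rgb = [
    [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # slot 0 predicts red everywhere
    [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],  # slot 1 predicts blue everywhere
]
slot_mask_logits = [
    [10.0, 0.0],   # slot 0 strongly claims pixel 0, is neutral on pixel 1
    [-10.0, 0.0],  # slot 1 disclaims pixel 0, is neutral on pixel 1
]
image = mixture_decode(slot_rgb, slot_mask_logits)
```

In this sketch every pixel is computed independently of every other pixel once the slots are given, so the decoder has no way to model dependencies between pixels beyond what the slots encode. A slot-conditioned autoregressive decoder such as the one the abstract proposes instead predicts each image element conditioned on both the slots and the previously generated elements.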

updated: Mon Mar 14 2022 21:10:39 GMT+0000 (UTC)

published: Sun Oct 17 2021 16:40:47 GMT+0000 (UTC)

arXiv
