Freestyle Layout-to-Image Synthesis

Han Xue; Zhiwu Huang; Qianru Sun; Li Song; Wenjun Zhang

Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., the 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and we call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., for image classification and object detection) trained on limited base classes have gained the ability to predict unseen classes. Inspired by this, we leverage large-scale pre-trained text-to-image diffusion models to generate unseen semantics. The key challenge of FLIS is enabling the diffusion model to synthesize images from a specific layout that very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to force each text token to act only on the pixels in a specified region, allowing us to freely place a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic, freestyle layout-to-image generation results with diverse text inputs, which has high potential to enable a range of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet.
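
The masking idea behind RCA can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows one plausible way to restrict each text token to its layout region. The function name, the single-head formulation, the tensor shapes, and the choice to mask scores with negative infinity before the softmax are all assumptions made for illustration.

```python
import numpy as np


def rectified_cross_attention(q, k, v, token_masks):
    """Hypothetical single-head sketch of mask-rectified cross-attention.

    q:           (P, d)  queries, one per image token (pixel/patch)
    k:           (T, d)  keys, one per text token
    v:           (T, dv) values, one per text token
    token_masks: (T, P)  boolean; token_masks[t, p] is True when text
                         token t is allowed to act on image token p
    Returns the attended features (P, dv) and attention weights (P, T).
    """
    allowed = np.asarray(token_masks, dtype=bool).T  # (P, T)
    if not allowed.any(axis=1).all():
        # Assumption: the layout assigns every image token to some text token.
        raise ValueError("each image token needs at least one allowed text token")

    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (P, T) raw attention scores

    # Rectification step (assumed form): discard scores outside each
    # token's region so they receive zero weight after the softmax.
    scores = np.where(allowed, scores, -np.inf)

    # Numerically stable softmax over the text-token axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights
```

With a layout in which each image token belongs to exactly one text token, the attention weights become one-hot, so every image token simply receives the value vector of the token assigned to its region. In a real diffusion model the same rectification would presumably be applied per head and per cross-attention layer, with the masks resized to each layer's spatial resolution.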

updated: Sat Mar 25 2023 09:37:41 GMT+0000 (UTC)

published: Sat Mar 25 2023 09:37:41 GMT+0000 (UTC)

arXiv
