Collage Diffusion

Vishnu Sarukkai; Linden Li; Arden Ma; Christopher Ré; Kayvon Fatahalian

Text-conditional diffusion models generate high-quality, diverse images. However, text is often an ambiguous specification for a desired target image, creating the need for additional user-friendly controls for diffusion-based image generation. We focus on precise control over image output for scenes with several objects. Users control image generation by defining a collage: a text prompt paired with an ordered sequence of layers, where each layer is an RGBA image and a corresponding text prompt. We introduce Collage Diffusion, a collage-conditional diffusion algorithm that allows users to control both the spatial arrangement and visual attributes of objects in the scene, and also enables users to edit individual components of generated images. To ensure that different parts of the input text correspond to the various locations specified in the input collage layers, Collage Diffusion modifies text-image cross-attention with the layers' alpha masks. To maintain characteristics of individual collage layers that are not specified in text, Collage Diffusion learns specialized text representations per layer. Collage input also enables layer-based controls that provide fine-grained control over the final output: users can control image harmonization on a layer-by-layer basis, and they can edit individual objects in generated images while keeping other objects fixed. Collage-conditional image generation requires harmonizing the input collage so that objects fit together. The key challenge is to minimize changes in the positions and key visual attributes of objects in the input collage while allowing other attributes of the collage to change during harmonization. By leveraging the rich information present in layer input, Collage Diffusion generates globally harmonized images that maintain desired object locations and visual characteristics better than prior approaches.
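The abstract describes a collage as an ordered sequence of RGBA layers, each with its own text prompt, and says that cross-attention is modified using the layers' alpha masks, but it does not give the exact formulation. The following is a minimal NumPy sketch of one way such an input and such a mask could look, not the authors' implementation. The names `Layer`, `composite`, `masked_cross_attention`, `token_layer`, and `penalty` are hypothetical, and the additive penalty on attention logits is an assumed stand-in for whatever modification the paper actually uses. For simplicity the masks are taken to be at the same resolution as the attention queries.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Layer:
    rgba: np.ndarray  # (H, W, 4) floats in [0, 1]; channel 3 is alpha
    prompt: str       # text describing this layer


def composite(layers):
    """Alpha-composite layers back to front with the standard 'over' operator."""
    h, w, _ = layers[0].rgba.shape
    canvas = np.zeros((h, w, 3))
    for layer in layers:  # earlier layers sit further back
        alpha = layer.rgba[..., 3:4]
        canvas = alpha * layer.rgba[..., :3] + (1.0 - alpha) * canvas
    return canvas


def masked_cross_attention(q, k, v, token_layer, masks, penalty=1e4):
    """Cross-attention in which layer-specific tokens are confined to their layer.

    q: (P, d) one query per pixel; k, v: (T, d) one key/value per text token.
    token_layer[t]: index of the layer whose prompt contains token t,
                    or -1 for tokens of the global prompt (left unrestricted).
    masks: (L, P) flattened alpha masks, one row per layer.
    Assumption: restriction is an additive penalty on the logits.
    """
    logits = q @ k.T / np.sqrt(q.shape[1])  # (P, T)
    for t, layer_idx in enumerate(token_layer):
        if layer_idx >= 0:
            # Penalize attention to token t wherever its layer is transparent.
            logits[:, t] -= penalty * (1.0 - masks[layer_idx])
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

With this sketch, a token from a layer's prompt receives essentially zero attention weight at pixels where that layer's alpha is 0, while global-prompt tokens remain available everywhere. A soft alpha value between 0 and 1 gives a proportionally smaller penalty.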

updated: Wed Mar 01 2023 06:35:42 GMT+0000 (UTC)

published: Wed Mar 01 2023 06:35:42 GMT+0000 (UTC)

arXiv
