SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis

Azade Farshad; Yousef Yeganeh; Yu Chi; Chengzhi Shen; Björn Ommer; Nassir Navab

Text-conditioned image generation has made significant progress in recent years with generative adversarial networks and, more recently, diffusion models. While diffusion models conditioned on text prompts produce impressive, high-quality images, accurately representing complex text prompts, such as those specifying the number of instances of a specific object, remains challenging. To address this limitation, we propose a novel guidance approach for the diffusion model's sampling process that leverages bounding box and segmentation map information at inference time without additional training data. Through a novel loss in the sampling process, our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints, leading to high-resolution images that accurately represent the scene. To obtain bounding box and segmentation map information, we structure the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our proposed model achieves state-of-the-art performance on two public benchmarks for image generation from scene graphs, surpassing both scene-graph-to-image and text-based diffusion models on various metrics. Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process for more accurate text-to-image generation.
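To make the idea of structuring a text prompt as a scene graph concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the `SceneGraph` class, its method names, and the example prompt are all hypothetical, and the CLIP enrichment of nodes and the prediction of bounding boxes and segmentation maps are not shown. The sketch shows only that a graph keeps one node per object instance, so the instance count that a flat text prompt can leave ambiguous becomes explicit.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container: objects are nodes, relations are directed edges."""
    objects: list = field(default_factory=list)  # one label per object instance
    triples: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)

    def add_object(self, label):
        """Add one object instance and return its node index."""
        self.objects.append(label)
        return len(self.objects) - 1

    def relate(self, subj, predicate, obj):
        """Add a directed edge between two existing nodes."""
        self.triples.append((subj, predicate, obj))

    def instance_counts(self):
        """Count how many instances of each object label the scene contains."""
        return Counter(self.objects)

    def describe(self):
        """Render each edge as a readable 'subject predicate object' string."""
        return [f"{self.objects[s]} {p} {self.objects[o]}" for s, p, o in self.triples]


# Assumed example prompt: "two sheep standing on grass, one next to the other"
graph = SceneGraph()
sheep_a = graph.add_object("sheep")
sheep_b = graph.add_object("sheep")
grass = graph.add_object("grass")
graph.relate(sheep_a, "standing on", grass)
graph.relate(sheep_b, "standing on", grass)
graph.relate(sheep_a, "next to", sheep_b)

counts = graph.instance_counts()
relations = graph.describe()
```

In this sketch each node would be the natural place to attach a per-object embedding and a predicted box or mask, which is the kind of per-instance information the abstract says is used to guide sampling.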

updated: Fri Apr 28 2023 00:14:28 GMT+0000 (UTC)

published: Fri Apr 28 2023 00:14:28 GMT+0000 (UTC)

arXiv
