Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training

Ling Yang; Zhilin Huang; Yang Song; Shenda Hong; Guohao Li; Wentao Zhang; Bin Cui; Bernard Ghanem; Ming-Hsuan Yang

Generating images from graph-structured inputs, such as scene graphs, is uniquely challenging due to the difficulty of aligning nodes and connections in graphs with objects and their relations in images. Most existing methods address this challenge by using scene layouts, which are image-like representations of scene graphs designed to capture the coarse structures of scene images. Because scene layouts are manually crafted, the alignment with images may not be fully optimized, causing suboptimal compliance between the generated images and the original scene graphs. To tackle this issue, we propose to learn scene graph embeddings by directly optimizing their alignment with images. Specifically, we pre-train an encoder to extract both global and local information from scene graphs that is predictive of the corresponding images, relying on two loss functions: masked autoencoding loss and contrastive loss. The former trains embeddings by reconstructing randomly masked image regions, while the latter trains embeddings to discriminate between compliant and non-compliant images according to the scene graph. Given these embeddings, we build a latent diffusion model to generate images from scene graphs. The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections. On the Visual Genome and COCO-Stuff datasets, we demonstrate that SGDiff outperforms state-of-the-art methods, as measured by both the Inception Score and Fréchet Inception Distance (FID) metrics. We will release our source code and trained models at https://github.com/YangLing0818/SGDiff.
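
The two pre-training objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the SGDiff implementation: the abstract does not give the exact loss formulas, so the sketch assumes an InfoNCE-style contrastive loss over paired graph and image embeddings and a mean squared error restricted to masked pixels. The function names, the `temperature` parameter, and the array shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def contrastive_loss(graph_emb, image_emb, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Row i of graph_emb is treated as the compliant match of row i of
    image_emb; every other row in the batch acts as a non-compliant image.
    Both arrays have shape (batch, dim).
    """
    # L2-normalize so the dot product is a cosine similarity.
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = g @ v.T / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matched pairs) as the targets.
    return float(-np.mean(np.diag(log_probs)))


def masked_reconstruction_loss(image, reconstruction, mask):
    """Mean squared error over masked positions only (an assumed form).

    image, reconstruction, and mask share the same shape; mask is 1 where
    the region was hidden from the model and 0 elsewhere.
    """
    mask = mask.astype(image.dtype)
    sq_err = (image - reconstruction) ** 2
    # Guard against an all-zero mask to avoid division by zero.
    return float((sq_err * mask).sum() / max(mask.sum(), 1.0))
```

In this sketch, a batch whose graph and image embeddings are correctly paired yields a lower contrastive loss than the same batch with the pairing shuffled, and errors outside the masked region do not contribute to the reconstruction loss. How the two terms are weighted and combined during pre-training is not stated in the abstract and is left open here.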

updated: Mon Nov 21 2022 01:11:19 GMT+0000 (UTC)

published: Mon Nov 21 2022 01:11:19 GMT+0000 (UTC)

arXiv
