Transformer-based Image Generation from Scene Graphs

Renato Sortino; Simone Palazzo; Concetto Spampinato

Graph-structured scene descriptions can be used efficiently in generative models to control the composition of the generated image. Previous approaches combine graph convolutional networks for layout prediction with adversarial methods for image generation. In this work, we show that encoding the graph information with multi-head attention and generating images with a transformer-based model in the latent space improves the quality of the sampled data. Because no adversarial model is needed, training is also more stable. Specifically, the proposed approach is based entirely on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach shows improved image quality with respect to state-of-the-art methods, as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve Inception Scores of 13.7 and 12.8 and FIDs of 52.3 and 60.3 on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im
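
To make the "lower-dimensional space learned by a vector-quantized variational autoencoder" more concrete, the following is a minimal sketch of the quantization step alone: each continuous latent vector is replaced by the index of its nearest codebook entry, and those discrete indices are what a transformer could then model as a token sequence. This is an illustration under stated assumptions, not the authors' implementation. The function names (`quantize`, `lookup`), the tiny hand-written codebook, and the example latents are all hypothetical; in an actual VQ-VAE the codebook is learned and the latents come from a convolutional encoder.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        distances = [squared_distance(z, e) for e in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def lookup(indices: List[int],
           codebook: List[Sequence[float]]) -> List[List[float]]:
    """Recover the quantized vectors from discrete token indices."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy codebook with 4 entries of dimension 2 (assumption:
# a learned codebook would be far larger and higher-dimensional).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs for three spatial positions.
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(latents, codebook)       # discrete token sequence
reconstructed = lookup(tokens, codebook)   # quantized vectors for a decoder
print(tokens)
```

In this sketch the image is represented only by `tokens`, so a sequence model conditioned on the object layout would need to predict integers from a finite vocabulary rather than raw pixels.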

updated: Wed Mar 08 2023 14:54:51 GMT+0000 (UTC)

published: Wed Mar 08 2023 14:54:51 GMT+0000 (UTC)

arXiv
