Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects

Bryan G. Cardenas; Devanshu Arya; Deepak K. Gupta

Recent developments in generative models have made it possible to generate diverse high-fidelity images. In particular, layout-to-image generation models have gained significant attention for their ability to generate realistic, complex images containing distinct objects. These models are generally conditioned on either semantic layouts or textual descriptions. However, unlike for natural images, providing such auxiliary information can be extremely hard in domains such as biomedical imaging and remote sensing. In this work, we propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring their contextual information during the generation process. Based on a vector-quantized variational autoencoder (VQ-VAE) backbone, our model learns to preserve spatial coherency within an image, as well as semantic coherency between the objects and the background, through two powerful autoregressive priors: PixelSNAIL and LayoutPixelSNAIL. While PixelSNAIL learns the distribution of the latent encodings of the VQ-VAE, LayoutPixelSNAIL is used specifically to learn the semantic distribution of the objects. An implicit advantage of our approach is that the generated samples are accompanied by object-level annotations. Through experiments on the Multi-MNIST and CLEVR datasets, we demonstrate how our method preserves coherency and fidelity, thereby outperforming state-of-the-art multi-object generative methods. We further demonstrate the efficacy of our approach on medical imaging datasets, where we show that augmenting the training set with samples generated by our approach improves the performance of existing models.
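As a rough illustration of the VQ-VAE backbone mentioned in the abstract, the sketch below shows the standard nearest-codebook lookup that turns continuous encoder outputs into a grid of discrete indices. A grid like this is the kind of latent encoding an autoregressive prior such as PixelSNAIL would model. This is not the authors' implementation: the abstract gives no code-level details, and the function names (`quantize`, `dequantize`), the toy codebook, and the 2x2 latent grid are all assumptions made for the example.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    latents:  grid (list of rows) of vectors, each a list of floats
    codebook: list of K embedding vectors with the same dimension
    Returns a grid of integer indices with the same height and width.
    (Hypothetical helper for illustration only.)
    """
    def sq_dist(u, v):
        # Squared Euclidean distance between two vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
         for z in row]
        for row in latents
    ]


def dequantize(indices, codebook):
    """Replace each index with a copy of its codebook vector (decoder input)."""
    return [[list(codebook[k]) for k in row] for row in indices]


# Toy data (assumed): a 4-entry codebook and a 2x2 grid of 2-D encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [
    [[0.1, -0.2], [0.9, 0.2]],
    [[0.2, 0.8], [1.1, 0.9]],
]

indices = quantize(latents, codebook)       # discrete latent grid
quantized = dequantize(indices, codebook)   # vectors passed on to a decoder
print(indices)
```

In a trained model the codebook is learned jointly with the encoder and decoder, and the prior is fitted to the index grids afterwards; the sketch covers only the lookup step.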

updated: Thu Jul 15 2021 21:42:29 GMT+0000 (UTC)

published: Mon Jun 22 2020 11:33:55 GMT+0000 (UTC)

arXiv
