Context-Aware Layout to Image Generation with Enhanced Object Appearance

Sen He; Wentong Liao; Michael Ying Yang; Yongxin Yang; Yi-Zhe Song; Bodo Rosenhahn; Tao Xiang

A layout-to-image (L2I) generation model aims to generate a complex image containing multiple objects (things) against a natural background (stuff), conditioned on a given layout. Building on recent advances in generative adversarial networks (GANs), existing L2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) the object-to-object and object-to-stuff relations are often broken, and (2) each object's appearance is typically distorted, lacking the key defining characteristics associated with the object class. We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and of location-sensitive appearance representation in their discriminators. To address these limitations, two new modules are proposed in this work. First, a context-aware feature transformation module is introduced in the generator to ensure that the generated feature encoding of either an object or stuff is aware of the other co-existing objects and stuff in the scene. Second, instead of feeding location-insensitive image features to the discriminator, we use the Gram matrix computed from the feature maps of the generated object images to preserve location-sensitive information, resulting in much enhanced object appearance. Extensive experiments show that the proposed method achieves state-of-the-art performance on the COCO-Thing-Stuff and Visual Genome benchmarks.
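As a rough illustration of the second module's input, the sketch below computes a Gram matrix from a small feature map. The abstract does not give the exact formulation, so the channel-correlation definition used here (inner products between flattened channels, divided by the number of spatial positions) is an assumption, and the function name `gram_matrix` and the nested-list layout are hypothetical choices for this example, not the authors' implementation.

```python
def gram_matrix(features):
    """Channel-wise Gram matrix of one feature map (assumed definition).

    features: list of C channels, each a list of H rows of W floats.
    Returns a C x C nested list G where
    G[i][j] = sum over positions p of F[i][p] * F[j][p], divided by H * W.
    """
    # Flatten each channel from H x W into a single vector of length H * W.
    flat = [[value for row in channel for value in row] for channel in features]
    positions = len(flat[0])
    # Inner product between every pair of flattened channels, normalised
    # by the number of spatial positions (normalisation is an assumption).
    return [
        [sum(a * b for a, b in zip(fi, fj)) / positions for fj in flat]
        for fi in flat
    ]


# Toy feature map: 2 channels, each 2 x 2.
toy_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_gram = gram_matrix(toy_features)
print(toy_gram)  # [[0.5, 0.5], [0.5, 1.0]]
```

The result is symmetric and has one row and column per channel, so its size depends on the channel count rather than on the spatial resolution of the object crop. How the paper feeds this matrix to the discriminator is not stated in the abstract.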

updated: Mon Mar 22 2021 14:43:25 GMT+0000 (UTC)

published: Mon Mar 22 2021 14:43:25 GMT+0000 (UTC)

arXiv

