MOC-GAN: Mixing Objects and Captions to Generate Realistic Images

Tao Ma; Yikang Li

Generating images from conditional descriptions has attracted increasing interest in recent years. However, existing conditional inputs suffer from either an unstructured form (captions) or limited information and expensive labeling (scene graphs). For a target scene, the core items, the objects, are usually definite, while their interactions are flexible and hard to define clearly. We therefore introduce a more rational setting: generating a realistic image from objects and captions. Under this setting, the objects explicitly define the critical roles in the target image, and the captions implicitly describe their rich attributes and connections. Correspondingly, we propose MOC-GAN, which mixes the inputs of the two modalities to generate realistic images. It first infers the implicit relations between object pairs from the captions to build a hidden-state scene graph. This yields a multi-layer representation containing objects, relations, and captions, in which the scene graph provides the structure of the scene and the caption provides image-level guidance. A cascaded attentive generative network then generates phrase patches in a coarse-to-fine manner by attending to the most relevant words in the caption. In addition, a phrase-wise DAMSM is proposed to better supervise fine-grained phrase-patch consistency. On the COCO dataset, our method outperforms state-of-the-art methods on both Inception Score and FID while maintaining high visual quality. Extensive experiments demonstrate the unique features of the proposed method.
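The abstract mentions attending to the most relevant caption words for each object pair. As a rough illustration of that idea only, the sketch below applies plain dot-product attention with a softmax over a toy caption. It is not the MOC-GAN architecture: the function names (`softmax`, `attend`), the two-dimensional embeddings, and the `pair_query` vector are all hypothetical and chosen for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, word_vectors):
    """Dot-product attention of one query over caption word vectors.

    Returns (weights, context): one weight per word, and the
    weight-averaged word vector. Hypothetical helper, not from the paper.
    """
    scores = [sum(q * w for q, w in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return weights, context


# Toy, made-up embeddings: axis 0 ~ "is an object", axis 1 ~ "is a relation".
caption = ["man", "riding", "horse"]
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# Assumed query feature for the object pair (man, horse), looking for a relation.
pair_query = [0.0, 2.0]

weights, context = attend(pair_query, word_vectors)
most_relevant = caption[weights.index(max(weights))]
print(most_relevant, [round(w, 3) for w in weights])
```

With these toy vectors the query aligns with the relation axis, so the word "riding" receives the largest weight. A learned model would replace the hand-written vectors with trained text and object-pair encoders.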

updated: Sun Jun 06 2021 14:04:07 GMT+0000 (UTC)

published: Sun Jun 06 2021 14:04:07 GMT+0000 (UTC)

arXiv
