M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers

Zhu Zhang; Jianxin Ma; Chang Zhou; Rui Men; Zhikang Li; Ming Ding; Jie Tang; Jingren Zhou; Hongxia Yang

Conditional image synthesis aims to create an image according to multi-modal guidance in the form of textual descriptions, reference images, image blocks to preserve, or any combination of these. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, M6-UFC, to unify any number of multi-modal controls. In M6-UFC, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be processed by a Transformer. Unlike existing two-stage autoregressive approaches such as DALL-E and VQGAN, M6-UFC adopts non-autoregressive generation (NAR) at the second stage, which enhances the holistic consistency of the synthesized image, supports preserving specified image blocks, and improves the synthesis speed. Further, we design a progressive algorithm that iteratively improves the non-autoregressively generated image with the help of two estimators, one evaluating compliance with the controls and the other evaluating the fidelity of the synthesized image. Extensive experiments on a newly collected large-scale clothing dataset, M2C-Fashion, and a facial dataset, Multi-Modal CelebA-HQ, verify that M6-UFC can synthesize high-fidelity images that comply with flexible multi-modal controls.
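The abstract gives no implementation details for the progressive non-autoregressive step, so the following is only a minimal sketch of a generic mask-and-repredict loop over discrete tokens, not the authors' algorithm. Every name in it (MASK, toy_predict, progressive_generate) is hypothetical, the Transformer is replaced by a random stand-in, and a random confidence value stands in for whatever signal the two estimators provide. What it does illustrate is how preserved image-block tokens can stay fixed while all other positions are filled in parallel and refined over a few steps.

```python
import random

MASK = -1  # hypothetical placeholder id for a token that is not generated yet


def toy_predict(tokens, rng, vocab_size=16):
    """Stand-in for the Transformer: one (token id, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def progressive_generate(length, preserved, steps=4, seed=0):
    """Fill a token sequence over several steps without touching preserved positions.

    preserved: dict mapping position -> token id for image blocks to keep.
    """
    rng = random.Random(seed)
    tokens = [preserved.get(i, MASK) for i in range(length)]
    free = [i for i in range(length) if i not in preserved]
    confidence = {i: 0.0 for i in free}
    for step in range(1, steps + 1):
        proposals = toy_predict(tokens, rng)
        # Fill every currently masked position in parallel (the non-autoregressive part).
        for i in free:
            if tokens[i] == MASK:
                tokens[i], confidence[i] = proposals[i]
        # Re-mask the least confident free positions: fewer each step, none at the last.
        n_mask = len(free) * (steps - step) // steps
        for i in sorted(free, key=lambda j: confidence[j])[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    # Positions 0 and 5 play the role of image blocks that must be preserved.
    print(progressive_generate(8, {0: 3, 5: 7}))
```

Because the number of re-masked positions shrinks to zero on the final step, the returned sequence contains no placeholder tokens, and the preserved positions are never rewritten since they are excluded from the set of free positions.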

updated: Sat Feb 19 2022 17:12:14 GMT+0000 (UTC)

published: Sat May 29 2021 04:42:07 GMT+0000 (UTC)

arXiv
