UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis

Zhu Zhang; Jianxin Ma; Chang Zhou; Rui Men; Zhikang Li; Ming Ding; Jie Tang; Jingren Zhou; Hongxia Yang

Conditional image synthesis aims to create an image according to multi-modal guidance in the form of textual descriptions, reference images, image blocks to preserve, or any combination of these. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls. In UFC-BERT, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be processed by a Transformer. Unlike existing two-stage autoregressive approaches such as DALL-E and VQGAN, UFC-BERT adopts non-autoregressive generation (NAR) at the second stage to enhance the holistic consistency of the synthesized image, to support preserving specified image blocks, and to improve the synthesis speed. Further, we design a progressive algorithm that iteratively improves the non-autoregressively generated image with the help of two estimators, one evaluating compliance with the controls and the other evaluating the fidelity of the synthesized image. Extensive experiments on a newly collected large-scale clothing dataset, M2C-Fashion, and a facial dataset, Multi-Modal CelebA-HQ, verify that UFC-BERT can synthesize high-fidelity images that comply with flexible multi-modal controls.
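As a rough illustration of the idea in the abstract above (iterative non-autoregressive decoding over discrete image tokens while keeping specified blocks fixed), the following is a minimal toy sketch, not the paper's algorithm. It assumes a mask-predict style loop in which low-confidence tokens are re-masked and re-predicted; the paper's two estimators (control compliance and fidelity) are not modeled, and a random stand-in replaces the Transformer. All names here (`MASK`, `toy_predictor`, `progressive_nar_decode`) are hypothetical.

```python
import random

MASK = -1  # hypothetical id marking an image token that is still undecided


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the second-stage Transformer (assumption, not the real model).

    Returns one (token_id, confidence) pair per position, chosen at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def progressive_nar_decode(length, preserved, vocab_size=16, steps=4, seed=0):
    """Decode `length` tokens in parallel, refining them over `steps` rounds.

    `preserved` maps position -> token id for image blocks that must be kept.
    """
    rng = random.Random(seed)
    # Start fully masked, except for the positions the user asked to preserve.
    tokens = [preserved.get(i, MASK) for i in range(length)]
    free = [i for i in range(length) if i not in preserved]
    confidence = {i: 0.0 for i in free}

    for step in range(1, steps + 1):
        predictions = toy_predictor(tokens, vocab_size, rng)
        # Fill every masked position at once (the non-autoregressive part).
        for i in free:
            if tokens[i] == MASK:
                tokens[i], confidence[i] = predictions[i]
        # Re-mask the least confident free positions; the count shrinks to
        # zero on the last step, so the final sequence has no masks left.
        n_mask = len(free) * (steps - step) // steps
        for i in sorted(free, key=lambda pos: confidence[pos])[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    # Positions 0 and 5 play the role of preserved image blocks.
    print(progressive_nar_decode(length=8, preserved={0: 3, 5: 7}))
```

Because preserved positions are never placed in the re-masking pool, they pass through every round unchanged, which is one simple way a parallel decoder can honor "image blocks to preserve". In a real system the confidence score would come from the model or from learned estimators rather than from random numbers.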

updated: Sat May 29 2021 04:42:07 GMT+0000 (UTC)

published: Sat May 29 2021 04:42:07 GMT+0000 (UTC)

arXiv
