Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li; Jianwei Zhang; Qin Lin; Jiangfeng Xiong; Yanxin Long; Xinchi Deng; Yingfang Zhang; Xingchao Liu; Minbin Huang; Zedong Xiao; Dayou Chen; Jiajun He; Jiahao Li; Wenyue Li; Chen Zhang; Rongwei Quan; Jianxiang Lu; Jiabin Huang; Xiaoyan Yuan; Xiaoxiao Zheng; Yixuan Li; Jihong Zhang; Chao Zhang; Meng Chen; Jie Liu; Zheng Fang; Weiyan Wang; Jinbao Xue; Yangyu Tao; Jianchen Zhu; Kai Liu; Sihuan Lin; Yifu Sun; Yun Li; Dongdong Wang; Mingtao Chen; Zhichao Hu; Xiao Xiao; Yan Chen; Yuhong Liu; Wei Liu; Di Wang; Yong Yang; Jie Jiang; Qinglin Lu

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build an entire data pipeline from scratch to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the image captions. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. In our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT.

updated: Tue May 14 2024 16:33:25 GMT+0000 (UTC)

published: Tue May 14 2024 16:33:25 GMT+0000 (UTC)

arXiv
