AltDiffusion: A Multilingual Text-to-Image Diffusion Model

Fulong Ye; Guang Liu; Xinya Wu; Ledell Wu

Large text-to-image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works support only a limited set of input languages, e.g., English, Chinese, and Japanese, leaving users of other languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance its multilingual capability, comprising a concept alignment stage and a quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes the Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while retaining comparable capability for generating high-quality images.
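The abstract names knowledge distillation as the way the multilingual text encoder is trained but does not state the objective. As a rough illustration only, the toy sketch below assumes a mean-squared-error objective in which a student is fitted to reproduce the embeddings of a frozen teacher on parallel captions. The feature vectors, target embeddings, the linear student, and the helper names (`matvec`, `mse`, `mean_loss`, `distill`) are all hypothetical and are not taken from the paper.

```python
import random


def matvec(w, x):
    """Apply the student's linear projection W to a feature vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def mean_loss(w, pairs):
    """Average distillation loss of the student over all parallel pairs."""
    return sum(mse(matvec(w, x), t) for x, t in pairs) / len(pairs)


def distill(pairs, dim_in, dim_out, lr=0.1, epochs=200, seed=0):
    """Fit a linear student so that W @ x matches the frozen teacher embedding.

    Assumption: plain SGD on an MSE loss; the paper's actual objective,
    architecture, and optimizer are not given in the abstract.
    """
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]
    for _ in range(epochs):
        for x, target in pairs:
            pred = matvec(w, x)
            for i in range(dim_out):
                # Gradient of the MSE with respect to output unit i.
                grad_i = 2.0 * (pred[i] - target[i]) / dim_out
                for j in range(dim_in):
                    w[i][j] -= lr * grad_i * x[j]
    return w


# Hypothetical toy data: each pair is (student-side features of a
# non-English caption, frozen teacher embedding of the parallel English caption).
pairs = [
    ([1.0, 0.0, 0.0], [0.9, -0.2]),
    ([0.0, 1.0, 0.0], [-0.4, 0.7]),
    ([0.0, 0.0, 1.0], [0.3, 0.3]),
]

untrained = distill(pairs, dim_in=3, dim_out=2, epochs=0)
student = distill(pairs, dim_in=3, dim_out=2)
print("loss before:", mean_loss(untrained, pairs))
print("loss after: ", mean_loss(student, pairs))
```

The teacher never changes here, which mirrors the general idea of distillation: only the student is updated, so its output space stays compatible with whatever was trained against the teacher's embeddings.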

updated: Sat Aug 19 2023 11:52:12 GMT+0000 (UTC)

published: Sat Aug 19 2023 11:52:12 GMT+0000 (UTC)

arXiv
