Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

Qingping Zheng; Yuanfan Guo; Jiankang Deng; Jianhua Han; Ying Li; Songcen Xu; Hang Xu

Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue stems primarily from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, training directly on images of unlimited sizes is infeasible, as it would require an immense number of text-image pairs and entail substantial computational expense. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size while minimizing the need for high-memory GPU resources. Specifically, the first stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of aspect ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the second stage. This method rapidly enlarges the ASD output to any high-resolution size while avoiding seam artifacts and memory overload. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, halving inference time compared with the traditional tiled algorithm.
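As a rough illustration of why tiling keeps memory bounded at high resolutions, the sketch below splits a canvas of arbitrary size into non-overlapping tiles of a fixed maximum size, so that each tile could be processed independently. This is not the authors' FSTD implementation, which the abstract does not describe in detail; the function name `tile_boxes` and the 512-pixel tile size are assumptions made purely for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Split a width x height canvas into non-overlapping tiles.

    Hypothetical helper for illustration only. Tiles are at most
    tile x tile pixels; tiles on the right and bottom edges are
    clipped so the boxes cover the canvas exactly once.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")

    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # A 1024 x 768 canvas with an assumed 512-pixel tile size.
    for box in tile_boxes(1024, 768, 512):
        print(box)
```

Because the peak working set depends on the tile size and not on the full canvas, memory use stays roughly constant as the target resolution grows. Avoiding visible seams between such tiles is the part the paper attributes to FSTD, and it is not modeled here.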

updated: Mon Sep 11 2023 07:44:49 GMT+0000 (UTC)

published: Thu Aug 31 2023 09:27:56 GMT+0000 (UTC)

arXiv

