Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models

Nikita Starodubcev; Dmitry Baranchuk; Valentin Khrulkov; Artem Babenko

Recent advances in diffusion models have enabled many powerful tools for image editing. One of these is text-driven image manipulation: editing the semantic attributes of an image according to a provided text description. Existing diffusion-based methods already achieve high-quality image manipulation for a broad range of text prompts. In practice, however, these methods incur high computational costs even on a high-end GPU. This greatly limits the potential real-world applications of diffusion-based image editing, especially on user devices. In this paper, we address the efficiency of recent text-driven editing methods based on unconditional diffusion models and develop a novel algorithm that learns image manipulations 4.5-10 times faster and applies them 8 times faster. We carefully evaluate the visual quality and expressiveness of our approach on multiple datasets using human annotators. Our experiments demonstrate that our algorithm matches the quality of much more expensive methods. Finally, we show that our approach can adapt a pretrained model to a user-specified image and text description on the fly in just 4 seconds. In this setting, we observe that more compact unconditional diffusion models can be considered a reasonable alternative to the popular text-conditional counterparts.

updated: Mon Apr 10 2023 01:21:56 GMT+0000 (UTC)

published: Mon Apr 10 2023 01:21:56 GMT+0000 (UTC)

arXiv
