FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion

Martin Pernuš; Clinton Fookes; Vitomir Štruc; Simon Dobrišek

Fashion-image editing is a challenging computer vision task whose goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques, for example: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are either heavily constrained by their requirement for training sets with rich attribute annotations or able to handle only simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model, called FICE (Fashion Image CLIP Editing), capable of handling a wide variety of diverse text descriptions to guide the editing procedure. Specifically, with FICE we augment the common GAN inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the capabilities of the CLIP model to enforce the semantics, due to its impressive image-text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art text-conditioned image editing approaches. Experimental results demonstrate that FICE generates highly realistic fashion images and leads to stronger editing performance than existing competing approaches.
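
The general idea of optimization-based GAN inversion with added constraints can be illustrated with a toy sketch. This is not the FICE implementation: the linear map `A` stands in for a GAN generator, the linear map `E` with cosine similarity stands in for CLIP, the pose-related term is omitted, and the simple squared-distance latent penalty, the loss weights `lam_img` and `lam_reg`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def total_loss(w, A, E, t, image, keep_mask, w_mean, lam_img, lam_reg):
    """Semantic + image-level + latent-regularization loss (toy versions)."""
    x = A @ w                                   # "generated image"
    e = E @ x                                   # "image embedding" (CLIP stand-in)
    cos = float(e @ t) / (np.linalg.norm(e) + 1e-12)
    sem = 1.0 - cos                             # pull the image toward the text
    img = np.mean((keep_mask * (x - image)) ** 2)  # preserve the masked region
    reg = np.sum((w - w_mean) ** 2)             # stay close to the mean latent
    return sem + lam_img * img + lam_reg * reg

def total_grad(w, A, E, t, image, keep_mask, w_mean, lam_img, lam_reg):
    """Analytic gradient of total_loss with respect to the latent code w."""
    x = A @ w
    e = E @ x
    n = np.linalg.norm(e) + 1e-12
    cos = float(e @ t) / n
    g_e = -(t - cos * e / n) / n                # gradient of (1 - cos) w.r.t. e
    g_x = E.T @ g_e + lam_img * 2.0 * keep_mask**2 * (x - image) / x.size
    return A.T @ g_x + lam_reg * 2.0 * (w - w_mean)

def edit_latent(A, E, text_emb, image, keep_mask, w_init, w_mean,
                lam_img=1.0, lam_reg=0.1, lr=0.1, steps=200):
    """Gradient descent on w; a step is kept only if it lowers the loss."""
    t = text_emb / np.linalg.norm(text_emb)
    args = (A, E, t, image, keep_mask, w_mean, lam_img, lam_reg)
    w = w_init.copy()
    losses = [total_loss(w, *args)]
    for _ in range(steps):
        cand = w - lr * total_grad(w, *args)
        cand_loss = total_loss(cand, *args)
        if cand_loss < losses[-1]:
            w = cand
            losses.append(cand_loss)
        else:
            lr *= 0.5                           # step too large: shrink and retry
    return w, losses

# Hypothetical toy problem: 4-D latent, 16-pixel "image", 8-D embedding.
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 4))
E = rng.normal(size=(8, 16))
image = A @ rng.normal(size=4)                  # input image to be edited
text_emb = rng.normal(size=8)                   # embedding of the target text
keep_mask = np.array([1.0] * 8 + [0.0] * 8)     # first half must be preserved
w_mean = np.zeros(4)
w_init = rng.normal(size=4)

w_edit, losses = edit_latent(A, E, text_emb, image, keep_mask, w_init, w_mean)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this sketch, raising `lam_reg` keeps the edited code closer to `w_mean`, which is one simple way a latent penalty can trade edit strength against fidelity; the regularizer proposed in the paper may take a different form.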

updated: Thu Jan 05 2023 15:33:23 GMT+0000 (UTC)

published: Thu Jan 05 2023 15:33:23 GMT+0000 (UTC)

arXiv
