Entity-Level Text-Guided Image Manipulation

Yikai Wang; Jianan Wang; Guansong Lu; Hang Xu; Zhenguo Li; Wei Zhang; Yanwei Fu

Existing text-guided image manipulation methods aim to modify the appearance of the image or to edit a few objects in virtual or simple scenarios, which is far from practical application. In this work, we study a novel task: text-guided image manipulation at the entity level in the real world (eL-TGIM). The task imposes three basic requirements: (1) to edit the entity consistently with the text description, (2) to preserve the entity-irrelevant regions, and (3) to merge the manipulated entity into the image naturally. To this end, we propose an elegant framework, dubbed SeMani, for the Semantic Manipulation of real-world images, which can not only edit the appearance of entities but also generate new entities corresponding to the text guidance. To solve eL-TGIM, SeMani decomposes the task into two phases: the semantic alignment phase and the image manipulation phase. In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated. In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and the target text description. We discuss two popular generation processes that can be used in SeMani: discrete auto-regressive generation with transformers and continuous denoising generation with diffusion models, yielding SeMani-Trans and SeMani-Diff, respectively. We conduct extensive experiments on the real-world CUB, Oxford, and COCO datasets to verify that SeMani can distinguish the entity-relevant and entity-irrelevant regions and achieve more precise and flexible manipulation in a zero-shot manner compared with baseline methods. Our code and models will be released at https://github.com/Yikai-Wang/SeMani.
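The two-phase decomposition described in the abstract can be illustrated with a toy sketch. This is not the SeMani implementation: the function names `locate_entity_mask` and `manipulate`, the per-cell relevance scores, the threshold, and the "generated" grid are all hypothetical stand-ins. In the actual framework, a semantic alignment module produces the region and a transformer or diffusion model synthesizes content conditioned on the preserved regions and the text. Here the second phase is reduced to plain mask compositing, purely to show how requirement (2), preserving entity-irrelevant regions, constrains the output.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[int]]


def locate_entity_mask(relevance: Grid, threshold: float = 0.5) -> Mask:
    """Phase 1 stand-in: mark cells whose (hypothetical) text-relevance
    score exceeds the threshold as entity-relevant (1), else 0."""
    return [[1 if score > threshold else 0 for score in row] for row in relevance]


def manipulate(image: Grid, generated: Grid, mask: Mask) -> Grid:
    """Phase 2 stand-in: use generated content inside the mask and keep
    the original image everywhere else."""
    return [
        [g if m else o for o, g, m in zip(img_row, gen_row, mask_row)]
        for img_row, gen_row, mask_row in zip(image, generated, mask)
    ]


# Toy 2x2 "image"; all values are made up for illustration.
image = [[0.1, 0.2], [0.3, 0.4]]
relevance = [[0.9, 0.1], [0.2, 0.8]]   # assumed output of an alignment step
generated = [[1.0, 1.0], [1.0, 1.0]]   # assumed output of a generative model

mask = locate_entity_mask(relevance)
edited = manipulate(image, generated, mask)
```

Cells outside the mask are copied unchanged from `image`, so only the region judged relevant to the text is altered. A real system would additionally need the generated content to blend naturally with its surroundings (requirement 3), which simple pasting does not guarantee.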

updated: Wed Feb 22 2023 13:56:23 GMT+0000 (UTC)

published: Wed Feb 22 2023 13:56:23 GMT+0000 (UTC)

arXiv
