I2Edit: Towards Multi-turn Interactive Image Editing via Dialogue

Xing Cui; Zekun Li; Peipei Li; Yibo Hu; Hailin Shi; Zhaofeng He

Although there has been considerable research on controllable facial image editing, the desirable interactive setting, in which users interact with the system to adjust their requirements dynamically, has not been well explored. This paper focuses on facial image editing via dialogue and introduces a new benchmark dataset, Multi-turn Interactive Image Editing (I2Edit), for evaluating image editing quality and interaction ability in real-world interactive facial editing scenarios. The dataset is built on the CelebA-HQ dataset, with each image annotated with a multi-turn dialogue that corresponds to the user's editing requirements. I2Edit is challenging because a system must 1) track the dynamically updated user requirements and edit the images accordingly, and 2) generate appropriate natural language responses to communicate with the user. To address these challenges, we propose a framework consisting of a dialogue module and an image editing module. The former tracks the user's edit requirements and generates the corresponding indicative responses, while the latter edits the images conditioned on the tracked requirements. In contrast to previous works that simply treat multi-turn interaction as a sequence of single-turn interactions, we extract the user edit requirements from the whole dialogue history instead of from the current turn alone. The extracted global user edit requirements enable us to edit the raw input image directly, which avoids error accumulation and attribute forgetting. Extensive quantitative and qualitative experiments on the I2Edit dataset demonstrate the advantage of our proposed framework over previous single-turn methods. We believe our new dataset could serve as a valuable resource to push forward the exploration of real-world, complex interactive image editing. Code and data will be made public.
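The idea of tracking requirements over the whole dialogue and then editing the raw image once can be illustrated with a minimal sketch. It is not the authors' implementation: the `EditState` class, the attribute names, the integer "degree" values, the use of `None` to withdraw a request, and the placeholder `edit_from_raw` function are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EditState:
    """Global edit requirements accumulated over a dialogue (hypothetical structure)."""
    attributes: Dict[str, int] = field(default_factory=dict)

    def update(self, turn_request: Dict[str, Optional[int]]) -> None:
        # Merge one turn's requests into the global state.
        for name, value in turn_request.items():
            if value is None:
                # Assumed convention: None means the user withdrew this request.
                self.attributes.pop(name, None)
            else:
                # A later turn overrides an earlier value for the same attribute.
                self.attributes[name] = value


def track_requirements(history: List[Dict[str, Optional[int]]]) -> Dict[str, int]:
    """Derive the global requirements from the whole dialogue history."""
    state = EditState()
    for turn in history:
        state.update(turn)
    return state.attributes


def edit_from_raw(raw_image: str, requirements: Dict[str, int]) -> dict:
    """Placeholder for an editing model: records what would be applied to the raw image."""
    return {"source": raw_image, "applied": dict(requirements)}


# Three turns: add a smile, add bangs, then strengthen the smile and drop the bangs.
history = [{"smile": 2}, {"bangs": 1}, {"smile": 4, "bangs": None}]
requirements = track_requirements(history)
result = edit_from_raw("raw.png", requirements)
```

Because every turn re-derives the full requirement set and applies it to the original image, no edit is stacked on top of a previously edited output, which is the property the abstract credits with avoiding error accumulation and attribute forgetting.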

updated: Thu Mar 23 2023 08:32:29 GMT+0000 (UTC)

published: Mon Mar 20 2023 13:45:58 GMT+0000 (UTC)

arXiv
