Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

Su Wang; Chitwan Saharia; Ceslee Montgomery; Jordi Pont-Tuset; Shai Noy; Stefano Pellegrini; Yasumasa Onoe; Sarah Laszlo; David J. Fleet; Radu Soricut; Jason Baldridge; Mohammad Norouzi; Peter Anderson; William Chan

Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompts while remaining consistent with the input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment, such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion. As a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
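The abstract states that object detectors propose inpainting masks during training but gives no implementation details. The following is only a minimal sketch of that idea, not the authors' method. It assumes bounding boxes have already been produced by some detector, and the names `box_to_mask`, `random_box`, and `propose_mask`, as well as the random-box fallback when nothing is detected, are hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) in pixels, half-open: x1 and y1 are exclusive.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Return a height x width binary mask: 1 marks pixels to inpaint."""
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def random_box(height: int, width: int, rng: random.Random) -> Box:
    """Sample a non-empty box inside the image (assumed fallback)."""
    x0 = rng.randrange(width)
    y0 = rng.randrange(height)
    x1 = rng.randrange(x0 + 1, width + 1)
    y1 = rng.randrange(y0 + 1, height + 1)
    return (x0, y0, x1, y1)


def propose_mask(
    detected_boxes: List[Box],
    height: int,
    width: int,
    rng: Optional[random.Random] = None,
) -> List[List[int]]:
    """Pick one detected object box as the inpainting mask.

    Assumption: if the detector found nothing, fall back to a random box.
    """
    if rng is None:
        rng = random.Random()
    if detected_boxes:
        box = rng.choice(detected_boxes)
    else:
        box = random_box(height, width, rng)
    return box_to_mask(box, height, width)


# Usage: one detected 2x2 object in a 4x4 image.
mask = propose_mask([(1, 1, 3, 3)], height=4, width=4, rng=random.Random(0))
```

Masking a whole detected object, rather than an arbitrary region, means the masked area is more likely to correspond to something named in the text prompt, which is the intuition the abstract gives for the improved text-image alignment.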

updated: Wed Apr 12 2023 22:42:08 GMT+0000 (UTC)

published: Tue Dec 13 2022 21:25:11 GMT+0000 (UTC)

arXiv
