ReVersion: Diffusion-Based Relation Inversion from Images

Ziqi Huang; Tianxing Wu; Yuming Jiang; Kelvin C. K. Chan; Ziwei Liu

Diffusion models have gained increasing popularity for their generative capabilities. Recently, there has been a surging need to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar of the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior": real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Building on this insight, we propose a novel relation-steering contrastive learning scheme to impose two critical properties on the relation prompt: 1) the relation prompt should capture the interaction between objects, enforced by the preposition prior; 2) the relation prompt should be disentangled from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
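To make the relation-steering idea more concrete, the following is a minimal sketch of one plausible reading of it: an InfoNCE-style contrastive loss that pulls a relation-prompt embedding toward preposition embeddings (positives) and pushes it away from embeddings of other words such as nouns describing object appearance (negatives). The abstract does not give the loss in this form, so the function names `cosine` and `steering_loss`, the `temperature` parameter, and the toy 2-D vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steering_loss(relation, positives, negatives, temperature=0.07):
    """InfoNCE-style loss (hypothetical form, assumed for illustration).

    relation:  embedding of the learnable relation prompt
    positives: embeddings of basis prepositions (the "preposition prior")
    negatives: embeddings of other words, e.g. appearance-related nouns
    """
    pos = [math.exp(cosine(relation, p) / temperature) for p in positives]
    neg = [math.exp(cosine(relation, n) / temperature) for n in negatives]
    # Low when the relation embedding sits near prepositions, far from negatives.
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))


# Toy 2-D embeddings (made up): prepositions cluster along the first axis.
prepositions = [[0.9, 0.1], [0.8, 0.2]]
others = [[0.0, 1.0], [0.1, 0.9]]

aligned = steering_loss([1.0, 0.0], prepositions, others)
misaligned = steering_loss([0.0, 1.0], prepositions, others)
print(aligned < misaligned)
```

In this toy setup, a relation embedding that lies close to the preposition cluster receives a much lower loss than one that lies close to the negative words, which is the behavior a steering objective of this kind is meant to encourage.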

updated: Thu Mar 23 2023 17:56:10 GMT+0000 (UTC)

published: Thu Mar 23 2023 17:56:10 GMT+0000 (UTC)

arXiv
