Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

Jaskirat Singh; Liang Zheng

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. Despite this progress, as the complexity of the text input increases, state-of-the-art diffusion models may still fail to generate images that accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments often go undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach to both the evaluation and the improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which, given a complex prompt, decomposes it into a set of disjoint assertions. The alignment of each assertion with the generated images is then measured using a VQA model. Finally, the alignment scores for the different assertions are combined a posteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings than traditional CLIP and BLIP scores. We also find that the assertion-level alignment scores provide useful feedback, which can be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses the previous state of the art by 8.7% in overall text-to-image alignment accuracy. The project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine
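The decompose, score, and combine steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only says the per-assertion scores are combined a posteriori, so the plain average used here is an assumption, and `decompositional_score`, `toy_vqa`, and the set-based stand-in for an image are hypothetical names introduced for illustration. A real setup would replace `toy_vqa` with a call to an actual VQA model.

```python
from typing import Callable, Dict, List, Tuple


def decompositional_score(
    image: object,
    assertions: List[str],
    vqa_score: Callable[[object, str], float],
) -> Tuple[Dict[str, float], float, str]:
    """Score an image against each assertion, then combine the scores.

    Returns (per-assertion scores, overall score, weakest assertion).
    The overall score is a plain mean, which is an assumption of this sketch.
    """
    if not assertions:
        raise ValueError("need at least one assertion")
    # One VQA query per assertion, e.g. "Is it true that <assertion>?"
    per_assertion = {a: vqa_score(image, a) for a in assertions}
    overall = sum(per_assertion.values()) / len(per_assertion)
    # The lowest-scoring assertion is the natural target for refinement.
    weakest = min(per_assertion, key=per_assertion.get)
    return per_assertion, overall, weakest


def toy_vqa(image: object, assertion: str) -> float:
    """Hypothetical stand-in for a VQA model: the 'image' is just a set
    of assertions that are known to hold in it."""
    return 1.0 if assertion in image else 0.0


fake_image = {"there is a dog", "the dog is on a beach"}
assertions = [
    "there is a dog",
    "the dog is on a beach",
    "the dog wears a red hat",
]
scores, overall, weakest = decompositional_score(fake_image, assertions, toy_vqa)
print(scores, overall, weakest)
```

In an iterative refinement loop of the kind the abstract describes, the weakest assertion returned here would be the one whose expression is increased in the next generation round; how that is done is not specified in the abstract and is left out of the sketch.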

updated: Wed Dec 06 2023 00:45:08 GMT+0000 (UTC)

published: Mon Jul 10 2023 17:54:57 GMT+0000 (UTC)

arXiv
