TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

Yushi Hu; Benlin Liu; Jungo Kasai; Yizhong Wang; Mari Ostendorf; Ranjay Krishna; Noah A. Smith

Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is a reference-free metric that allows for fine-grained and interpretable evaluations of generated images. TIFA also has better correlations with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current text-to-image models, despite doing well on color and material, still struggle in counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure the research progress in text-to-image synthesis and provide valuable insights for further research.
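The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the faithfulness score is the fraction of generated questions that the VQA model answers correctly, and that answers are compared by case-insensitive exact match. The names `QAPair`, `faithfulness_score`, `per_category_accuracy`, and `toy_vqa` are hypothetical, and the toy "image" is a plain dictionary standing in for a real image and a real VQA model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence


@dataclass(frozen=True)
class QAPair:
    """One question-answer pair derived from the text input (hypothetical type)."""
    question: str
    answer: str
    category: str  # e.g. "object", "counting", "color"


# Any callable mapping (image, question) -> answer string can act as the VQA model.
VQAModel = Callable[[Any, str], str]


def _normalize(text: str) -> str:
    # Assumption: answers match if equal after trimming and lowercasing.
    return text.strip().lower()


def faithfulness_score(image: Any, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> float:
    """Fraction of questions the VQA model answers as expected for this image."""
    if not qa_pairs:
        raise ValueError("need at least one question-answer pair")
    correct = sum(
        _normalize(vqa(image, qa.question)) == _normalize(qa.answer)
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


def per_category_accuracy(
    image: Any, qa_pairs: Sequence[QAPair], vqa: VQAModel
) -> Dict[str, float]:
    """Accuracy broken down by question category, for a finer-grained view."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for qa in qa_pairs:
        totals[qa.category] += 1
        if _normalize(vqa(image, qa.question)) == _normalize(qa.answer):
            hits[qa.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Toy stand-ins: the "image" is a dict of facts, and the "VQA model" looks them up.
def toy_vqa(image: Dict[str, str], question: str) -> str:
    return image.get(question, "unknown")


# Questions a language model might produce for the text input "two brown dogs".
toy_pairs = [
    QAPair("what animal is shown?", "dog", "object"),
    QAPair("how many dogs are there?", "2", "counting"),
    QAPair("what color is the dog?", "brown", "color"),
]

# A generated image that shows only one brown dog, so the counting question fails.
toy_image = {
    "what animal is shown?": "dog",
    "how many dogs are there?": "1",
    "what color is the dog?": "Brown",
}

print(faithfulness_score(toy_image, toy_pairs, toy_vqa))
print(per_category_accuracy(toy_image, toy_pairs, toy_vqa))
```

In this toy case the object and color questions are answered correctly and the counting question is not, which mirrors the kind of per-category breakdown the abstract describes. In practice, `toy_vqa` would be replaced by a call to an existing VQA model and the question-answer pairs would come from a language model.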

updated: Tue Mar 21 2023 14:41:02 GMT+0000 (UTC)

published: Tue Mar 21 2023 14:41:02 GMT+0000 (UTC)

arXiv
