DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

Jaemin Cho; Abhay Zala; Mohit Bansal

Recently, DALL-E, a multimodal transformer language model, and its variants, including diffusion models, have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper bound accuracy in object counting and spatial relation understanding skills. Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images across various professions and attributes. We demonstrate that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs. We hope our work will help guide future progress in improving text-to-image generation models on visual reasoning skills and learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval
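As a rough illustration of the second measurement described in the abstract (the gender/skin tone distribution of generated images for each prompt), the following minimal sketch computes a per-prompt label distribution and its distance from a uniform distribution. Everything here is an assumption made for illustration: the prompts, the label values, the category names, and the use of total variation distance as a skew score are hypothetical and are not taken from the paper or from the DallEval code.

```python
from collections import Counter


def label_distribution(labels, categories):
    """Return the fraction of labels that fall into each category."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {category: counts.get(category, 0) / total for category in categories}


def distance_from_uniform(distribution):
    """Total variation distance between a distribution and the uniform one.

    0.0 means perfectly balanced; larger values mean more skew.
    This score is an illustrative choice, not the paper's metric.
    """
    uniform = 1.0 / len(distribution)
    return 0.5 * sum(abs(p - uniform) for p in distribution.values())


# Hypothetical detected labels for images generated from two prompts.
categories = ["female", "male"]
labels_by_prompt = {
    "a photo of a nurse": ["female"] * 9 + ["male"],
    "a photo of a teacher": ["female"] * 5 + ["male"] * 5,
}

skew_by_prompt = {
    prompt: distance_from_uniform(label_distribution(labels, categories))
    for prompt, labels in labels_by_prompt.items()
}

for prompt, skew in skew_by_prompt.items():
    print(f"{prompt}: skew = {skew:.2f}")
```

In this toy data the first prompt yields a strongly skewed distribution and the second a balanced one. In practice the labels would come from an automated detector or human annotation of the generated images, which this sketch does not model.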

updated: Wed Aug 30 2023 18:41:01 GMT+0000 (UTC)

published: Tue Feb 08 2022 18:36:52 GMT+0000 (UTC)

arXiv
