DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Models

Jaemin Cho; Abhay Zala; Mohit Bansal

Recently, DALL-E, a multimodal transformer language model, and its variants (including diffusion models) have shown high-quality text-to-image generation capabilities. However, despite the interesting image generation results, there has been no detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic dataset and evaluation toolkit that measures these skills. In our experiments, we find a large gap between the performance of recent text-to-image models and the upper-bound accuracy on object counting and spatial relation understanding. Second, we assess gender and skin tone biases by measuring the variance of the gender/skin tone distribution based on automated and human evaluation. We demonstrate that recent text-to-image models learn specific gender/skin tone biases from web image-text pairs. We hope that our work will help guide future progress in improving the visual reasoning skills of text-to-image generation models and in learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval
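
As a rough illustration of what "measuring the variance of the gender/skin tone distribution" could look like in practice, the sketch below computes the share of generated images assigned to each category and the variance of those shares. It is not the authors' implementation: the function name `distribution_variance`, the category labels, and the choice of population variance over category proportions are all assumptions made for illustration. A perfectly balanced distribution gives a variance of 0, and the value grows as the distribution becomes more skewed.

```python
from collections import Counter


def distribution_variance(labels, categories):
    """Return (proportions, variance) for `labels` over a fixed list of `categories`.

    Hypothetical helper: `labels` holds one predicted category per generated
    image (from an automated classifier or a human annotator).
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    if not categories:
        raise ValueError("categories must be non-empty")
    unknown = set(labels) - set(categories)
    if unknown:
        raise ValueError(f"labels outside categories: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    # Proportion of images assigned to each category (missing categories count as 0).
    proportions = [counts.get(c, 0) / total for c in categories]
    # Population variance of the proportions around their mean (1 / number of categories).
    mean = sum(proportions) / len(proportions)
    variance = sum((p - mean) ** 2 for p in proportions) / len(proportions)
    return dict(zip(categories, proportions)), variance


if __name__ == "__main__":
    # Assumed example: 10 images generated from one prompt, two illustrative categories.
    categories = ["category_a", "category_b"]
    skewed = ["category_a"] * 9 + ["category_b"] * 1
    balanced = ["category_a"] * 5 + ["category_b"] * 5
    print(distribution_variance(skewed, categories))
    print(distribution_variance(balanced, categories))
```

Comparing this variance across prompts (for example, different professions) would indicate which prompts yield the most skewed outputs under these assumptions.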

updated: Mon Nov 14 2022 18:39:38 GMT+0000 (UTC)

published: Tue Feb 08 2022 18:36:52 GMT+0000 (UTC)

arXiv
