Visual Conceptual Blending with Large-scale Language and Vision Models

Songwei Ge; Devi Parikh

We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction.
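The three stages described in the abstract (pick a related object, write a one-sentence blend description, render that description) can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: the names `blend_concept`, `find_related`, `describe_blend`, and `render_image` are hypothetical, and each stage is passed in as a plain callable so that any language model or text-to-image model could be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Blend:
    """Result of blending one object with a related one (hypothetical container)."""
    source: str       # the object we started from
    related: str      # the object chosen to blend with it
    description: str  # single-sentence description of the blend
    image: Any        # whatever the image generator returns


def blend_concept(
    obj: str,
    find_related: Callable[[str], str],
    describe_blend: Callable[[str, str], str],
    render_image: Callable[[str], Any],
) -> Blend:
    """Run the three stages in order; every stage is an injected assumption."""
    related = find_related(obj)                  # stage 1: identify a relevant object
    description = describe_blend(obj, related)   # stage 2: language model writes one sentence
    image = render_image(description)            # stage 3: text-based image generation
    return Blend(obj, related, description, image)
```

In practice `find_related` and `describe_blend` would wrap prompts to a language model and `render_image` would wrap a text-to-image model; trivial stand-in functions are enough to exercise the control flow.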

updated: Sun Jun 27 2021 02:48:39 GMT+0000 (UTC)

published: Sun Jun 27 2021 02:48:39 GMT+0000 (UTC)

arXiv
