Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination

Yue Yang; Wenlin Yao; Hongming Zhang; Xiaoyang Wang; Dong Yu; Jianshu Chen

Large-scale pretrained language models have made significant advances in solving downstream language understanding tasks. However, they generally suffer from reporting bias: written text rarely states obvious commonsense knowledge explicitly, e.g., "an orange is orange". To overcome this limitation, we develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities. Specifically, we leverage two complementary types of "imagination": (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. By jointly exploiting the language inputs and the imagined images, a pretrained vision-language model (e.g., CLIP) composes a zero-shot solution to the original language task. Notably, fueling language models with imagination lets them effectively leverage visual knowledge to solve plain language tasks. As a result, Z-LaVI consistently improves the zero-shot performance of existing language models across a diverse set of language tasks.
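
The abstract does not say how the language signal and the imagination signal are combined, so the following is only a minimal sketch under stated assumptions. It assumes a multiple-choice task, assumes that some language model has already produced one score per answer choice, and assumes that a vision-language model such as CLIP has already produced a similarity score between each imagined (retrieved or generated) image and each answer choice. The weighted average, the `weight` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def imagination_probs(image_choice_sims):
    """Average per-image distributions over the answer choices.

    image_choice_sims[i][j] is an assumed similarity score between
    imagined image i and answer choice j (e.g. from CLIP).
    """
    per_image = [softmax(row) for row in image_choice_sims]
    n_choices = len(image_choice_sims[0])
    return [
        sum(dist[j] for dist in per_image) / len(per_image)
        for j in range(n_choices)
    ]


def combine(lm_scores, image_choice_sims, weight=0.5):
    """Hypothetical ensemble: weighted average of the two distributions."""
    lm = softmax(lm_scores)
    vis = imagination_probs(image_choice_sims)
    return [(1 - weight) * l + weight * v for l, v in zip(lm, vis)]


def predict(lm_scores, image_choice_sims, weight=0.5):
    """Return the index of the highest-probability answer choice."""
    probs = combine(lm_scores, image_choice_sims, weight)
    return max(range(len(probs)), key=probs.__getitem__)
```

For example, if the language model is undecided between two choices (`lm_scores=[0.0, 0.0]`) while two imagined images both favor the first choice (`image_choice_sims=[[2.0, 0.0], [1.0, 0.0]]`), `predict` selects index 0. Setting `weight=0.0` falls back to the language model alone.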

updated: Fri Oct 21 2022 21:33:10 GMT+0000 (UTC)

published: Fri Oct 21 2022 21:33:10 GMT+0000 (UTC)

arXiv
