ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

Deyao Zhu; Jun Chen; Kilichbek Haydarov; Xiaoqian Shen; Wenxuan Zhang; Mohamed Elhoseiny

Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been developed primarily to answer questions. With the recent advances in large language models (LLMs) such as ChatGPT, we find that they can ask high-quality questions when given a suitable prompt. This finding presents a new opportunity to develop automatic questioning systems. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method applied to image captioning. Here, ChatGPT is prompted to ask BLIP-2, a strong visual question-answering model, a series of informative questions about an image. By continually acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate more enriched image descriptions. We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Caption, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as with the ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators for providing the most image information. In addition, ChatCaptioner identifies 53% more objects within the image than BLIP-2 alone, as measured by WordNet synset matching. Code is available at https://github.com/Vision-CAIR/ChatCaptioner
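The question-and-answer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name chat_caption, the three callables, and the final summarization step are assumptions made for illustration, and the stand-in asker and answerer below are simple stubs rather than calls to ChatGPT or BLIP-2. In a real system the asker would wrap an LLM and the answerer would wrap a visual question-answering model bound to one image.

```python
from typing import Callable, List, Tuple

# One (question, answer) pair per round of the dialogue.
History = List[Tuple[str, str]]

Asker = Callable[[History], str]       # hypothetical: wraps an LLM such as ChatGPT
Answerer = Callable[[str], str]        # hypothetical: wraps a VQA model such as BLIP-2
Summarizer = Callable[[History], str]  # hypothetical: turns the dialogue into a caption


def chat_caption(ask: Asker, answer: Answerer, summarize: Summarizer,
                 n_rounds: int = 3) -> Tuple[str, History]:
    """Run n_rounds of questioning, then summarize the dialogue into a caption."""
    history: History = []
    for _ in range(n_rounds):
        question = ask(history)        # the asker sees all earlier Q/A pairs
        reply = answer(question)       # the answerer sees only the current question
        history.append((question, reply))
    return summarize(history), history


# Stub components so the sketch runs without any model (illustrative only).
SCRIPTED_QUESTIONS = ["What is in the image?", "What color is it?"]
STUB_FACTS = {"What is in the image?": "a dog", "What color is it?": "brown"}


def stub_ask(history: History) -> str:
    # Pick the next scripted question based on how many rounds have passed.
    return SCRIPTED_QUESTIONS[len(history) % len(SCRIPTED_QUESTIONS)]


def stub_answer(question: str) -> str:
    return STUB_FACTS.get(question, "not sure")


def stub_summarize(history: History) -> str:
    # Join the answers; a real summarizer would prompt an LLM instead.
    return "; ".join(reply for _, reply in history)


caption, dialogue = chat_caption(stub_ask, stub_answer, stub_summarize, n_rounds=2)
print(caption)
```

Passing the components in as callables keeps the loop independent of any particular model API, so the same structure can be exercised with stubs or with real models.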

updated: Sun Mar 12 2023 07:22:08 GMT+0000 (UTC)

published: Sun Mar 12 2023 07:22:08 GMT+0000 (UTC)

arXiv
