Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Kohei Uehara; Nabarun Goswami; Hanqin Wang; Toshiaki Baba; Kohtaro Tanaka; Tomohiro Hashimoto; Kai Wang; Rei Ito; Takagi Naoya; Ryo Umagami; Yingyi Wen; Tanachai Anakewat; Tatsuya Harada

The growing demand for intelligent systems that can interpret and reason about visual content calls for large Vision-and-Language Models (VLMs) that are not only accurate but also capable of explicit reasoning. This paper presents an approach to developing a VLM that reasons explicitly from visual content and textual instructions. We introduce a system that can ask a question to acquire the knowledge it needs, which improves the robustness and explicability of the reasoning process. To this end, we built a dataset generated by a Large Language Model (LLM) and designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones such as caption generation to specialized VQA tasks that require expert knowledge. We then fine-tuned an existing VLM on this dataset, which enabled the model to generate questions and perform iterative reasoning during inference. The results represent a step toward a more robust, accurate, and interpretable VLM that reasons explicitly and proactively seeks information when confronted with ambiguous visual input.
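The inference behavior described in the abstract (reason step by step, ask a question when knowledge is missing, then continue) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` type, the step kinds, and the `generate_step` and `answer_question` callables are hypothetical stand-ins for a fine-tuned VLM and an external knowledge source, and the toy model below is scripted purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # hypothetical labels: "thought", "question", "knowledge", "answer"
    text: str


def reason_with_questions(
    generate_step: Callable[[List[Step]], Step],
    answer_question: Callable[[str], str],
    max_steps: int = 8,
) -> List[Step]:
    """Run an iterative reasoning loop that may pause to ask for knowledge.

    generate_step stands in for the model: given the trace so far, it
    returns the next step. answer_question stands in for whatever source
    answers the model's question (a user, an LLM, a knowledge base).
    """
    trace: List[Step] = []
    for _ in range(max_steps):
        step = generate_step(trace)
        trace.append(step)
        if step.kind == "question":
            # Feed the acquired knowledge back into the reasoning trace.
            trace.append(Step("knowledge", answer_question(step.text)))
        elif step.kind == "answer":
            break  # a final answer ends the loop
    return trace


# Scripted toy stand-ins (assumptions, not a real VLM or knowledge source).
def toy_model(trace: List[Step]) -> Step:
    kinds = [s.kind for s in trace]
    if not kinds:
        return Step("thought", "The image shows an unfamiliar bird.")
    if "knowledge" not in kinds:
        return Step("question", "Which species has blue-and-orange plumage?")
    return Step("answer", "The bird is likely a kingfisher.")


def toy_oracle(question: str) -> str:
    return "Blue-and-orange plumage is typical of a kingfisher."


trace = reason_with_questions(toy_model, toy_oracle)
for s in trace:
    print(f"{s.kind}: {s.text}")
```

Keeping the acquired knowledge in the same trace as the model's own thoughts is one simple design choice; it makes the full reasoning chain, including what was asked and what was learned, inspectable after the fact.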

updated: Thu Jul 18 2024 02:35:30 GMT+0000 (UTC)

published: Thu Jan 18 2024 14:21:56 GMT+0000 (UTC)

arXiv

