Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Junxian Li; Di Zhang; Xunzhi Wang; Zeying Hao; Jingdi Lei; Qian Tan; Cai Zhou; Wei Liu; Weiyun Wang; Zhe Chen; Wenhai Wang; Wei Li; Shufei Zhang; Mao Su; Wanli Ouyang; Yuqiang Li; Dongzhan Zhou

In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to chemistry, designed to address the incompatibility between chemical image understanding and text analysis. The model follows the ViT-MLP-LLM architecture: we use ChemLLM-20B as the base language model, which gives ChemVLM robust capabilities in understanding and applying chemical text knowledge, and InternViT-6B as the image encoder. We curate high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compile it into a bilingual multimodal question-answering dataset. We evaluate the model on multiple open-source benchmarks and three custom evaluation sets. In these experiments, the model achieves state-of-the-art results on five of the six tasks involved. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
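
The abstract names the ViT-MLP-LLM architecture without spelling out the data flow. The toy sketch below illustrates the general pattern as it is commonly understood: patch features from an image encoder are mapped by a small MLP into the language model's embedding space and then placed alongside the text token embeddings. It is not the authors' implementation. The two-layer projector with GELU, the choice to prepend image tokens, the tiny dimensions, and all function names are assumptions made for illustration only.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    """Tanh approximation of GELU, applied element-wise."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in x]

def init_linear(rng, in_dim, out_dim):
    """Random weights and zero bias for a toy linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project_patches(patch_features, layer1, layer2):
    """Hypothetical two-layer MLP projector: vision space -> LLM embedding space."""
    projected = []
    for feat in patch_features:
        hidden = gelu(linear(feat, *layer1))
        projected.append(linear(hidden, *layer2))
    return projected

def build_llm_input(patch_features, text_embeddings, layer1, layer2):
    """Prepend projected image tokens to the text token embeddings (assumed order)."""
    return project_patches(patch_features, layer1, layer2) + text_embeddings

# Toy sizes, chosen only to keep the example small.
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 8, 4, 3, 5
rng = random.Random(0)
layer1 = init_linear(rng, VISION_DIM, LLM_DIM)
layer2 = init_linear(rng, LLM_DIM, LLM_DIM)

# Stand-ins for the image encoder output and the LLM's text token embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

llm_input = build_llm_input(patch_features, text_embeddings, layer1, layer2)
print(len(llm_input), len(llm_input[0]))  # sequence length, embedding width
```

In a real system the random vectors would be replaced by the image encoder's patch outputs and the language model's embedding lookup, and the projector weights would be learned during training.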

updated: Wed Aug 14 2024 01:16:40 GMT+0000 (UTC)

published: Wed Aug 14 2024 01:16:40 GMT+0000 (UTC)

arXiv
