MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Xiangxiang Chu; Limeng Qiao; Xinyu Zhang; Shuang Xu; Fei Wei; Yang Yang; Xiaofei Sun; Yiming Hu; Xinyang Lin; Bo Zhang; Chunhua Shen

We introduce MobileVLM V2, a family of vision language models that significantly improve upon MobileVLM. It demonstrates that a careful combination of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially benefit VLM performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.

updated: Tue Feb 06 2024 07:16:36 GMT+0000 (UTC)

published: Tue Feb 06 2024 07:16:36 GMT+0000 (UTC)

arXiv
