Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Matt Deitke; Christopher Clark; Sangho Lee; Rohun Tripathi; Yue Yang; Jae Sung Park; Mohammadreza Salehi; Niklas Muennighoff; Kyle Lo; Luca Soldaini; Jiasen Lu; Taira Anderson; Erin Bransom; Kiana Ehsani; Huong Ngo; YenSung Chen; Ajay Patel; Mark Yatskar; Chris Callison-Burch; Andrew Head; Rose Hendrix; Favyen Bastani; Eli VanderBilt; Nathan Lambert; Yvonne Chou; Arnavi Chheda; Jenna Sparks; Sam Skjonsberg; Michael Schmitz; Aaron Sarnat; Byron Bischoff; Pete Walsh; Chris Newell; Piper Wolters; Tanmay Gupta; Kuo-Hao Zeng; Jon Borchardt; Dirk Groeneveld; Jen Dumas; Crystal Nam; Sophie Lebrecht; Caitlin Wittlif; Carissa Schoenick; Oscar Michel; Ranjay Krishna; Luca Weihs; Noah A. Smith; Hannaneh Hajishirzi; Ross Girshick; Ali Farhadi; Aniruddha Kembhavi

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open-weight and open-data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and a demo are available at https://molmo.allenai.org.

updated: Wed Sep 25 2024 17:59:51 GMT+0000 (UTC)

published: Wed Sep 25 2024 17:59:51 GMT+0000 (UTC)

arXiv
