PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer; Andreas Steiner; André Susano Pinto; Alexander Kolesnikov; Xiao Wang; Daniel Salz; Maxim Neumann; Ibrahim Alabdulmohsin; Michael Tschannen; Emanuele Bugliarello; Thomas Unterthiner; Daniel Keysers; Skanda Koppula; Fangyu Liu; Adam Grycner; Alexey Gritsenko; Neil Houlsby; Manoj Kumar; Keran Rong; Julian Eisenschlos; Rishabh Kabra; Matthias Bauer; Matko Bošnjak; Xi Chen; Matthias Minderer; Paul Voigtlaender; Ioana Bica; Ivana Balazevic; Joan Puigcerver; Pinelopi Papalampidi; Olivier Henaff; Xi Xiong; Radu Soricut; Jeremiah Harmsen; Xiaohua Zhai

PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile, broadly knowledgeable base model that transfers effectively. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.

updated: Thu Oct 10 2024 17:28:23 GMT+0000 (UTC)

published: Wed Jul 10 2024 14:57:46 GMT+0000 (UTC)

arXiv

