Prismer: A Vision-Language Model with An Ensemble of Experts

Shikun Liu; Linxi Fan; Edward Johns; Zhiding Yu; Chaowei Xiao; Anima Anandkumar

Recent vision-language models have shown impressive multi-modal generation capabilities, but they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts. Prismer requires training only a small number of components; the majority of its network weights are inherited from readily available, pre-trained domain experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, Prismer achieves fine-tuned and few-shot learning performance that is competitive with current state-of-the-art models while requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.

updated: Sun Mar 12 2023 02:30:16 GMT+0000 (UTC)

published: Sat Mar 04 2023 21:22:47 GMT+0000 (UTC)

arXiv
