PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen; Xiao Wang; Soravit Changpinyo; AJ Piergiovanni; Piotr Padlewski; Daniel Salz; Sebastian Goodman; Adam Grycner; Basil Mustafa; Lucas Beyer; Alexander Kolesnikov; Joan Puigcerver; Nan Ding; Keran Rong; Hassan Akbari; Gaurav Mishra; Linting Xue; Ashish Thapliyal; James Bradbury; Weicheng Kuo; Mojtaba Seyedhosseini; Chao Jia; Burcu Karagol Ayan; Carlos Riquelme; Andreas Steiner; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut

Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
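
The abstract describes the interface only at a high level: visual and textual inputs go in, text comes out. As a rough illustration, the sketch below shows one way such inputs could be merged into a single sequence for an encoder-decoder model. It is an assumption for illustration, not the authors' implementation: the helper names (patchify, embed_text, build_encoder_input), the toy embedding rule, and the simple concatenation are all hypothetical, and a real ViT would also apply a learned projection to each patch.

```python
from typing import List

Vector = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Vector]:
    """Split an H x W grayscale image into non-overlapping patch x patch
    tiles and flatten each tile into one vector (a "visual token")."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens: List[Vector] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def embed_text(token_ids: List[int], dim: int) -> List[Vector]:
    """Hypothetical stand-in for a learned embedding table: maps each
    token id to a fixed, deterministic vector of length dim."""
    return [[float((tid + i) % 7) for i in range(dim)] for tid in token_ids]


def build_encoder_input(image: List[List[float]],
                        token_ids: List[int],
                        patch: int) -> List[Vector]:
    """Assumed arrangement: visual tokens first, then text tokens, all
    sharing one width so an encoder can attend over the whole sequence."""
    visual = patchify(image, patch)
    textual = embed_text(token_ids, dim=patch * patch)
    return visual + textual


# Toy 4 x 4 image and a three-token prompt.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
sequence = build_encoder_input(image, [3, 1, 4], patch=2)
print(len(sequence), len(sequence[0]))  # prints "7 4"
```

With a 4 x 4 image and 2 x 2 patches, the image contributes four visual tokens and the prompt contributes three text tokens, so the combined sequence has seven entries of width four. A text decoder would then generate the output conditioned on that sequence.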

updated: Wed Sep 14 2022 17:24:07 GMT+0000 (UTC)

published: Wed Sep 14 2022 17:24:07 GMT+0000 (UTC)

arXiv
