PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen; Xiao Wang; Soravit Changpinyo; AJ Piergiovanni; Piotr Padlewski; Daniel Salz; Sebastian Goodman; Adam Grycner; Basil Mustafa; Lucas Beyer; Alexander Kolesnikov; Joan Puigcerver; Nan Ding; Keran Rong; Hassan Akbari; Gaurav Mishra; Linting Xue; Ashish Thapliyal; James Bradbury; Weicheng Kuo; Mojtaba Seyedhosseini; Chao Jia; Burcu Karagol Ayan; Carlos Riquelme; Andreas Steiner; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
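The image-plus-text-in, text-out interface described above can be illustrated with a minimal sketch. The abstract does not say how the visual and textual inputs are combined, so the sketch assumes that image patches and prompt tokens are placed in one encoder input sequence, with the image patches first. All names here (`patchify`, `build_encoder_input`, the `"img"`/`"txt"` tags) are hypothetical and are not taken from the PaLI code.

```python
from typing import List, Tuple, Union

Patch = List[float]
Token = Tuple[str, Union[Patch, int]]


def patchify(image: List[List[float]], patch: int) -> List[Patch]:
    """Split an H x W grid into flattened, non-overlapping patch x patch tiles.

    Rows and columns that do not fill a whole tile are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - height % patch, patch):
        for left in range(0, width - width % patch, patch):
            patches.append([
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ])
    return patches


def build_encoder_input(image: List[List[float]],
                        prompt_ids: List[int],
                        patch: int = 2) -> List[Token]:
    """Assumed layout: visual tokens first, then the text prompt tokens."""
    visual = [("img", p) for p in patchify(image, patch)]
    text = [("txt", t) for t in prompt_ids]
    return visual + text


# Toy 4 x 4 "image" and a three-token prompt (token ids are made up).
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
sequence = build_encoder_input(toy_image, prompt_ids=[5, 6, 7], patch=2)
print(len(sequence), [kind for kind, _ in sequence])
```

In a real model the patches would be embedded by the ViT and the prompt by the language model's embedding table before the encoder-decoder generates output text; the sketch only shows the shape of the combined input under the stated assumption.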

updated: Sun May 28 2023 23:46:10 GMT+0000 (UTC)

published: Wed Sep 14 2022 17:24:07 GMT+0000 (UTC)

arXiv
