mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Gregor Geigle; Abhay Jain; Radu Timofte; Goran Glavaš

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to 'understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner (on consumer hardware, using only a few million training examples) by leveraging a pretrained multilingual LLM. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM. For this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on orders of magnitude less data. We release our model and code at https://github.com/gregor-ge/mBLIP.
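The data step described in the abstract (machine-translating English vision-and-language data into many target languages) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not taken from the mBLIP code: the function name `build_multilingual_mix`, the example dictionary layout, the round-robin assignment of one target language per example, and the stand-in `fake_translate`, which only tags the text instead of calling a real machine translation system. The abstract does not state how examples are distributed across the 95 languages.

```python
from collections import Counter
from typing import Callable, Dict, List


def build_multilingual_mix(
    english_examples: List[Dict[str, str]],
    target_languages: List[str],
    translate: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Translate each English example into one target language.

    Languages are assigned round-robin (an assumption of this sketch),
    so every language receives roughly the same number of examples.
    """
    mixed = []
    for i, example in enumerate(english_examples):
        lang = target_languages[i % len(target_languages)]
        mixed.append(
            {
                "image": example["image"],  # the image is left unchanged
                "text": translate(example["text"], lang),
                "lang": lang,
            }
        )
    return mixed


def fake_translate(text: str, lang: str) -> str:
    """Hypothetical placeholder for a machine translation call."""
    return f"[{lang}] {text}"


# Six toy English captions spread over three example language codes.
examples = [{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(6)]
mix = build_multilingual_mix(examples, ["de", "ja", "sw"], fake_translate)
lang_counts = Counter(item["lang"] for item in mix)
```

In a real pipeline, `fake_translate` would be replaced by an actual translation model, and the language list would cover the 95 languages mentioned in the abstract; a different sampling scheme than round-robin may well be preferable.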

updated: Thu Jul 13 2023 17:51:58 GMT+0000 (UTC)

published: Thu Jul 13 2023 17:51:58 GMT+0000 (UTC)

arXiv
