Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations

Gregor Geigle; Radu Timofte; Goran Glavaš

Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. The bulk of the evaluation of these models is, however, performed with English text only: the costly creation of language-specific image-caption datasets has limited multilingual VL benchmarks to a handful of high-resource languages. In this work, we introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of 1000 ImageNet labels to 92 languages, built without resorting to machine translation (MT) or requiring manual annotation. We instead automatically obtain reliable translations of ImageNet concepts by linking them -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network. We evaluate 8 different publicly available multilingual CLIP models on zero-shot image classification (ZS-IC) for each of the 92 Babel-ImageNet languages, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance on Babel-ImageNet highly correlates with their performance in image-text retrieval, validating that Babel-ImageNet is suitable for estimating the quality of the multilingual VL representation spaces for the vast majority of languages that lack gold image-text data. Finally, we show that the performance of multilingual CLIP for low-resource languages can be drastically improved via cheap, parameter-efficient language-specific training. We make our code and data publicly available: https://github.com/gregor-ge/Babel-ImageNet
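To make the ZS-IC setup in the abstract concrete, here is a minimal sketch of zero-shot classification with a dual-encoder model: each image embedding is assigned the class whose label-text embedding has the highest cosine similarity. It is not the authors' code. The embeddings are toy vectors standing in for CLIP encoder outputs, the class ids (`dog`, `cat`, `bird`) and function names are hypothetical, and scoring only the classes that have a label in the target language is an assumption based on the abstract's mention of partial translations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs):
    """Return the class id whose label embedding is closest to the image."""
    return max(label_embs, key=lambda cid: cosine(image_emb, label_embs[cid]))


def top1_accuracy(samples, label_embs):
    """Top-1 accuracy over samples whose gold class has a label.

    samples: list of (image_embedding, gold_class_id) pairs.
    label_embs: class id -> text embedding of that class's label in the
    target language. Classes without a translation are simply absent
    (assumption: such classes are excluded from scoring).
    """
    scored = [(emb, gold) for emb, gold in samples if gold in label_embs]
    if not scored:
        return 0.0
    correct = sum(zero_shot_classify(emb, label_embs) == gold
                  for emb, gold in scored)
    return correct / len(scored)


# Toy data: "bird" has no translation in this hypothetical language.
label_embs = {
    "dog": [1.0, 0.0, 0.1],
    "cat": [0.0, 1.0, 0.1],
}
samples = [
    ([0.9, 0.1, 0.0], "dog"),   # closest to "dog": correct
    ([0.2, 0.8, 0.0], "cat"),   # closest to "cat": correct
    ([0.7, 0.3, 0.0], "cat"),   # closest to "dog": wrong
    ([0.0, 0.0, 1.0], "bird"),  # skipped: no label in this language
]
accuracy = top1_accuracy(samples, label_embs)
print(f"top-1 accuracy: {accuracy:.3f}")
```

In a real evaluation the vectors would come from the image and text encoders of a multilingual CLIP model, and the label dictionary would be built from the Babel-ImageNet translations for one language at a time.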

updated: Wed Jun 14 2023 17:53:06 GMT+0000 (UTC)

published: Wed Jun 14 2023 17:53:06 GMT+0000 (UTC)

arXiv
