Contrastive Language-Image Pre-training for the Italian Language

Federico Bianchi; Giuseppe Attanasio; Raphael Pisoni; Silvia Terragni; Gabriele Sarti; Sri Lakshmi

CLIP (Contrastive Language-Image Pre-training) is a very recent multimodal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since the data available in other languages might not be sufficient and the model needs high-quality translations of the texts to guarantee good performance. In this paper, we present the first CLIP model for the Italian language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.
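To make the zero-shot classification task mentioned in the abstract concrete, here is a minimal sketch of the general idea: an image embedding is compared with the text embeddings of candidate labels, and the most similar label wins. The vectors, the Italian label prompts, and the helper names `cosine` and `zero_shot_classify` are all hypothetical and made up for illustration; they are not outputs of CLIP-Italian or code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding.

    No classifier is trained: the candidate labels are supplied only at
    inference time, which is what makes the setup "zero-shot".
    """
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical 3-dimensional embeddings (assumption: a real model would
# produce these from its text and image encoders, in far more dimensions).
label_embeddings = {
    "un gatto": [0.9, 0.1, 0.0],
    "un cane": [0.1, 0.9, 0.0],
    "una pizza": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

Image retrieval can be viewed as the same comparison run in the other direction: a single text embedding is scored against many image embeddings, and the images are ranked by similarity.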

updated: Thu Aug 19 2021 13:53:47 GMT+0000 (UTC)

published: Thu Aug 19 2021 13:53:47 GMT+0000 (UTC)

arXiv
