Multilingual Multimodal Learning with Machine Translated Text

Chen Qiu; Dan Oneata; Emanuele Bugliarello; Stella Frank; Desmond Elliott

Most vision-and-language pretraining research focuses on English tasks. However, the creation of multilingual multimodal evaluation datasets (e.g., Multi30K, xGQA, XVNLI, and MaRVL) poses a new challenge: finding high-quality training data that is both multilingual and multimodal. In this paper, we investigate whether machine-translating English multimodal data can be an effective proxy for the lack of readily available multilingual data. We call this framework TD-MML (Translated Data for Multilingual Multimodal Learning); it can be applied to any multimodal dataset and model. We apply it to both pretraining and fine-tuning data with a state-of-the-art model. To prevent models from learning from low-quality translated text, we propose two metrics for automatically removing such translations from the resulting datasets. In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning at both pretraining and fine-tuning.

updated: Mon Oct 24 2022 11:41:20 GMT+0000 (UTC)

published: Mon Oct 24 2022 11:41:20 GMT+0000 (UTC)

arXiv
