cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Kshitij Gupta; Devansh Gautam; Radhika Mamidi

Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that uses English-only vision-language models to train a monolingual model for a target language. We extend OSCAR+, a model that leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We also propose a novel approach to knowledge distillation that trains the model in other languages using parallel sentences. Compared to other models that include the target language in their pretraining corpora, our approach leverages an existing English model to transfer knowledge to the target language using significantly fewer resources. We also release large-scale visual question answering datasets in Japanese and Hindi. Though we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models with relative accuracy gains of 4.4% for Japanese and 13.4% for Hindi.
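The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one common formulation of cross-lingual distillation with parallel sentences, not the authors' method. It assumes a frozen English teacher and a target-language student that each produce one sequence-level vector per sentence, and it penalizes the mean squared error between the two vectors for each parallel pair. The names `mse_distillation_loss` and `toy_student_step` are hypothetical, and the student vectors are treated as free parameters purely for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mse_distillation_loss(teacher_vecs: Sequence[Vector],
                          student_vecs: Sequence[Vector]) -> float:
    """Mean squared error between teacher and student sentence vectors.

    teacher_vecs[i] is assumed to encode the English sentence of pair i,
    student_vecs[i] the parallel target-language sentence.
    """
    if len(teacher_vecs) != len(student_vecs):
        raise ValueError("need one student vector per teacher vector")
    total, count = 0.0, 0
    for t, s in zip(teacher_vecs, student_vecs):
        if len(t) != len(s):
            raise ValueError("teacher and student vectors must share a dimension")
        for a, b in zip(t, s):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to compare")
    return total / count


def toy_student_step(teacher_vecs: Sequence[Vector],
                     student_vecs: Sequence[Vector],
                     lr: float) -> List[Vector]:
    """One gradient-descent step on the MSE loss.

    Assumption: the student vectors themselves are the trainable
    parameters. In a real model the gradient would flow into the
    student network's weights instead; the teacher stays frozen.
    """
    n = sum(len(v) for v in student_vecs)
    return [
        [s - lr * 2.0 * (s - t) / n for s, t in zip(sv, tv)]
        for sv, tv in zip(student_vecs, teacher_vecs)
    ]


# Toy parallel pair: teacher vector for the English sentence,
# untrained student vector for its translation.
teacher = [[1.0, 2.0]]
student = [[0.0, 0.0]]
loss_before = mse_distillation_loss(teacher, student)
student = toy_student_step(teacher, student, lr=0.5)
loss_after = mse_distillation_loss(teacher, student)
```

In this toy run the loss falls after a single step because the student vector moves toward the teacher's. Only the teacher needs to understand English and only the student needs to read the target language, which is why parallel sentences are sufficient supervision under this assumed objective.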

updated: Thu Jun 09 2022 05:40:02 GMT+0000 (UTC)

published: Tue Jun 07 2022 14:46:30 GMT+0000 (UTC)

arXiv
