Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

Yan Zeng; Wangchunshu Zhou; Ao Luo; Xinsong Zhang

In this paper, we introduce Cross-View Language Modeling, a simple and effective language model pre-training framework that unifies cross-lingual cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Notably, CCLM is the first multi-lingual multi-modal model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
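The abstract says the two views are aligned by maximizing their mutual information, partly through contrastive learning. As a rough illustration only, the sketch below computes a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings, where row i of one view is the positive match for row i of the other (an image and its caption, or a sentence and its translation). The choice of InfoNCE, the function name `info_nce`, the temperature value, and the random toy embeddings are all assumptions made for this example; they are not taken from the paper and this is not the CCLM implementation.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(view_a, view_b, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired embeddings (hypothetical helper).

    view_a[i] and view_b[i] are treated as two views of the same object;
    every other pairing in the batch is treated as a negative.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    # logits[i][j] is the temperature-scaled cosine similarity of a[i] and b[j].
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
              for ai in a]

    def diagonal_cross_entropy(rows):
        # Mean cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(column) for column in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy data (assumed): 8 pairs of 16-dimensional embeddings.
random.seed(0)
anchor = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in anchor]
unrelated = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce(anchor, aligned)
loss_unrelated = info_nce(anchor, unrelated)
print(loss_aligned, loss_unrelated)
```

In this toy setup the loss is lower when the second view is a slightly perturbed copy of the first than when it is unrelated noise, which is the behaviour a contrastive alignment objective is meant to reward. The conditional masked language modeling objective mentioned in the abstract is not covered by this sketch.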

updated: Wed Jun 01 2022 16:45:24 GMT+0000 (UTC)

published: Wed Jun 01 2022 16:45:24 GMT+0000 (UTC)

arXiv
