Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

Yan Zeng; Wangchunshu Zhou; Ao Luo; Ziming Cheng; Xinsong Zhang

In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
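
The contrastive part of the objective described above can be illustrated with a symmetric InfoNCE loss, a common lower-bound surrogate for the mutual information between two views. This is a minimal sketch under stated assumptions, not the CCLM implementation: the function names (`normalize`, `info_nce`), the temperature value, and the use of plain Python lists as embeddings are all hypothetical choices made for illustration. The abstract does not specify the encoders or the exact loss form.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    view_a[i] and view_b[i] are assumed to embed two views of the same
    object (an image and its caption, or a sentence and its translation).
    Every other pairing in the batch serves as a negative.
    The temperature default is an illustrative assumption.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    n = len(a)

    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean of -log softmax(row)[i], with the matching pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the a-to-b and b-to-a directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


if __name__ == "__main__":
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

Under this sketch, the same loss applies unchanged whether the second view is an image embedding or a translated-sentence embedding, which is the sense in which the two kinds of paired data can share one objective.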

updated: Mon Jun 12 2023 12:47:16 GMT+0000 (UTC)

published: Wed Jun 01 2022 16:45:24 GMT+0000 (UTC)

arXiv
