UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

Wei Li; Can Gao; Guocheng Niu; Xinyan Xiao; Hao Liu; Jiachen Liu; Hua Wu; Haifeng Wang

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose UNIMO-2, an end-to-end unified-modal pre-training framework for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual representations, and the semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts and aligns the visual and textual semantic spaces across different types of corpora. Experiments show that our grounded learning method can improve textual and visual semantic alignment, which in turn improves performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are publicly available at the UNIMO project page: https://unimo-ptm.github.io/.

updated: Thu Mar 17 2022 03:53:11 GMT+0000 (UTC)

published: Thu Mar 17 2022 03:53:11 GMT+0000 (UTC)

arXiv

