E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

Haiyang Xu; Ming Yan; Chenliang Li; Bin Bi; Songfang Huang; Wenming Xiao; Fei Huang

Vision-language pre-training (VLP) on large-scale image-text pairs has achieved great success on cross-modal downstream tasks. Most existing pre-training methods adopt a two-step training procedure: they first employ a pre-trained object detector to extract region-based visual features, then concatenate the image representation and text embedding as the input to a Transformer for training. However, these methods face two problems: they rely on the task-specific visual representation of a particular object detector for generic cross-modal understanding, and the two-stage pipeline is computationally inefficient. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, in which we build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text. To enhance visual learning, we incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture. An extensive set of experiments has been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.
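
The abstract describes concatenating an image representation with text embeddings to form a single Transformer input. The following is a minimal sketch of that concatenation step only, not the authors' implementation. It assumes the image representation is an H x W grid of feature vectors that is flattened into a token sequence, and that text tokens come first; the function names `flatten_grid` and `build_joint_sequence`, the toy values, and the token order are all hypothetical.

```python
from typing import List

Vector = List[float]


def flatten_grid(feature_map: List[List[Vector]]) -> List[Vector]:
    """Flatten an H x W grid of feature vectors into H*W visual tokens (row-major)."""
    return [cell for row in feature_map for cell in row]


def build_joint_sequence(
    text_embeddings: List[Vector], feature_map: List[List[Vector]]
) -> List[Vector]:
    """Concatenate text tokens and flattened visual tokens into one sequence.

    Assumption: both modalities were already projected to the same hidden size.
    """
    visual_tokens = flatten_grid(feature_map)
    hidden_sizes = {len(vec) for vec in text_embeddings + visual_tokens}
    if len(hidden_sizes) != 1:
        raise ValueError("text and visual tokens must share one hidden size")
    return text_embeddings + visual_tokens


# Toy input: 3 text tokens and a 2 x 2 feature grid, hidden size 2.
text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
grid = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[3.0, 3.0], [4.0, 4.0]],
]

sequence = build_joint_sequence(text, grid)
print(len(sequence))  # 3 text tokens + 4 visual tokens
```

In a real model the joint sequence would then be passed to a Transformer encoder; that part is omitted here because the abstract does not specify its details.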

updated: Thu Jun 03 2021 12:50:26 GMT+0000 (UTC)

published: Thu Jun 03 2021 12:50:26 GMT+0000 (UTC)

arXiv
