KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Yongfei Liu; Chenfei Wu; Shao-yen Tseng; Vasudev Lal; Xuming He; Nan Duan

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy that relies on external object detectors to encode images in a multi-modal Transformer framework; this strategy suffers from a restrictive object concept space, limited image context, and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks that take object features and their semantic labels from external detectors as supervision: 1) an object-guided masked vision modeling task, which enforces object-aware representation learning in the multi-modal Transformer; and 2) a phrase-region alignment task, which improves cross-modal alignment by exploiting the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of the proposed framework, which achieves competitive or superior performance relative to existing pretraining strategies.
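To make the phrase-region alignment idea more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each noun phrase and each detector object label has already been embedded as a vector in a shared linguistic space, that soft alignment targets are obtained by a softmax over cosine similarities, and that the model's phrase-to-region scores are trained against those targets with a cross-entropy loss. The function names, the temperature value, and the toy vectors are all hypothetical; the abstract does not specify any of them.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_targets(phrase_vec, label_vecs, temperature=0.1):
    """Assumed target construction: a distribution over regions derived from
    the similarity between one noun phrase and each region's object label."""
    sims = [cosine(phrase_vec, lv) for lv in label_vecs]
    return softmax(sims, temperature)


def alignment_loss(pred_scores, targets):
    """Assumed loss: cross-entropy between the soft targets and the model's
    predicted phrase-to-region distribution."""
    pred = softmax(pred_scores)
    return -sum(t * math.log(p) for t, p in zip(targets, pred))


# Toy example: one phrase, two regions whose labels are embedded as 2-d vectors.
phrase = [1.0, 0.0]
labels = [[1.0, 0.1], [0.0, 1.0]]
targets = soft_targets(phrase, labels)

# Predicted scores that favour region 0 should give a lower loss than
# scores that favour region 1, because the phrase is closer to label 0.
loss_good = alignment_loss([5.0, 0.0], targets)
loss_bad = alignment_loss([0.0, 5.0], targets)
```

In this sketch the detector is used only to supply labels for the targets, which is in keeping with the distillation framing above: the end-to-end model itself would consume grid features and would not need the detector at inference time.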

updated: Sun Aug 07 2022 18:27:10 GMT+0000 (UTC)

published: Wed Sep 22 2021 03:38:05 GMT+0000 (UTC)

arXiv
