Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Yan Zeng; Xinsong Zhang; Hang Li

Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts and, at the same time, to align the texts with the visual concepts, where the alignments are made at multiple granularities. Experimental results show that X-VLM effectively applies the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
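The two objectives named in the abstract, locating a visual concept given its text and aligning the text with that concept, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the X-VLM implementation: the function names (`box_iou`, `localization_loss`, `contrastive_loss`), the choice of an L1-plus-IoU box loss, the in-batch softmax contrastive loss, and the temperature value are all hypothetical. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in normalized image coordinates, and a "visual concept" may be an object, a region, or the whole image.

```python
import math


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_loss(pred, target):
    """Assumed box loss: mean L1 distance plus (1 - IoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    return l1 + (1.0 - box_iou(pred, target))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_vecs, concept_vecs, temperature=0.07):
    """Text-to-concept softmax loss: pair i is the positive,
    every other concept in the batch is a negative."""
    total = 0.0
    for i, text in enumerate(text_vecs):
        logits = [cosine(text, c) / temperature for c in concept_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(text_vecs)


# Toy batch: two texts, each paired with one visual concept embedding.
texts = [[1.0, 0.0], [0.0, 1.0]]
concepts = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(texts, concepts)
shuffled = contrastive_loss(texts, concepts[::-1])
overlap = box_iou((0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75))
```

In this toy batch the correctly paired embeddings give a lower contrastive loss than the shuffled ones, and a predicted box that matches its target exactly gives a localization loss of zero. A real system would compute the embeddings and box predictions with learned image and text encoders, which this sketch leaves out.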

updated: Wed Jun 01 2022 16:45:09 GMT+0000 (UTC)

published: Tue Nov 16 2021 07:55:26 GMT+0000 (UTC)

arXiv
