Position-guided Text Prompt for Vision-Language Pre-training

Alex Jinpeng Wang; Pan Zhou; Mike Zheng Shou; Shuicheng Yan

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability that is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N×N blocks and identifies the objects in each block using an object detector of the kind widely used in VLP. It then reformulates the visual grounding task as a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling "P" or "O" in a PTP "The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistent and significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K retrieval (+4.8 in average recall@1) for the ViLT baseline, and COCO captioning (+5.3 in CIDEr) for the SOTA BLIP baseline. Moreover, PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at https://github.com/sail-sg/ptp.
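The prompt construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how an object is assigned to a block or how blocks are numbered, so the sketch assumes assignment by bounding-box center and row-major, zero-based block indices. The function names `block_index` and `make_prompts` are hypothetical, and the detections are assumed to come from some external object detector.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; assumed detector output format
Box = Tuple[float, float, float, float]


def block_index(box: Box, width: int, height: int, n: int) -> int:
    """Return the block containing the box center.

    Assumption: blocks are numbered row-major from 0 to n*n - 1.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx * n / width), n - 1)
    row = min(int(cy * n / height), n - 1)
    return row * n + col


def make_prompts(
    detections: List[Tuple[str, Box]], width: int, height: int, n: int = 3
) -> List[str]:
    """Fill the template "The block P has a O" for each detected object."""
    return [
        f"The block {block_index(box, width, height, n)} has a {label}"
        for label, box in detections
    ]


# Hypothetical detections on a 600x600 image split into 3x3 blocks.
example = [("dog", (0, 0, 100, 100)), ("ball", (500, 500, 600, 600))]
prompts = make_prompts(example, width=600, height=600, n=3)
```

During pre-training, either the block slot or the object slot of such a prompt would be masked and predicted by the model. Because the prompts are only needed at that stage, the detector can be dropped at inference time, as the abstract notes.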

updated: Mon Dec 19 2022 18:55:43 GMT+0000 (UTC)

published: Mon Dec 19 2022 18:55:43 GMT+0000 (UTC)

arXiv
