Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Hengcan Shi; Munawar Hayat; Jianfei Cai

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, in which the training data contains only images and queries, with no correspondences between them. The few existing solutions to unpaired referring grounding are still preliminary, owing to the difficulty of learning image-text matching and the lack of top-down guidance when only unpaired data is available. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps. We further introduce a cross-modal object matching (COM) module, which exploits CLIP, a recently released pretrained image-text matching model, to predict the target objects from a bottom-up perspective. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt pretrained knowledge to the target dataset and task. Experiments show that our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
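The idea of combining a top-down score with a bottom-up score for each candidate object can be illustrated with a small sketch. The abstract does not say how the SF module combines the two predictions, so the weighted sum of softmax-normalized scores below is purely an assumption, as are all function and parameter names. The attention map stands in for a QAM-style output, and `matching_scores` stands in for per-region image-text similarities of the kind a model such as CLIP could provide; no real model is called.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_score(attention_map: Sequence[Sequence[float]], box: Box) -> float:
    """Top-down score (assumed): mean attention value inside the box."""
    x0, y0, x1, y1 = box
    vals = [attention_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)


def fuse_and_select(
    attention_map: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    matching_scores: Sequence[float],
    weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> Tuple[int, List[float]]:
    """Fuse top-down and bottom-up scores and pick the best candidate box."""
    top_down = softmax([attention_score(attention_map, b) for b in boxes])
    bottom_up = softmax(list(matching_scores))
    # Assumed fusion rule: convex combination of the two distributions.
    fused = [weight * t + (1.0 - weight) * b for t, b in zip(top_down, bottom_up)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy 4x4 attention map that is high in the top-left quadrant.
toy_map = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
toy_boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
toy_matching = [0.2, 0.3]  # made-up bottom-up similarities

best_index, fused_scores = fuse_and_select(toy_map, toy_boxes, toy_matching)
```

In this toy case the bottom-up scores slightly favor the second box, while the attention map strongly favors the first, so the fused prediction selects the first box. This only illustrates why the two perspectives can complement each other; it is not the paper's implementation.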

updated: Tue Jan 18 2022 01:13:19 GMT+0000 (UTC)

published: Tue Jan 18 2022 01:13:19 GMT+0000 (UTC)

arXiv
