Focusing On Targets For Improving Weakly Supervised Visual Grounding

Viet-Quoc Pham; Nao Mishima

Weakly supervised visual grounding aims to predict the image region that corresponds to a given linguistic query, where the mapping between the target object and the query is unknown during training. The state-of-the-art method uses a vision-language pre-training model to obtain heatmaps from Grad-CAM, which matches every query word with an image region, and uses the combined heatmap to rank the region proposals. In this paper, we propose two simple but efficient methods for improving this approach. First, we propose a target-aware cropping approach to encourage the model to learn both object-level and scene-level semantic representations. Second, we apply dependency parsing to extract the words related to the target object, and then emphasize these words in the heatmap combination. Our method surpasses the previous SOTA methods on RefCOCO, RefCOCO+, and RefCOCOg by a notable margin.
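To make the second idea concrete, the following is a minimal sketch of a target-weighted heatmap combination followed by proposal ranking. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting scheme, so the function names (`combine_heatmaps`, `box_score`, `rank_proposals`), the `target_weight` parameter, the weighted-average combination, and the mean-activation box score are all hypothetical. The target words are supplied by hand here; in the paper they come from dependency parsing, and the per-word heatmaps come from Grad-CAM.

```python
from typing import List, Sequence, Set, Tuple

Heatmap = List[List[float]]          # heatmap[y][x], one per query word
Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1), x1 and y1 exclusive


def combine_heatmaps(heatmaps: Sequence[Heatmap], words: Sequence[str],
                     target_words: Set[str], target_weight: float) -> Heatmap:
    """Weighted average of per-word heatmaps (assumed combination rule).

    Words in target_words get weight target_weight; all others get 1.0.
    """
    weights = [target_weight if w in target_words else 1.0 for w in words]
    total = sum(weights)
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(wt * hm[y][x] for wt, hm in zip(weights, heatmaps)) / total
             for x in range(width)]
            for y in range(height)]


def box_score(heatmap: Heatmap, box: Box) -> float:
    """Mean heatmap activation inside a box (assumed scoring rule)."""
    x0, y0, x1, y1 = box
    values = [heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def rank_proposals(heatmap: Heatmap, boxes: Sequence[Box]) -> List[Box]:
    """Sort region proposals from highest to lowest score."""
    return sorted(boxes, key=lambda b: box_score(heatmap, b), reverse=True)


# Toy example for the query "dog on sofa" on a 4x2 grid (made-up values).
words = ["dog", "on", "sofa"]
heatmaps = [
    [[0.6, 0.6, 0.0, 0.0], [0.6, 0.6, 0.0, 0.0]],   # "dog": left half
    [[0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1]],   # "on": diffuse
    [[0.0, 0.0, 0.9, 0.9], [0.0, 0.0, 0.9, 0.9]],   # "sofa": right half
]
left_box, right_box = (0, 0, 2, 2), (2, 0, 4, 2)
boxes = [left_box, right_box]

# Equal weights: the stronger "sofa" response dominates the ranking.
uniform = combine_heatmaps(heatmaps, words, {"dog"}, target_weight=1.0)
# Emphasizing the target word "dog" moves the left box to the top.
weighted = combine_heatmaps(heatmaps, words, {"dog"}, target_weight=3.0)

best_uniform = rank_proposals(uniform, boxes)[0]
best_weighted = rank_proposals(weighted, boxes)[0]
```

In this toy case the equal-weight combination ranks the sofa region first, while emphasizing the target word ranks the dog region first, which is the kind of failure the emphasis on target-related words is meant to address.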

updated: Wed Feb 22 2023 10:02:21 GMT+0000 (UTC)

published: Wed Feb 22 2023 10:02:21 GMT+0000 (UTC)

arXiv
