A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension

Yue Liao; Si Liu; Guanbin Li; Fei Wang; Yanjie Chen; Chen Qian; Bo Li

Referring expression comprehension aims to localize the object instance described by a natural language expression. Current methods achieve good performance, but none of them achieves real-time inference without a drop in accuracy. Inference is relatively slow because these methods artificially split referring expression comprehension into two sequential stages: proposal generation and proposal ranking. This split does not exactly conform to the habits of human cognition. To this end, we propose a novel Real-time Cross-modality Correlation Filtering method (RCCF). RCCF reformulates referring expression comprehension as a correlation filtering process. The expression is first mapped from the language domain to the visual domain and then treated as a template (kernel) to perform correlation filtering on the image feature map. The peak value in the correlation heatmap indicates the center point of the target box. In addition, RCCF regresses a 2-D object size and a 2-D offset. The center point coordinates, object size, and center point offset together form the target bounding box. Our method runs at 40 FPS while achieving leading performance on the RefClef, RefCOCO, RefCOCO+ and RefCOCOg benchmarks. On the challenging RefClef dataset, our method almost doubles the state-of-the-art performance (from 34.70% to 63.79%). We hope this work draws more attention and study to the new cross-modality correlation filtering framework as well as to one-stage frameworks for referring expression comprehension.
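The pipeline described in the abstract (kernel from the expression, correlation over the image feature map, heatmap peak as the center, then size and offset to form the box) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `correlate` and `decode_box`, the tensor layouts, the zero padding, the `stride` parameter, and the convention that sizes are given in input-image pixels are all assumptions made for illustration. The kernel is taken as given here; how the expression is mapped to it is outside this sketch.

```python
import numpy as np


def correlate(feature_map, kernel):
    """Cross-correlate a (C, H, W) feature map with a (C, k, k) kernel.

    Assumes k is odd and zero-pads so the heatmap keeps shape (H, W).
    """
    _, height, width = feature_map.shape
    k = kernel.shape[1]
    pad = k // 2
    padded = np.pad(feature_map, ((0, 0), (pad, pad), (pad, pad)))
    heatmap = np.empty((height, width))
    for y in range(height):
        for x in range(width):
            # Sum over channels and the k x k window centered on (y, x).
            heatmap[y, x] = np.sum(padded[:, y:y + k, x:x + k] * kernel)
    return heatmap


def decode_box(heatmap, size_map, offset_map, stride=4):
    """Turn a heatmap peak plus regressed size and offset into a box.

    Assumed layouts (hypothetical): size_map is (2, H, W) holding width and
    height in input pixels; offset_map is (2, H, W) holding sub-cell
    (dx, dy) offsets; stride is the feature-map downsampling factor.
    """
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = size_map[:, cy, cx]
    dx, dy = offset_map[:, cy, cx]
    center_x = (cx + dx) * stride
    center_y = (cy + dy) * stride
    return (center_x - w / 2, center_y - h / 2,
            center_x + w / 2, center_y + h / 2)
```

Because the box comes from a single peak lookup rather than from ranking a set of proposals, the decoding step is a constant amount of work per image, which is the one-stage property the abstract emphasizes.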

updated: Mon Apr 27 2020 03:50:23 GMT+0000 (UTC)

published: Mon Sep 16 2019 09:01:45 GMT+0000 (UTC)

arXiv
