InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

Zhihao Yuan; Xu Yan; Yinghong Liao; Ruimao Zhang; Sheng Wang; Zhen Li; Shuguang Cui

Compared with visual grounding on 2D images, natural-language-guided 3D object localization on point clouds is more challenging. In this paper, we propose a new model, named InstanceRefer, that achieves superior 3D visual grounding through a grounding-by-matching strategy. In practice, our model first predicts the target category from the language description using a simple language classification model. Then, based on the category, it selects a small number of instance candidates (usually fewer than 20) from the panoptic segmentation of the point cloud. Thus, the non-trivial 3D visual grounding task is effectively reformulated as a simplified instance-matching problem, given that instance-level candidates are more rational than redundant 3D object proposals. Subsequently, for each candidate, we perform multi-level contextual inference, i.e., referring from instance attribute perception, instance-to-instance relation perception, and instance-to-background global localization perception. Finally, the most relevant candidate is selected and localized by ranking confidence scores, which are obtained through cooperative holistic visual-language feature matching. Experiments confirm that our method outperforms previous state-of-the-art methods on the ScanRefer online benchmark and the Nr3D/Sr3D datasets.
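The grounding-by-matching flow described in the abstract (filter instances by predicted category, score each candidate at several context levels, rank by confidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Instance` fields, the cosine-similarity scoring, and the plain sum used to combine the three levels are all assumptions made for illustration, and the target category is passed in directly as a stand-in for the language classification model.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Instance:
    """A segmented instance with hypothetical per-level visual features."""
    instance_id: int
    category: str
    attribute_feat: Sequence[float]     # instance attribute perception
    relation_feat: Sequence[float]      # instance-to-instance relation perception
    localization_feat: Sequence[float]  # instance-to-background global localization


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(instances: List[Instance], target_category: str) -> List[Instance]:
    """Keep only instances whose category matches the predicted target category."""
    return [inst for inst in instances if inst.category == target_category]


def confidence(inst: Instance, language_feat: Sequence[float]) -> float:
    """Assumed scoring rule: sum the similarity at each of the three levels."""
    return (
        cosine(inst.attribute_feat, language_feat)
        + cosine(inst.relation_feat, language_feat)
        + cosine(inst.localization_feat, language_feat)
    )


def ground(
    instances: List[Instance],
    target_category: str,
    language_feat: Sequence[float],
) -> Optional[Instance]:
    """Return the highest-scoring candidate, or None if no instance matches."""
    candidates = filter_candidates(instances, target_category)
    if not candidates:
        return None
    return max(candidates, key=lambda inst: confidence(inst, language_feat))


if __name__ == "__main__":
    scene = [
        Instance(0, "chair", [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
        Instance(1, "chair", [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
        Instance(2, "table", [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    ]
    best = ground(scene, "chair", [1.0, 0.0])
    print(best.instance_id if best else None)
```

Filtering by category before matching is what keeps the candidate set small: the table in the example scene is never scored, even though its features would match the query as well as the first chair's.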

updated: Thu Jul 29 2021 08:51:14 GMT+0000 (UTC)

published: Mon Mar 01 2021 16:59:27 GMT+0000 (UTC)

arXiv
