Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Zhihao Yuan; Xu Yan; Zhuo Li; Xuhao Li; Yao Guo; Shuguang Cui; Zhen Li

Recent progress in 3D scene understanding has explored 3D visual grounding (3DVG), which localizes a target object from a language description. However, existing methods only consider the dependency between the entire sentence and the target object, and thus ignore the fine-grained relationships between contextual phrases and non-target objects. In this paper, we extend 3DVG to a more reliable and explainable task called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target object in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases. To tackle this problem, we label about 400K phrase-level annotations from 170K sentences in the available 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. Building on these annotated datasets, we propose a novel framework, PhraseRefer, which performs phrase-aware and object-level representation learning through phrase-object alignment optimization as well as phrase-specific pre-training. In our setting, we extend previous 3DVG methods to the phrase-aware scenario and provide metrics to measure the explainability of the 3DPAG task. Extensive results confirm that 3DPAG effectively boosts 3DVG performance, and that PhraseRefer achieves state-of-the-art results on all three datasets, with 63.0%, 54.4% and 55.5% overall accuracy on Sr3D, Nr3D and ScanRefer, respectively.
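
To make the phrase-level setting concrete, the sketch below shows one way a phrase annotation and a simple phrase-grounding score could be represented. It is only an illustration: the `PhraseAnnotation` fields, the `phrase_grounding_accuracy` function, the example sentence and the object ids are all hypothetical and are not taken from the paper, its datasets or its metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseAnnotation:
    """Hypothetical phrase-level label: a text span tied to one scene object."""
    start: int       # character offset where the phrase begins
    end: int         # character offset one past the end of the phrase
    object_id: int   # id of the scene object the phrase refers to (made up)
    is_target: bool  # True if this object is the grounding target


def phrase_grounding_accuracy(annotations, predictions):
    """Fraction of annotated phrases whose predicted object id is correct.

    `predictions` maps a (start, end) span to a predicted object id.
    This is an illustrative score, not the metric defined in the paper.
    """
    if not annotations:
        return 0.0
    correct = sum(
        1
        for ann in annotations
        if predictions.get((ann.start, ann.end)) == ann.object_id
    )
    return correct / len(annotations)


# Made-up example: one target phrase and one context phrase.
sentence = "the chair next to the table"
annotations = [
    PhraseAnnotation(0, 9, object_id=3, is_target=True),     # "the chair"
    PhraseAnnotation(18, 27, object_id=7, is_target=False),  # "the table"
]
# A model that grounds the target correctly but misses the context object.
predictions = {(0, 9): 3, (18, 27): 5}

score = phrase_grounding_accuracy(annotations, predictions)
```

In this toy case a sentence-level evaluation would count the prediction as correct because the target is found, while the phrase-level score exposes that the context phrase was grounded to the wrong object.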

updated: Tue Jul 05 2022 05:50:12 GMT+0000 (UTC)

published: Tue Jul 05 2022 05:50:12 GMT+0000 (UTC)

arXiv

