Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Zhihao Yuan; Xu Yan; Zhuo Li; Xuhao Li; Yao Guo; Shuguang Cui; Zhen Li

Recent progress in 3D scene understanding has explored 3D visual grounding (3DVG), which localizes a target object from a language description. However, existing methods consider only the dependency between the entire sentence and the target object, ignoring fine-grained relationships between contextual phrases and non-target objects. In this paper, we extend 3DVG to a more fine-grained and interpretable task called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases. To tackle this problem, we manually labeled about 227K phrase-level annotations, using a self-developed platform, on 88K sentences from the widely used 3DVG datasets Nr3D, Sr3D, and ScanRefer. Building on these annotations, we extend previous 3DVG methods to the fine-grained phrase-aware scenario through the proposed phrase-object alignment optimization and phrase-specific pre-training, which also boost conventional 3DVG performance. Extensive results confirm significant improvements: with our approach, the previous state-of-the-art method gains 3.9%, 3.5%, and 4.6% in overall accuracy on Nr3D, Sr3D, and ScanRefer, respectively.

updated: Sat May 27 2023 10:03:34 GMT+0000 (UTC)

published: Tue Jul 05 2022 05:50:12 GMT+0000 (UTC)

arXiv
