ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance

Zoey Guo; Yiwen Tang; Ray Zhang; Dong Wang; Zhigang Wang; Bin Zhao; Xuelong Li

Understanding 3D scenes from multi-view inputs has been shown to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods typically neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework in two ways: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best method by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer, respectively.
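
To make the idea of weighing views during the final prediction more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each view has a prototype vector, that a view's importance is a softmax over the dot product between its prototype and a text feature, and that per-view object scores are fused as a weighted sum. The function names, the similarity measure, and the fusion rule are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def view_guided_scores(per_view_scores, prototypes, text_feature):
    """Fuse per-view object scores using prototype-derived view weights.

    per_view_scores[v][o] is the score of candidate object o under view v.
    prototypes[v] is a (hypothetical) learned vector for view v.
    text_feature is a (hypothetical) embedding of the grounding text.
    """
    # Assumption: view importance = softmax(prototype . text_feature).
    view_weights = softmax([dot(p, text_feature) for p in prototypes])
    num_objects = len(per_view_scores[0])
    # Assumption: final score = weighted sum of per-view scores.
    fused = [
        sum(w * scores[o] for w, scores in zip(view_weights, per_view_scores))
        for o in range(num_objects)
    ]
    return fused, view_weights


def predict(fused_scores):
    """Return the index of the highest-scoring candidate object."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Two views, two candidate objects; the views disagree on the target.
scores = [[0.1, 0.9], [0.8, 0.2]]
protos = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = view_guided_scores(scores, protos, [0.0, 10.0])
target = predict(fused)
```

In this toy example the text feature aligns with the second view's prototype, so that view dominates the fused score and its preferred object is selected. In a trained model the prototypes and text feature would be learned rather than set by hand.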

updated: Thu Aug 24 2023 18:45:43 GMT+0000 (UTC)

published: Wed Mar 29 2023 17:59:10 GMT+0000 (UTC)

arXiv
