HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding

Jiaming Chen; Weixin Luo; Xiaolin Wei; Lin Ma; Wei Zhang

This paper tackles an emerging and challenging vision-language task: 3D visual grounding on point clouds. Many recent works benefit from the Transformer and its well-known attention mechanism, leading to a major breakthrough on this task. However, we find that they achieve these results by relying on various forms of pre-training or multi-stage processing. To simplify the pipeline, we carefully investigate 3D visual grounding and summarize three fundamental problems in developing a high-performance end-to-end model for this task. To address these problems, we introduce a novel Hierarchical Attention Model (HAM), which offers multi-granularity representation and efficient augmentation for both the given texts and the multi-modal visual inputs. Extensive experimental results demonstrate the superiority of the proposed HAM model. Specifically, HAM ranks first on the large-scale ScanRefer challenge, outperforming all existing methods by a significant margin. Code will be released after acceptance.

updated: Sun Oct 30 2022 09:22:05 GMT+0000 (UTC)

published: Sat Oct 22 2022 18:02:10 GMT+0000 (UTC)

arXiv
