TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Dailan He; Yusheng Zhao; Junyu Luo; Tianrui Hui; Shaofei Huang; Aixi Zhang; Si Liu

The recently proposed task of fine-grained 3D visual grounding is both essential and challenging: the goal is to identify the 3D object referred to by a natural language sentence among distractor objects of the same category. Existing works usually adopt dynamic graph networks to model intra- and inter-modal interactions indirectly, and their monolithic representations of visual and linguistic content make it difficult for the model to distinguish the referred object from distractors. In this work, we exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data and propose the TransRefer3D network, which extracts entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Using a co-attention operation, the EA module matches visual entity features with linguistic entity features, while the RA module matches pair-wise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that the proposed model outperforms existing approaches by up to 10.6% and sets a new state of the art. To the best of our knowledge, this is the first work to investigate the Transformer architecture for the fine-grained 3D visual grounding task.
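The cross-modal matching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, attends in one direction only (visual features querying language features), and uses a hypothetical difference vector as the pair-wise relation feature. All function names and toy inputs below are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of d-dimensional vectors (e.g. visual features)
    keys, values: lists of d-dimensional vectors (e.g. word features)
    Returns (attended vectors, attention weights).
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def pairwise_relations(objects):
    """Assumed relation feature: difference vector for each ordered pair (i, j), i != j."""
    return [[a - b for a, b in zip(objects[i], objects[j])]
            for i in range(len(objects))
            for j in range(len(objects)) if i != j]

# Toy inputs (illustrative only): 3 objects, 2 entity words, 2 relation words.
visual_entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language_entities = [[1.0, 0.0], [0.0, 1.0]]
language_relations = [[1.0, -1.0], [-1.0, 1.0]]

# Entity-level matching: object features attend over entity-word features.
entity_context, entity_weights = cross_attention(
    visual_entities, language_entities, language_entities)

# Relation-level matching: pair-wise object features attend over relation-word features.
visual_relations = pairwise_relations(visual_entities)
relation_context, relation_weights = cross_attention(
    visual_relations, language_relations, language_relations)
```

In this sketch the entity-level and relation-level steps share the same attention routine and differ only in their inputs, which mirrors the split between entity matching and relation matching at a conceptual level; how the actual EA and RA modules are parameterized and combined inside an ERCB is not specified in the abstract.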

updated: Wed Aug 11 2021 09:25:23 GMT+0000 (UTC)

published: Thu Aug 05 2021 05:47:12 GMT+0000 (UTC)

arXiv
