Unifying Voxel-based Representation with Transformer for 3D Object Detection

Yanwei Li; Yilun Chen; Xiaojuan Qi; Zeming Li; Jian Sun; Jiaya Jia

In this work, we present UVTR, a unified framework for multi-modality 3D object detection. The proposed method unifies multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, a modality-specific space is first designed to represent different inputs in the voxel feature space. Unlike previous work, our approach preserves the voxel space without height compression, which alleviates semantic ambiguity and enables spatial connections. To make full use of the inputs from different sensors, cross-modality interaction is then proposed, including knowledge transfer and modality fusion. In this way, geometry-aware expressions in point clouds and context-rich features in images are both utilized for better performance and robustness. A transformer decoder is applied to efficiently sample features from the unified space with learnable positions, which facilitates object-level interactions. In general, UVTR presents an early attempt to represent different modalities in a unified framework. It surpasses previous work in both single- and multi-modality entries. The proposed method achieves leading performance on the nuScenes test set for both object detection and the subsequent object tracking task. Code is publicly available at https://github.com/dvlab-research/UVTR.
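
The ideas in the abstract can be illustrated with a toy NumPy sketch. It is not the UVTR implementation: the function names `fuse_voxel_features` and `sample_at_positions` are hypothetical, the fusion is assumed to be a plain element-wise sum, the sampling is a nearest-voxel lookup, and the query positions are fixed arrays rather than learned parameters. The point is only to show why keeping the height axis lets two objects stacked at the same ground location stay distinguishable, whereas collapsing the grid to a bird's-eye view merges them.

```python
import numpy as np


def fuse_voxel_features(image_voxels, lidar_voxels):
    """Fuse two voxel grids of shape (C, Z, Y, X).

    Assumption: an element-wise sum stands in for modality fusion.
    """
    if image_voxels.shape != lidar_voxels.shape:
        raise ValueError("both modalities must share one voxel grid shape")
    return image_voxels + lidar_voxels


def sample_at_positions(voxels, positions):
    """Look up features at query positions by nearest voxel.

    voxels:    array of shape (C, Z, Y, X)
    positions: array of shape (N, 3) holding normalized (x, y, z) in [0, 1]
    returns:   array of shape (N, C)
    """
    _, z, y, x = voxels.shape
    sizes = np.array([x, y, z])
    # Convert normalized coordinates to integer voxel indices.
    idx = np.clip(np.floor(positions * sizes).astype(int), 0, sizes - 1)
    return voxels[:, idx[:, 2], idx[:, 1], idx[:, 0]].T


# Toy grid: 1 channel, 4 height bins, 2 x 2 ground cells.
lidar = np.zeros((1, 4, 2, 2))
image = np.zeros((1, 4, 2, 2))
lidar[0, 0, 1, 1] = 1.0  # feature near the ground
image[0, 3, 1, 1] = 5.0  # feature higher up, same ground cell

unified = fuse_voxel_features(image, lidar)

# Two query positions over the same ground cell at different heights.
queries = np.array([[0.75, 0.75, 0.1],
                    [0.75, 0.75, 0.9]])
per_query = sample_at_positions(unified, queries)

# Height compression (sum over Z) merges both features into one cell.
bev = unified.sum(axis=1)
```

Here `per_query` keeps the two features separate, while `bev[0, 1, 1]` holds their sum, which is the kind of ambiguity the abstract attributes to height compression.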

updated: Thu Oct 13 2022 03:32:33 GMT+0000 (UTC)

published: Wed Jun 01 2022 17:02:40 GMT+0000 (UTC)

arXiv
