Unifying Voxel-based Representation with Transformer for 3D Object Detection

Yanwei Li; Yilun Chen; Xiaojuan Qi; Zeming Li; Jian Sun; Jiaya Jia

In this work, we present UVTR, a unified framework for multi-modality 3D object detection. The proposed method unifies multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, modality-specific spaces are first designed to represent the different inputs in the voxel feature space. Unlike previous work, our approach preserves the voxel space without height compression, which alleviates semantic ambiguity and enables spatial interactions. Building on this unified representation, cross-modality interaction, including knowledge transfer and modality fusion, is then proposed to make full use of the inherent properties of the different sensors. In this way, geometry-aware expressions in point clouds and context-rich features in images are both exploited for better performance and robustness. A transformer decoder is applied to efficiently sample features from the unified space with learnable positions, which facilitates object-level interactions. Overall, UVTR is an early attempt to represent different modalities in a unified framework. It surpasses previous work in single- and multi-modality entries and achieves leading performance on the nuScenes test set, with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively. Code is available at https://github.com/dvlab-research/UVTR.
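As a rough illustration of two ideas from the abstract, the sketch below fuses two voxel grids that share one coordinate frame and then samples the fused grid at continuous query positions. It is not the authors' implementation: the element-wise sum used for fusion, the trilinear interpolation used for sampling, the scalar features, and all function names (`make_grid`, `fuse_grids`, `sample_trilinear`) are assumptions made for this example. The query positions, which the abstract describes as learnable, are fixed by hand here.

```python
import math


def make_grid(x_size, y_size, z_size, fn):
    """Build a dense voxel grid, indexed grid[x][y][z], with one scalar per voxel."""
    return [[[fn(x, y, z) for z in range(z_size)]
             for y in range(y_size)]
            for x in range(x_size)]


def fuse_grids(grid_a, grid_b):
    """Fuse two same-shaped voxel grids by element-wise sum (assumed fusion rule)."""
    return [[[a + b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(plane_a, plane_b)]
            for plane_a, plane_b in zip(grid_a, grid_b)]


def sample_trilinear(grid, position):
    """Sample the grid at a continuous (x, y, z) position in normalized [0, 1] coords."""
    sizes = (len(grid), len(grid[0]), len(grid[0][0]))
    lows, fracs = [], []
    for p, size in zip(position, sizes):
        coord = min(max(p, 0.0), 1.0) * (size - 1)   # clamp, then scale to index space
        low = min(int(math.floor(coord)), max(size - 2, 0))
        lows.append(low)
        fracs.append(coord - low)

    value = 0.0
    # Blend the eight voxels surrounding the position; height (z) is kept, not collapsed.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = 1.0
                for offset, frac in zip((dx, dy, dz), fracs):
                    weight *= frac if offset else 1.0 - frac
                ix = min(lows[0] + dx, sizes[0] - 1)
                iy = min(lows[1] + dy, sizes[1] - 1)
                iz = min(lows[2] + dz, sizes[2] - 1)
                value += weight * grid[ix][iy][iz]
    return value


# Stand-ins for LiDAR and camera features that were already lifted into the same voxel space.
lidar_voxels = make_grid(3, 3, 3, lambda x, y, z: float(x + 2 * y + 3 * z))
camera_voxels = make_grid(3, 3, 3, lambda x, y, z: 1.0)
fused_voxels = fuse_grids(lidar_voxels, camera_voxels)

# Hypothetical query reference positions (these would be learned in a real decoder).
query_positions = [(0.5, 0.5, 0.5), (0.25, 0.75, 0.5)]
sampled = [sample_trilinear(fused_voxels, pos) for pos in query_positions]
```

Because both grids keep the full height axis, a query can read features at a specific 3D location rather than from a flattened bird's-eye-view cell, which is the property the abstract credits with reducing semantic ambiguity.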

updated: Wed Jun 01 2022 17:02:40 GMT+0000 (UTC)

published: Wed Jun 01 2022 17:02:40 GMT+0000 (UTC)

arXiv
