M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

Enze Xie; Zhiding Yu; Daquan Zhou; Jonah Philion; Anima Anandkumar; Sanja Fidler; Ping Luo; Jose M. Alvarez

In this paper, we propose M^2BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird's-Eye View (BEV) space from multi-camera image inputs. Unlike most previous works, which process detection and segmentation separately, M^2BEV infers both tasks with a single unified model and improves efficiency. M^2BEV efficiently transforms multi-view 2D image features into a 3D BEV feature in ego-car coordinates. This BEV representation is important because it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of the voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes to anchors; (3) a BEV centerness re-weighting that assigns larger weights to more distant predictions; and (4) large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit the ill-posed camera-based 3D perception tasks, where depth information is missing. M^2BEV is memory efficient, allowing significantly higher-resolution images as input, with faster inference speed. Experiments on nuScenes show that M^2BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU on these two tasks, respectively.
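Design (3) can be illustrated with a minimal sketch. The abstract says only that more distant predictions receive larger weights; it gives no formula. The linear, normalized-distance weight below, its range of [1, 2], and the names `bev_distance_weights` and `reweighted_loss` are therefore assumptions made for illustration, not the paper's definition.

```python
import math


def bev_distance_weights(xs, ys):
    """Return a grid of weights that grow with distance from the ego car.

    Assumption (not from the abstract): the ego car sits at (0, 0) and the
    weight is 1 + d / d_max, so it ranges from 1 (at the ego) to 2 (farthest cell).
    """
    dists = [[math.hypot(x, y) for x in xs] for y in ys]
    max_dist = max(max(row) for row in dists)
    if max_dist == 0.0:
        # Degenerate grid containing only the ego position.
        return [[1.0 for _ in row] for row in dists]
    return [[1.0 + d / max_dist for d in row] for row in dists]


def reweighted_loss(per_cell_loss, weights):
    """Average a per-cell loss map after multiplying by the distance weights."""
    total = 0.0
    count = 0
    for loss_row, weight_row in zip(per_cell_loss, weights):
        for loss, weight in zip(loss_row, weight_row):
            total += loss * weight
            count += 1
    return total / count


# Hypothetical 3x3 BEV grid spanning 50 m around the ego car.
coords = [-50.0, 0.0, 50.0]
weights = bev_distance_weights(coords, coords)
uniform_loss = [[1.0] * 3 for _ in range(3)]
mean_loss = reweighted_loss(uniform_loss, weights)
```

With a uniform per-cell loss, the re-weighted mean is larger than the unweighted mean, because cells far from the ego car contribute more. This is the intended effect: far-away regions, which are harder for camera-based models, get more emphasis during training.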

updated: Tue Apr 19 2022 05:40:19 GMT+0000 (UTC)

published: Mon Apr 11 2022 13:43:25 GMT+0000 (UTC)

arXiv
