3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

Jongwoo Park; Apoorv Singh; Varun Bankiti

3D visual perception tasks based on multi-camera images are essential for autonomous driving systems. Recent work in this field performs 3D object detection by taking multi-view images as input and iteratively enhancing object queries (object proposals) through cross-attention to multi-view features. However, the individual backbone features are not updated with multi-view information; they remain a mere collection of outputs from the single-image backbone network. We therefore propose 3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection, in which we update both the multi-view features and the query features to enhance the representation of the scene in both a fine panoramic view and a coarse global view. First, we update the multi-view features by self-attention along the multi-view axis. This incorporates panoramic information into the multi-view features and improves understanding of the global scene. Second, we update the multi-view features by self-attention over ROI (Region of Interest) windows, which encodes finer local details in the features. This helps exchange information not only along the multi-view axis but also along the other spatial dimensions. Lastly, we exploit the fact that queries can be represented in multiple domains to further boost performance: we use sparse floating queries together with dense BEV (Bird's Eye View) queries, and the resulting detections are later post-processed to filter duplicates. We show performance improvements over our baselines on the nuScenes benchmark dataset.
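The idea of self-attention along the multi-view axis can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the layer, so the function name `view_axis_self_attention`, the nested-list feature layout, the single attention head, and the use of identity query/key/value projections are all assumptions made for illustration. A practical layer would typically add learned projections, multiple heads, and a residual connection.

```python
import math


def view_axis_self_attention(feats):
    """Illustrative single-head self-attention along the camera-view axis.

    feats[v][p] is a feature vector (list of floats) for view v at spatial
    position p. For each position p, the tokens are the feature vectors of
    all views at that position, so information is exchanged across views
    only. Identity Q/K/V projections are an assumption of this sketch.
    """
    num_views = len(feats)
    num_pos = len(feats[0])
    dim = len(feats[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = [[None] * num_pos for _ in range(num_views)]
    for p in range(num_pos):
        # One token per view at this spatial position.
        tokens = [feats[v][p] for v in range(num_views)]
        for v, query in enumerate(tokens):
            # Scaled dot-product scores between this view and every view.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in tokens]
            # Numerically stable softmax over the view axis.
            max_score = max(scores)
            weights = [math.exp(s - max_score) for s in scores]
            total = sum(weights)
            weights = [w / total for w in weights]
            # Output is the attention-weighted average of the view tokens.
            out[v][p] = [sum(w * tok[c] for w, tok in zip(weights, tokens))
                         for c in range(dim)]
    return out


# Toy input: 3 views, 2 spatial positions, 2 channels.
feats = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
    [[1.0, 1.0], [0.5, 0.5]],
]
updated = view_axis_self_attention(feats)
```

After the call, each view's feature at a given position is a convex combination of the features of all views at that position, which is one simple way in which panoramic context could enter per-view backbone features.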

updated: Tue Mar 07 2023 14:59:28 GMT+0000 (UTC)

published: Thu Feb 16 2023 11:28:30 GMT+0000 (UTC)

arXiv
