CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection

Ching-Yu Tseng; Yi-Rong Chen; Hsin-Ying Lee; Tsung-Han Wu; Wen-Chin Chen; Winston H. Hsu

To achieve accurate 3D object detection at low cost for autonomous driving, many multi-camera methods have been proposed, solving the occlusion problem of monocular approaches. However, lacking accurate depth estimates, existing multi-camera methods often generate multiple bounding boxes along a ray in the depth direction for difficult small objects such as pedestrians, resulting in extremely low recall. Furthermore, directly applying depth prediction modules to existing multi-camera methods, which generally consist of large network architectures, cannot meet the real-time requirements of self-driving applications. To address these issues, we propose Cross-view and Depth-guided Transformers for 3D Object Detection (CrossDTR). First, our lightweight depth predictor is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings without extra depth datasets during supervision. Second, a cross-view depth-guided transformer fuses the depth embeddings with image features from cameras of different views and generates 3D bounding boxes. Extensive experiments demonstrate that our method surpasses existing multi-camera methods by 10 percent in pedestrian detection and by about 3 percent in overall mAP and NDS metrics. Computational analyses also show that our method is 5 times faster than prior approaches. Our code will be made publicly available at https://github.com/sty61010/CrossDTR.
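
The depth ambiguity mentioned in the abstract can be illustrated with a standard pinhole camera model: a single pixel back-projects to a different 3D point for every candidate depth, and all of those points lie on one ray through the camera center. The sketch below is only an illustration of that geometry, not code from CrossDTR; the function name `backproject`, the intrinsics, and the pixel coordinates are hypothetical values chosen for the example.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) to a 3D camera-frame point at the given depth (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics and pixel location, chosen only for illustration.
fx = fy = 1000.0
cx, cy = 800.0, 450.0
u, v = 1000.0, 500.0

# Without a reliable depth estimate, each candidate depth is equally plausible,
# so the same pixel maps to several 3D locations along one viewing ray.
candidates = [backproject(u, v, d, fx, fy, cx, cy) for d in (10.0, 20.0, 40.0)]
for point in candidates:
    print(point)
```

Every candidate shares the same x/z and y/z ratios, so the points are collinear with the camera center. A detector that cannot tell these depths apart may place a box at more than one of them, which is the failure mode that an explicit depth estimate is meant to suppress.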

updated: Wed Oct 12 2022 05:39:53 GMT+0000 (UTC)

published: Tue Sep 27 2022 16:23:12 GMT+0000 (UTC)

arXiv
