CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

Jun Cen; Shiwei Zhang; Yixuan Pei; Kun Li; Hang Zheng; Maochun Luo; Yingya Zhang; Qifeng Chen

CMDFusion: LIDAR セマンティックセグメンテーションのためのクロスモダリティ知識蒸留を備えた双方向融合ネットワーク

2D RGB 画像と 3D LIDAR 点群は、自動運転車の認識システムに補完的な知識を提供します。 LIDAR セマンティックセグメンテーションタスクのために、いくつかの 2D および 3D 融合手法が検討されていますが、それらにはさまざまな問題があります。 2D から 3D の融合手法では、推論中に厳密にペアになったデータが必要ですが、現実のシナリオでは利用できない可能性があります。一方、3D から 2D の融合手法では、2D 情報を明示的に最大限に活用することはできません。したがって、この研究では、クロスモダリティ知識蒸留を備えた双方向融合ネットワーク (CMDFusion) を提案します。私たちの方法には 2 つの貢献があります。まず、私たちの双方向融合スキームは、それぞれ 2D から 3D の融合および 3D から 2D の融合を通じて 3D 機能を明示的および暗黙的に強化し、単一融合スキームのいずれかを上回ります。次に、2D ネットワーク (カメラブランチ) から 3D ネットワーク (2D ナレッジブランチ) に 2D 知識を抽出して、3D ネットワークがカメラの FOV (視野) 内にない点についても 2D 情報を生成できるようにします。このようにして、2D 知識ブランチが 3D LIDAR 入力に従って 2D 情報を提供するため、推論中に RGB 画像は必要なくなります。 CMDFusion が、SemanticKITTI および nuScenes データセットに対するすべてのフュージョンベースのメソッドの中で最高のパフォーマンスを達成することを示します。コードは https://github.com/Jun-CEN/CMDFusion でリリースされます。

2D RGB images and 3D LIDAR point clouds provide complementary knowledge for the perception system of autonomous vehicles. Several 2D and 3D fusion methods have been explored for the LIDAR semantic segmentation task, but they suffer from different problems. 2D-to-3D fusion methods require strictly paired data during inference, which may not be available in real-world scenarios, while 3D-to-2D fusion methods cannot explicitly make full use of the 2D information. Therefore, we propose a Bidirectional Fusion Network with Cross-Modality Knowledge Distillation (CMDFusion) in this work. Our method has two contributions. First, our bidirectional fusion scheme explicitly and implicitly enhances the 3D feature via 2D-to-3D fusion and 3D-to-2D fusion, respectively, which surpasses either one of the single fusion schemes. Second, we distillate the 2D knowledge from a 2D network (Camera branch) to a 3D network (2D knowledge branch) so that the 3D network can generate 2D information even for those points not in the FOV (field of view) of the camera. In this way, RGB images are not required during inference anymore since the 2D knowledge branch provides 2D information according to the 3D LIDAR input. We show that our CMDFusion achieves the best performance among all fusion-based methods on SemanticKITTI and nuScenes datasets. The code will be released at https://github.com/Jun-CEN/CMDFusion.

updated: Sun Jul 09 2023 04:24:12 GMT+0000 (UTC)

published: Sun Jul 09 2023 04:24:12 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト