X^3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

Marvin Klingner; Shubhankar Borse; Varun Ravi Kumar; Behnaz Rezaei; Venkatraman Narayanan; Senthil Yogamani; Fatih Porikli

Recent advances in 3D object detection (3DOD) have produced remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform because features must be transformed from perspective view (PV) to a 3D world representation, and this view transformation is ambiguous due to missing depth information. This paper introduces X^3KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD. Specifically, we propose cross-task distillation from an instance segmentation teacher (X-IS) in the PV feature extraction stage, which provides supervision without ambiguous error backpropagation through the view transformation. After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features using the information contained in a LiDAR-based 3DOD teacher. Finally, we also employ this teacher for cross-modal output distillation (X-OD), providing dense supervision at the prediction stage. We perform extensive ablations of knowledge distillation at different stages of multi-camera 3DOD. Our final X^3KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets and generalizes to RADAR-based 3DOD. A video of qualitative results is available at https://youtu.be/1do9DPFmr38.
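To make the idea of cross-modal feature distillation more concrete, the following is a minimal sketch of one common way such a loss can be written: a mean squared error between the student's (camera) and the teacher's (LiDAR) feature grids after the view transformation. The abstract does not state which loss X-FD uses, so the choice of mean squared error is an assumption, and the names `feature_distillation_loss`, `camera_bev`, and `lidar_bev` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def feature_distillation_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between two equally shaped 2D feature grids.

    Assumption: MSE stands in for the feature distillation loss; the
    abstract does not specify the loss actually used by X-FD.
    """
    if len(student) != len(teacher):
        raise ValueError("feature grids must have the same number of rows")
    total = 0.0
    count = 0
    for s_row, t_row in zip(student, teacher):
        if len(s_row) != len(t_row):
            raise ValueError("feature grid rows must have the same length")
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("feature grids must not be empty")
    return total / count


# Toy 2x2 grids standing in for camera (student) and LiDAR (teacher) features.
camera_bev = [[0.0, 1.0], [2.0, 3.0]]
lidar_bev = [[0.0, 1.0], [2.0, 5.0]]
loss = feature_distillation_loss(camera_bev, lidar_bev)
```

In a real training setup the grids would be multi-channel tensors, the teacher would be frozen, and this term would be weighted and added to the detection loss; those details are left out of the sketch.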

updated: Fri Mar 03 2023 20:29:49 GMT+0000 (UTC)

published: Fri Mar 03 2023 20:29:49 GMT+0000 (UTC)

arXiv
