Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Yuheng Lu; Chenfeng Xu; Xiaobao Wei; Xiaodong Xie; Masayoshi Tomizuka; Kurt Keutzer; Shanghang Zhang

Current point-cloud detection methods have difficulty detecting open-vocabulary objects in the real world because of their limited generalization capability. Moreover, collecting and fully annotating a point-cloud detection dataset with numerous object classes is extremely laborious and expensive. As a result, existing point-cloud datasets cover only a limited set of classes, which prevents models from learning the general representations needed for open-vocabulary point-cloud detection. As far as we know, we are the first to study the problem of open-vocabulary 3D point-cloud detection. Instead of seeking a fully labeled point-cloud dataset, we resort to ImageNet1K to broaden the vocabulary of the point-cloud detector. We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision. Specifically, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo labels for unseen classes. We then propose a novel debiased cross-modal contrastive learning method to transfer knowledge from the image modality to the point-cloud modality during training. OV-3DETIC enables the point-cloud detector to perform open-vocabulary detection without increasing inference latency. Extensive experiments demonstrate that the proposed OV-3DETIC improves mAP by at least 10.77% and 9.56% (absolute) over a wide range of baselines on the SUN-RGBD and ScanNet datasets, respectively. In addition, we conduct further experiments to shed light on why the proposed OV-3DETIC works.
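
To make the idea of cross-modal contrastive learning more concrete, the following is a minimal sketch of a generic image-to-point InfoNCE loss in plain Python. It is not the loss used in OV-3DETIC, since the abstract does not specify its formulation. The function name `cross_modal_info_nce`, the `temperature` value, and the optional `labels` argument are all hypothetical. In particular, dropping same-label candidates from the negatives is only one common way to reduce false-negative bias and is assumed here purely for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_modal_info_nce(image_emb, point_emb, labels=None, temperature=0.1):
    """Generic image-to-point InfoNCE loss over paired embeddings.

    image_emb[i] and point_emb[i] are assumed to describe the same object.
    If labels are given, point embeddings that share the anchor's label are
    dropped from the negatives (a hypothetical debiasing step, not taken
    from the paper).
    """
    img = [_normalize(v) for v in image_emb]
    pts = [_normalize(v) for v in point_emb]
    total = 0.0
    for i, anchor in enumerate(img):
        logits = []
        pos_index = None
        for j, cand in enumerate(pts):
            # Skip likely false negatives: other samples of the same class.
            if j != i and labels is not None and labels[j] == labels[i]:
                continue
            if j == i:
                pos_index = len(logits)
            sim = sum(a * b for a, b in zip(anchor, cand))
            logits.append(sim / temperature)
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[pos_index]
    return total / len(img)


# Toy 2-D embeddings: aligned pairs give a lower loss than swapped pairs.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_info_nce(image_emb, [[1.0, 0.0], [0.0, 1.0]])
swapped = cross_modal_info_nce(image_emb, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each image embedding matches its paired point-cloud embedding and large when the pairs are swapped, which is the signal that pulls the two modalities into a shared space during training.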

updated: Tue Jul 05 2022 12:13:52 GMT+0000 (UTC)

published: Tue Jul 05 2022 12:13:52 GMT+0000 (UTC)

arXiv
