Towards Label-free Scene Understanding by Vision Foundation Models

Runnan Chen; Youquan Liu; Lingdong Kong; Nenglun Chen; Xinge Zhu; Yuexin Ma; Tongliang Liu; Wenping Wang

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, combining CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models to enable networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks with extremely noisy pseudo labels, which are generated by CLIP and degrade further during propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train the 2D and 3D networks, then further enforce consistency between the networks' latent spaces using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving by 4.7% and 7.9%, respectively. On the nuScenes dataset, our method reaches 26.8%, an improvement of 6%. Code will be released (https://github.com/runnanchen/Label-Free-Scene-Understanding).
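The abstract does not spell out the form of the prediction consistency regularization, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each 3D point is paired with one 2D pixel and that both networks output per-class logits, and it penalizes disagreement with a symmetric KL divergence. The function names (`softmax`, `kl_divergence`, `prediction_consistency_loss`) and the choice of symmetric KL are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of class logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def prediction_consistency_loss(logits_2d, logits_3d):
    """Mean symmetric KL over paired pixel/point predictions.

    logits_2d[i] and logits_3d[i] are assumed (hypothetically) to be the
    class logits of the 2D network at a pixel and of the 3D network at the
    point that projects onto that pixel.
    """
    if not logits_2d or len(logits_2d) != len(logits_3d):
        raise ValueError("expected the same non-zero number of pixel/point pairs")
    total = 0.0
    for l2, l3 in zip(logits_2d, logits_3d):
        p2, p3 = softmax(l2), softmax(l3)
        total += 0.5 * (kl_divergence(p2, p3) + kl_divergence(p3, p2))
    return total / len(logits_2d)


if __name__ == "__main__":
    agree = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]]
    disagree = [[-1.0, 0.1, 2.0], [3.0, 0.0, 0.5]]
    print(prediction_consistency_loss(agree, agree))     # zero when predictions match
    print(prediction_consistency_loss(agree, disagree))  # positive when they differ
```

In a sketch like this, the loss is zero when the two networks agree on every paired pixel and point and grows as their class distributions diverge, which is the behaviour a co-training regularizer of this kind would need.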

updated: Tue Jun 06 2023 17:57:49 GMT+0000 (UTC)

published: Tue Jun 06 2023 17:57:49 GMT+0000 (UTC)

arXiv
