3D Open-vocabulary Segmentation with Foundation Models

Kunhao Liu; Fangneng Zhan; Jiahui Zhang; Muyu Xu; Yingchen Yu; Abdulmotaleb El Saddik; Christian Theobalt; Eric Xing; Shijian Lu

基礎モデルを使用した 3D オープン語彙セグメンテーション

3D シーンのオープンボキャブラリーセグメンテーションは人間の知覚の基本的な機能であり、したがってコンピュータービジョン研究における重要な目的です。ただし、このタスクは、堅牢で一般化可能なモデルをトレーニングするための大規模で多様な 3D オープン語彙セグメンテーションデータセットの欠如によって大きく妨げられています。事前トレーニングされた 2D オープン語彙セグメンテーションモデルから知識を抽出することは役に立ちますが、2D モデルはほとんどが近い語彙データセットで微調整されているため、オープン語彙機能が大幅に損なわれます。私たちは、微調整を必要とせずに、事前トレーニングされた基礎モデル CLIP と DINO のオープン語彙のマルチモーダル知識とオブジェクト推論機能を活用することで、3D オープン語彙セグメンテーションの課題に取り組みます。具体的には、CLIP からのオープンボキャブラリーの視覚的およびテキストの知識を Neural Radiance Field (NeRF) に抽出し、2D 特徴をビュー一貫性のある 3D セグメンテーションに効果的に引き上げます。さらに、関連性-分布アライメント損失と特徴-分布アライメント損失を導入して、それぞれCLIP特徴の曖昧さを緩和し、DINO特徴から正確なオブジェクト境界を抽出し、トレーニング中のセグメンテーションアノテーションの必要性を排除します。広範な実験により、私たちの方法はセグメンテーションアノテーションでトレーニングされた完全教師モデルよりも優れたパフォーマンスを発揮することが示されており、3Dのオープンボキャブラリーセグメンテーションが2D画像およびテキストと画像のペアから効果的に学習できることが示唆されています。

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature significantly as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting the open-vocabulary multimodal knowledge and object reasoning capability of pre-trained foundation models CLIP and DINO, without necessitating any fine-tuning. Specifically, we distill open-vocabulary visual and textual knowledge from CLIP into a neural radiance field (NeRF) which effectively lifts 2D features into view-consistent 3D segmentation. Furthermore, we introduce the Relevancy-Distribution Alignment loss and Feature-Distribution Alignment loss to respectively mitigate the ambiguities of CLIP features and distill precise object boundaries from DINO features, eliminating the need for segmentation annotations during training. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs.

updated: Tue May 23 2023 14:16:49 GMT+0000 (UTC)

published: Tue May 23 2023 14:16:49 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト