CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

Junbo Zhang; Runpei Dong; Kaisheng Ma

Training a 3D scene understanding model requires complicated human annotations, which are laborious to collect and yield models that encode only closed-set object semantics. In contrast, vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties. We therefore propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision. We first modify CLIP's input and forwarding process so that it can extract dense pixel features for 3D scene contents. We then project multi-view image features onto the point cloud and train a 3D scene understanding model with feature distillation. Without any annotations or additional training, our model achieves promising annotation-free semantic segmentation results on open-vocabulary semantics and long-tailed concepts. In addition, serving as a cross-modal pre-training framework, our method can be used to improve data efficiency during fine-tuning. Our model outperforms previous state-of-the-art (SOTA) methods on various zero-shot and data-efficient learning benchmarks. Most importantly, our model successfully inherits CLIP's richly structured knowledge, allowing 3D scene understanding models to recognize not only object concepts but also open-world semantics.
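The projection and distillation steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera model, nearest-pixel lookup, no occlusion test, simple averaging across views, and a cosine distillation loss; the abstract specifies none of these. All function names (`project_features`, `fuse_views`, `cosine_distillation_loss`) are hypothetical, and `feat_map` stands in for dense CLIP pixel features that would be computed elsewhere.

```python
import numpy as np

def project_features(points, feat_map, K, world_to_cam):
    """Gather one pixel feature per 3D point from a single view.

    points: (N, 3) world coordinates; feat_map: (H, W, C) dense 2D features;
    K: (3, 3) pinhole intrinsics (assumed); world_to_cam: (4, 4) extrinsics.
    Returns (N, C) features and an (N,) boolean visibility mask.
    Occlusion is ignored here; a real pipeline would also check depth.
    """
    H, W, C = feat_map.shape
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]      # points in camera frame
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)         # avoid dividing by z <= 0
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / safe_z).astype(int)  # pixel column
    v = np.round(uv[:, 1] / safe_z).astype(int)  # pixel row
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = np.zeros((len(points), C))
    feats[visible] = feat_map[v[visible], u[visible]]
    return feats, visible

def fuse_views(points, views):
    """Average per-point features over every view that sees the point."""
    n = len(points)
    c = views[0][0].shape[2]
    total = np.zeros((n, c))
    count = np.zeros(n)
    for feat_map, K, world_to_cam in views:
        feats, visible = project_features(points, feat_map, K, world_to_cam)
        total += feats      # invisible points contribute zeros
        count += visible
    seen = count > 0
    total[seen] /= count[seen, None]
    return total, seen

def cosine_distillation_loss(student, target, mask):
    """Mean (1 - cosine similarity) over points that have a 2D target."""
    s, t = student[mask], target[mask]
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-8)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

In such a setup, the fused per-point features would act as fixed targets, and the 3D network's per-point outputs would be passed as `student`. Points never seen by any view are excluded from the loss through the mask.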

updated: Wed Mar 08 2023 17:30:58 GMT+0000 (UTC)

published: Wed Mar 08 2023 17:30:58 GMT+0000 (UTC)

arXiv
