PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

Xiangyang Zhu; Renrui Zhang; Bowei He; Ziyu Guo; Ziyao Zeng; Zipeng Qin; Shanghang Zhang; Peng Gao

Large-scale pre-trained models have shown promising open-world performance on both vision and language tasks. However, their transfer to 3D point clouds remains limited and has been restricted to the classification task. In this paper, we first combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with pre-trained language knowledge, PointCLIP V2 contains two key designs. On the visual side, we prompt CLIP via a shape projection module that generates more realistic depth maps, narrowing the domain gap between projected point clouds and natural images. On the textual side, we prompt the GPT model to generate 3D-specific text as input to CLIP's textual encoder. Without any training in 3D domains, our approach surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. In addition, V2 can be extended in a simple manner to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection, demonstrating its generalization ability for unified 3D open-world learning.
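The pipeline described in the abstract (project a point cloud to depth maps, encode them, and match against text features) can be illustrated with a minimal NumPy sketch. This is not the paper's shape projection module or its GPT prompting: the function names `project_to_depth_map` and `zero_shot_classify`, the orthographic projection, the nearest-point-per-pixel rule, and the sum over views are all assumptions made for illustration, and the feature vectors stand in for the outputs of CLIP's encoders.

```python
import numpy as np

def project_to_depth_map(points, resolution=32):
    """Orthographically project an (N, 3) point cloud onto the x-y plane.

    Hypothetical stand-in for a projection step; returns a
    (resolution, resolution) map where brighter means smaller z.
    """
    points = np.asarray(points, dtype=np.float64)
    # Normalize each axis to [0, 1] using the cloud's bounding box.
    mins = points.min(axis=0)
    span = points.max(axis=0) - mins
    span[span == 0] = 1.0  # avoid division by zero for flat clouds
    unit = (points - mins) / span
    # Map x to columns and y to rows, clamping the upper edge into range.
    cols = np.minimum((unit[:, 0] * resolution).astype(int), resolution - 1)
    rows = np.minimum((unit[:, 1] * resolution).astype(int), resolution - 1)
    # Closeness in [0, 1]; empty pixels stay at 0.
    closeness = 1.0 - unit[:, 2]
    depth = np.zeros((resolution, resolution))
    # Keep the largest closeness (nearest point) that lands in each pixel.
    np.maximum.at(depth, (rows, cols), closeness)
    return depth

def zero_shot_classify(view_features, text_features):
    """Pick the class whose text feature best matches the view features.

    view_features: (views, dim) array, one row per projected view.
    text_features: (classes, dim) array, one row per class prompt.
    Summing logits over views is an assumption of this sketch.
    """
    v = view_features / np.linalg.norm(view_features, axis=1, keepdims=True)
    t = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    logits = v @ t.T  # cosine similarities, shape (views, classes)
    return int(logits.sum(axis=0).argmax())
```

In practice the depth maps would be passed through CLIP's visual encoder and the class prompts through its textual encoder; here any arrays of matching dimension can be used to exercise the matching step.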

updated: Sat Aug 26 2023 16:14:09 GMT+0000 (UTC)

published: Mon Nov 21 2022 17:52:43 GMT+0000 (UTC)

arXiv
