3D Point Cloud Pre-training with Knowledge Distillation from 2D Images

Yuan Yao; Yuanhan Zhang; Zhenfei Yin; Jiebo Luo; Wanli Ouyang; Xiaoshui Huang

The recent success of pre-trained 2D vision models is mostly attributable to learning from large-scale datasets. However, compared with 2D image datasets, the pre-training data currently available for 3D point clouds is limited. To overcome this limitation, we propose a knowledge distillation method that lets 3D point cloud pre-trained models acquire knowledge directly from a 2D representation learning model, in particular the image encoder of CLIP, through concept alignment. Specifically, we introduce a cross-attention mechanism to extract concept features from 3D point clouds and compare them with the semantic information from 2D images. In this scheme, the point cloud pre-trained models learn directly from the rich information contained in 2D teacher models. Extensive experiments demonstrate that the proposed knowledge distillation scheme achieves higher accuracy than state-of-the-art 3D pre-training methods on both synthetic and real-world datasets across downstream tasks, including object classification, object detection, semantic segmentation, and part segmentation.
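
The concept-alignment idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the attention layout or the loss. The sketch assumes a set of shared concept queries (`concept_queries`, hypothetical) that attend over student point tokens and over teacher image tokens, and a cosine-distance objective between the two resulting sets of concept features. Random arrays stand in for the outputs of a 3D encoder and of a frozen CLIP image encoder, and learned projection matrices are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, tokens):
    """Single-head cross-attention without learned projections.

    queries: (K, D) concept queries (assumed to be learnable in practice).
    tokens:  (N, D) features to attend over (point or image tokens).
    Returns: (K, D) concept features, one per query.
    """
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)  # (K, N) scaled dot products
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ tokens


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def concept_alignment_loss(student, teacher):
    """Mean cosine distance between matched concept features (an assumed loss)."""
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


# Placeholder data: random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
K, N, M, D = 4, 128, 49, 32                    # concepts, points, patches, width
concept_queries = rng.normal(size=(K, D))      # hypothetical concept queries
point_tokens = rng.normal(size=(N, D))         # stand-in for 3D student features
image_tokens = rng.normal(size=(M, D))         # stand-in for frozen 2D teacher features

student_concepts = cross_attention(concept_queries, point_tokens)
teacher_concepts = cross_attention(concept_queries, image_tokens)
loss = concept_alignment_loss(student_concepts, teacher_concepts)
print(f"concept alignment loss: {loss:.4f}")
```

In an actual training setup under these assumptions, only the student branch and the queries would receive gradients, while the teacher features would be treated as fixed targets.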

updated: Sat Dec 17 2022 23:21:04 GMT+0000 (UTC)

published: Sat Dec 17 2022 23:21:04 GMT+0000 (UTC)

arXiv
