Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Renrui Zhang; Ziyu Guo; Rongyao Fang; Bin Zhao; Dong Wang; Yu Qiao; Hongsheng Li; Peng Gao


Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder, which downsamples point tokens stage by stage, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. Through multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some fully trained methods. When fine-tuned on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, a +3.36% improvement over the second-best method, and the hierarchical pre-training scheme substantially benefits few-shot classification, part segmentation, and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.
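
The idea of consistent visible regions across scales can be illustrated with a minimal NumPy sketch. It is not the implementation from the repository above, only one plausible reading of the abstract: tokens are masked at random at the coarsest scale, and the visible set is then back-projected to finer scales through the neighbor indices used for downsampling. The function names, the token counts per scale, the neighborhood size `k`, and the mask ratio are all assumptions chosen for illustration.

```python
import numpy as np


def farthest_point_sample(points, m, rng):
    """Pick m indices that are spread out over the points (simple O(n*m) FPS)."""
    n = points.shape[0]
    chosen = np.empty(m, dtype=np.int64)
    chosen[0] = rng.integers(n)
    dist = np.full(n, np.inf)
    for i in range(1, m):
        d = np.sum((points - points[chosen[i - 1]]) ** 2, axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = np.argmax(dist)
    return chosen


def knn_indices(centers, points, k):
    """For each center, return the indices of its k nearest points."""
    d = np.sum((centers[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.argsort(d, axis=1)[:, :k]


def multiscale_visible_masks(points, sizes=(64, 32, 16), k=4,
                             mask_ratio=0.75, seed=0):
    """Return one boolean 'visible' mask per scale, finest first.

    groups[s][i] holds the indices (into the previous scale, or into the raw
    points for s == 0) that were grouped into token i of scale s.
    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    coords, groups = points, []
    for m in sizes:
        idx = farthest_point_sample(coords, m, rng)
        centers = coords[idx]
        groups.append(knn_indices(centers, coords, k))
        coords = centers

    # Random masking is applied only at the coarsest scale.
    n_top = sizes[-1]
    n_visible = max(1, int(round(n_top * (1.0 - mask_ratio))))
    visible = np.zeros(n_top, dtype=bool)
    visible[rng.choice(n_top, n_visible, replace=False)] = True

    # Back-project: a finer token is visible iff it was grouped into
    # at least one visible token of the next coarser scale.
    masks = [visible]
    for s in range(len(sizes) - 1, 0, -1):
        finer = np.zeros(sizes[s - 1], dtype=bool)
        finer[groups[s][masks[0]].ravel()] = True
        masks.insert(0, finer)
    return masks, groups
```

Calling `multiscale_visible_masks(np.random.default_rng(1).normal(size=(256, 3)))` returns three masks of lengths 64, 32, and 16. Because the finer masks are derived from the coarsest one, the encoder at every stage would see the same spatial regions, which is the consistency property the abstract describes.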

updated: Thu Oct 13 2022 18:02:57 GMT+0000 (UTC)

published: Sat May 28 2022 11:22:53 GMT+0000 (UTC)

arXiv

