Voxel-MAE: Masked Autoencoders for Self-supervised Pre-training Large-scale Point Clouds

Chen Min; Xinli Xu; Dawei Zhao; Liang Xiao; Yiming Nie; Bin Dai

Current perception models in autonomous driving rely heavily on large-scale labeled 3D data, but annotating 3D data is expensive and time-consuming. In this work, we aim to facilitate research on self-supervised learning from the vast amount of unlabeled 3D data in autonomous driving. We introduce Voxel-MAE, a masked autoencoding framework for pre-training on large-scale point clouds. Exploiting the geometric characteristics of large-scale point clouds, we propose a range-aware random masking strategy and a binary voxel classification task. Specifically, we transform point clouds into volumetric representations and randomly mask voxels according to their distance from the capture device. Voxel-MAE reconstructs the occupancy values of the masked voxels, that is, it predicts whether each voxel contains points. This simple binary voxel classification objective encourages Voxel-MAE to reason over high-level semantics in order to recover the masked voxels from only a small number of visible voxels. Extensive experiments demonstrate the effectiveness of Voxel-MAE across several downstream tasks. For 3D object detection, Voxel-MAE halves the labeled data required for car detection on KITTI and improves small-object detection on Waymo by around 2% mAP. For 3D semantic segmentation, Voxel-MAE outperforms training from scratch by around 2% mIoU on nuScenes. For the first time, Voxel-MAE shows that it is feasible to pre-train on unlabeled large-scale point clouds with masked autoencoding to enhance 3D perception for autonomous driving.
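The pipeline described in the abstract (voxelize, mask by range, classify occupancy) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the voxel size, the range bin edges, and the per-bin masking ratios are all hypothetical placeholders, the sensor is assumed to sit at the origin, and no encoder or decoder network is included. The loss is the standard binary cross-entropy on logits, which is one natural way to score a binary occupancy prediction.

```python
import numpy as np


def voxelize(points, voxel_size, pc_range):
    """Return the unique integer indices of occupied voxels and the grid shape.

    points: (N, 3) array of x, y, z coordinates.
    pc_range: (x_min, y_min, z_min, x_max, y_max, z_max).
    """
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    size = np.asarray(voxel_size, dtype=np.float64)
    grid_shape = np.round((hi - lo) / size).astype(np.int64)
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = np.floor((points[inside] - lo) / size).astype(np.int64)
    idx = np.minimum(idx, grid_shape - 1)  # guard against float rounding at the upper edge
    return np.unique(idx, axis=0), grid_shape


def range_aware_mask(voxel_idx, voxel_size, pc_range, range_edges, mask_ratios, rng):
    """Randomly mask occupied voxels with a ratio that depends on their range bin.

    range_edges has K entries and mask_ratios has K + 1 entries (both hypothetical).
    Returns a boolean array over occupied voxels; True means "masked".
    """
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    centers = (voxel_idx + 0.5) * np.asarray(voxel_size, dtype=np.float64) + lo
    dist = np.linalg.norm(centers[:, :2], axis=1)  # horizontal distance to the origin
    bins = np.digitize(dist, range_edges)
    masked = np.zeros(len(voxel_idx), dtype=bool)
    for b, ratio in enumerate(mask_ratios):
        members = np.flatnonzero(bins == b)
        n_mask = int(round(ratio * len(members)))
        masked[rng.permutation(members)[:n_mask]] = True
    return masked


def occupancy_target(voxel_idx, grid_shape):
    """Dense binary grid: 1 where a voxel contains at least one point, else 0."""
    occ = np.zeros(tuple(grid_shape), dtype=np.float32)
    occ[tuple(voxel_idx.T)] = 1.0
    return occ


def binary_cross_entropy(logits, target):
    """Numerically stable mean binary cross-entropy computed from raw logits."""
    loss = np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(loss))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pc_range = (-40.0, -40.0, -3.0, 40.0, 40.0, 1.0)
    voxel_size = (0.5, 0.5, 4.0)
    points = rng.uniform(pc_range[:3], pc_range[3:], size=(5000, 3))  # synthetic stand-in for a LiDAR sweep

    vox, shape = voxelize(points, voxel_size, pc_range)
    masked = range_aware_mask(vox, voxel_size, pc_range,
                              range_edges=(15.0, 30.0),
                              mask_ratios=(0.9, 0.7, 0.5),  # illustrative values only
                              rng=rng)
    target = occupancy_target(vox, shape)
    logits = rng.normal(size=target.shape).astype(np.float32)  # stand-in for decoder output
    print("occupied voxels:", len(vox), "masked:", int(masked.sum()))
    print("occupancy BCE:", binary_cross_entropy(logits, target))
```

In a real pre-training setup, only the visible voxels (`vox[~masked]`) would be fed to the encoder, and the random logits above would be replaced by the decoder's occupancy prediction.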

updated: Wed Nov 23 2022 06:15:30 GMT+0000 (UTC)

published: Mon Jun 20 2022 17:15:50 GMT+0000 (UTC)

arXiv
