Masked Autoencoder for Self-Supervised Pre-training on Lidar Point Clouds

Georg Hess; Johan Jaxing; Elias Svensson; David Hagerman; Christoffer Petersson; Lennart Svensson

Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and, recently, point clouds. Raw automotive datasets are suitable candidates for self-supervised pre-training because they are generally cheap to collect compared to annotations for tasks like 3D object detection (OD). However, the development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward small, dense point clouds with homogeneous point densities. In this work, we study masked autoencoding for point clouds in an automotive setting, where the point clouds are sparse and the point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code is available at https://github.com/georghess/voxel-mae
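
The pre-training setup described in the abstract (hide some voxels, then ask the model to reconstruct them and to tell empty from non-empty voxels) can be illustrated with a minimal data-preparation sketch. This is not the authors' implementation, which lives in the linked repository. The function names (`voxelize`, `mask_voxels`, `occupancy_labels`), the voxel size, and the mask ratio are all hypothetical choices made for illustration, and no Transformer or loss function is included.

```python
import random
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group (x, y, z) points by the integer index of the voxel that contains them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


def mask_voxels(voxels, mask_ratio, rng):
    """Randomly split non-empty voxels into visible ones and masked ones.

    Visible voxels would be fed to the encoder; masked voxels serve as
    reconstruction targets. The mask ratio is an assumed hyperparameter.
    """
    keys = sorted(voxels)          # sort first so the shuffle is reproducible
    rng.shuffle(keys)
    n_masked = int(round(mask_ratio * len(keys)))
    targets = {k: voxels[k] for k in keys[:n_masked]}
    visible = {k: voxels[k] for k in keys[n_masked:]}
    return visible, targets


def occupancy_labels(voxels, candidate_keys):
    """Label each candidate voxel 1 if it contains points, else 0."""
    return {k: int(k in voxels) for k in candidate_keys}


# Toy scene: five lidar returns, 0.5 m voxels (both values are illustrative).
points = [(0.1, 0.2, 0.0), (0.3, 0.1, 0.2), (1.2, 0.1, 0.0),
          (2.6, 2.6, 0.4), (-0.3, 0.4, 0.1)]
voxels = voxelize(points, voxel_size=0.5)
visible, targets = mask_voxels(voxels, mask_ratio=0.5, rng=random.Random(0))
labels = occupancy_labels(voxels, list(voxels) + [(9, 9, 9)])
```

In this sketch, a model would see only `visible`, be trained to recover the points in `targets`, and be trained to predict `labels` for a set of candidate voxel positions. How the real method selects candidates and which reconstruction targets it uses is not stated in the abstract.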

updated: Thu Mar 09 2023 15:16:24 GMT+0000 (UTC)

published: Fri Jul 01 2022 16:31:45 GMT+0000 (UTC)

arXiv
