Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang; Yu-Xiao Guo; Jian-Yu Xiong; Yang Liu; Hao Pan; Peng-Shuai Wang; Xin Tong; Baining Guo

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.

updated: Wed Aug 16 2023 01:53:02 GMT+0000 (UTC)

published: Fri Apr 14 2023 02:49:08 GMT+0000 (UTC)

arXiv
