Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang; Yu-Xiao Guo; Jian-Yu Xiong; Yang Liu; Hao Pan; Peng-Shuai Wang; Xin Tong; Baining Guo

Pretrained backbones with fine-tuning have been widely adopted in 2D vision and natural language processing tasks and have demonstrated significant advantages over task-specific networks. In this paper, we present a pretrained 3D backbone, named Swin3D, which is the first to outperform all state-of-the-art methods in downstream 3D indoor scene understanding tasks. Our backbone network is based on a 3D Swin transformer and is carefully designed to conduct self-attention efficiently on sparse voxels with linear memory complexity and to capture the irregularity of point signals via generalized contextual relative positional embedding. Based on this backbone design, we pretrained a large Swin3D model on the synthetic Structured3D dataset, which is 10 times larger than the ScanNet dataset, and fine-tuned the pretrained model on various downstream real-world indoor scene understanding tasks. The results demonstrate that our model pretrained on the synthetic dataset not only exhibits good generality in both downstream segmentation and detection on real 3D point datasets, but also surpasses the state-of-the-art methods on downstream tasks after fine-tuning: +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +2.1 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. Our method demonstrates the great potential of pretrained 3D backbones with fine-tuning for 3D understanding tasks. The code and models are available at https://github.com/microsoft/Swin3D.
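For readers unfamiliar with the segmentation metric quoted above, mIoU (mean intersection-over-union) averages, over classes, the ratio of correctly labeled points to the union of predicted and ground-truth points for that class. The following is a minimal sketch of that standard definition only. It is not taken from the Swin3D code base: the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and benchmark protocols typically accumulate counts over a whole dataset rather than a single scene.

```python
from collections import Counter


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes (illustrative helper, hypothetical name).

    pred, target: equal-length sequences of integer class labels, one per point.
    Classes that appear in neither pred nor target are skipped (an assumption).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    pred_count = Counter(pred)      # points predicted as each class
    target_count = Counter(target)  # points labeled as each class
    inter = Counter()               # points where prediction and label agree
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(inter[c] / union)

    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    # Toy labels for five points and three classes.
    print(mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3))
```

Under this definition, a reported gain such as +2.3 mIoU is a difference of 2.3 percentage points in this class-averaged ratio.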

updated: Mon Apr 24 2023 02:46:34 GMT+0000 (UTC)

published: Fri Apr 14 2023 02:49:08 GMT+0000 (UTC)

arXiv
