Improving Standard Transformer Models for 3D Point Cloud Understanding with Image Pretraining

Guocheng Qian; Xingdi Zhang; Abdullah Hamdi; Bernard Ghanem

While Standard Transformer (ST) models have achieved impressive success in natural language processing and computer vision, their performance on 3D point clouds is relatively poor. This is mainly due to a limitation of Transformers: they demand large amounts of training data. Unfortunately, large datasets are scarce in the realm of 3D point clouds, which exacerbates the difficulty of training ST models for 3D tasks. In this work, we propose two contributions to improve ST models on point clouds. First, we contribute a new ST-based point cloud network that uses Progressive Point Patch Embedding as the tokenizer and Feature Propagation with global representation appending as the decoder. Our network is shown to be less data-hungry and enables ST to achieve performance comparable to the state of the art. Second, we formulate a simple yet effective pipeline dubbed Pix4Point, which allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding. This is achieved through a modality-agnostic ST backbone with the help of our proposed tokenizer and decoder specialized in the 3D domain. With the backbone pretrained on a large number of widely available images, we observe significant gains of our ST model in the tasks of 3D point cloud classification, part segmentation, and semantic segmentation on the ScanObjectNN, ShapeNetPart, and S3DIS benchmarks, respectively. Our code and models are available at: https://github.com/guochengqian/Pix4Point.
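
The weight-transfer idea behind Pix4Point can be illustrated with a minimal sketch: only the modality-agnostic backbone parameters are copied from an image-pretrained checkpoint, while the 3D-specific tokenizer and decoder keep their fresh initialization. This is not the authors' implementation. The parameter names (`backbone.`, `tokenizer.`, `decoder.`, `patch_embed.`, `head.`), the helper functions, and the use of nested lists in place of tensors are all assumptions made for illustration.

```python
def shape_of(value):
    """Return the nested-list shape of a parameter, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def transfer_backbone(image_weights, point_weights, prefix="backbone."):
    """Copy image-pretrained backbone parameters into a point cloud model.

    Only entries whose name starts with `prefix` and whose shape matches the
    target are copied; every other entry of the point cloud model is left as is.
    Returns (merged weights, sorted list of copied names).
    """
    merged = dict(point_weights)
    copied = []
    for name, value in image_weights.items():
        if not name.startswith(prefix):
            continue  # skip image-only parts such as the 2D patch embedding
        if name in merged and shape_of(merged[name]) == shape_of(value):
            merged[name] = value
            copied.append(name)
    return merged, sorted(copied)


# Hypothetical toy checkpoints; real ones would hold tensors, not lists.
image_ckpt = {
    "patch_embed.weight": [[0.1, 0.2, 0.3]],   # 2D tokenizer: not transferred
    "backbone.block0.attn": [[1.0, 2.0], [3.0, 4.0]],
    "backbone.block0.mlp": [[5.0, 6.0]],
    "head.weight": [[9.0]],                    # image head: not transferred
}
point_model = {
    "tokenizer.weight": [[0.0, 0.0, 0.0]],     # 3D tokenizer: trained from scratch
    "backbone.block0.attn": [[0.0, 0.0], [0.0, 0.0]],
    "backbone.block0.mlp": [[0.0, 0.0]],
    "decoder.weight": [[0.0]],                 # 3D decoder: trained from scratch
}

merged, copied = transfer_backbone(image_ckpt, point_model)
print(copied)
```

Filtering by name prefix and shape keeps the transfer restricted to the shared Transformer blocks, which is the part the abstract describes as modality-agnostic. In a real framework the same effect would typically come from a non-strict checkpoint load.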

updated: Tue Nov 22 2022 22:02:10 GMT+0000 (UTC)

published: Thu Aug 25 2022 17:59:29 GMT+0000 (UTC)

arXiv
