SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations

Zhenyu Li; Zehui Chen; Ang Li; Liangji Fang; Qinhong Jiang; Xianming Liu; Junjun Jiang; Bolei Zhou; Hang Zhao

Pre-training has become a standard paradigm in many computer vision tasks. However, most methods are designed for the RGB image domain. Because of the discrepancy between the two-dimensional image plane and three-dimensional space, such pre-trained models fail to perceive spatial information and are sub-optimal for 3D-related tasks. To bridge this gap, we aim to learn a spatial-aware visual representation that can describe three-dimensional space and is better suited to these tasks. To leverage point clouds, which provide far richer spatial information than images, we propose a simple yet effective 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU. Specifically, we develop a multi-modal contrastive learning framework that consists of two modules: an intra-modal spatial perception module that learns a spatial-aware representation from point clouds, and an inter-modal feature interaction module that transfers the capability of perceiving spatial information from the point cloud encoder to the image encoder. Positive pairs for the contrastive losses are established by the matching algorithm and the projection matrix. The whole framework is trained in an unsupervised, end-to-end fashion. To the best of our knowledge, this is the first study to explore contrastive pre-training strategies for outdoor multi-modal datasets that contain paired camera images and LiDAR point clouds. Code and models are available at https://github.com/zhyever/SimIPU.
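As a rough illustration of how a projection matrix can establish positive pairs between point features and image features, the following NumPy sketch projects 3D points into the image, looks up the image feature at each projected pixel, and scores the resulting pairs with an InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 3 x 4 projection matrix layout, the nearest-neighbour feature lookup, the InfoNCE form of the loss, and the temperature value are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def project_points(points_xyz, proj):
    """Project N x 3 points to pixel coordinates with an assumed 3 x 4 matrix."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])  # N x 4 homogeneous coords
    cam = homo @ proj.T                              # N x 3
    depth = cam[:, 2]
    valid = depth > 1e-6                             # keep points in front of the camera
    uv = np.zeros((n, 2))
    uv[valid] = cam[valid, :2] / depth[valid, None]  # perspective division
    return uv, depth, valid


def sample_image_features(feat_map, uv, valid):
    """Nearest-neighbour lookup in an H x W x C feature map (assumed layout)."""
    h, w, _ = feat_map.shape
    cols = np.rint(uv[:, 0]).astype(int)
    rows = np.rint(uv[:, 1]).astype(int)
    inside = valid & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    return feat_map[rows[inside], cols[inside]], inside


def info_nce(point_feats, image_feats, temperature=0.07):
    """InfoNCE-style loss: row i of each input is treated as a positive pair."""
    eps = 1e-12
    a = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    b = image_feats / (np.linalg.norm(image_feats, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In use, the boolean mask returned by `sample_image_features` would select the matching rows of the point-cloud encoder's output, so that `info_nce(point_feats[inside], image_feats)` compares each point only with the pixel it projects onto, with all other pixels in the batch acting as negatives.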

updated: Mon Jan 17 2022 06:57:30 GMT+0000 (UTC)

published: Thu Dec 09 2021 03:27:00 GMT+0000 (UTC)

arXiv
