Multi-Modal Masked Pre-Training for Monocular Panoramic Depth Completion

Zhiqiang Yan; Xiang Li; Kun Wang; Zhenyu Zhang; Jun Li; Jian Yang

In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task, motivated by the fact that panoramic 3D cameras often produce 360° depth with missing data in complex scenes. The goal is to recover dense panoramic depth from raw sparse depth and panoramic RGB images. To address the PDC task, we train a deep network that takes both depth and image as inputs and recovers dense panoramic depth. However, optimizing the network parameters is challenging because the objective function is non-convex. To address this problem, we propose a simple yet effective approach termed M^3PT: multi-modal masked pre-training. Specifically, during pre-training, we simultaneously cover up patches of the panoramic RGB image and the sparse depth with a shared random mask, then reconstruct the sparse depth in the masked regions. To the best of our knowledge, this is the first work to show the effectiveness of masked pre-training in a multi-modal vision task, rather than the single-modal tasks addressed by masked autoencoders (MAE). Unlike MAE, where fine-tuning completely discards the decoder used in pre-training, M^3PT has no architectural difference between the pre-training and fine-tuning stages; they differ only in prediction density, which potentially makes transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M^3PT on three panoramic datasets. Notably, we improve on the state-of-the-art baselines by an average of 26.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog across the three benchmark datasets. Code and pre-trained models are available at https://github.com/anonymoustbd/MMMPT.
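
The shared random mask described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch size of 16, the mask ratio of 0.75, and the convention that a depth value of 0 marks a missing measurement are all assumptions made here for illustration, since the abstract does not specify them.

```python
import numpy as np

def shared_patch_mask(height, width, patch, mask_ratio, rng):
    """Return a boolean (height, width) array; True marks masked pixels.

    Hypothetical helper: patch size and mask ratio are illustrative choices.
    """
    assert height % patch == 0 and width % patch == 0
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    # Pick the masked patches uniformly at random, without repetition.
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)
    # Expand each patch-level decision to a patch x patch block of pixels.
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)

def mask_inputs(rgb, sparse_depth, mask):
    """Apply the same mask to both modalities (the 'shared' part)."""
    rgb_masked = np.where(mask[..., None], 0.0, rgb)
    depth_masked = np.where(mask, 0.0, sparse_depth)
    # Pre-training target: valid sparse depth that lies inside masked regions.
    # Assumption: depth == 0 encodes a missing measurement.
    target_valid = mask & (sparse_depth > 0)
    return rgb_masked, depth_masked, target_valid

rng = np.random.default_rng(0)
H, W, P = 64, 128, 16                      # toy panorama size and patch size
rgb = rng.random((H, W, 3))                # stand-in for a panoramic RGB image
depth = rng.random((H, W)) * 10.0          # stand-in for depth values
depth[rng.random((H, W)) < 0.7] = 0.0      # drop most pixels to make it sparse
mask = shared_patch_mask(H, W, P, mask_ratio=0.75, rng=rng)
rgb_m, depth_m, target_valid = mask_inputs(rgb, depth, mask)
```

In a setup like this, a reconstruction loss would be computed only where `target_valid` is true during pre-training, while fine-tuning would supervise dense depth with the same network, matching the abstract's statement that the two stages differ only in prediction density.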

updated: Fri Mar 18 2022 10:48:22 GMT+0000 (UTC)

published: Fri Mar 18 2022 10:48:22 GMT+0000 (UTC)

arXiv
