SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network

Dongseok Shim; H. Jin Kim

Monocular depth estimation plays a critical role in computer vision and robotics applications such as localization, mapping, and 3D object detection. Recently, learning-based algorithms have achieved great success in depth estimation by training models on large amounts of data in a supervised manner. However, dense ground-truth depth labels for supervised training are difficult to acquire, and unsupervised depth estimation using monocular sequences has emerged as a promising alternative. Unfortunately, most studies on unsupervised depth estimation explore loss functions or occlusion masks, and model architecture has changed little: the ConvNet-based encoder-decoder structure has become the de facto standard for depth estimation. In this paper, we employ a convolution-free Swin Transformer as the image feature extractor so that the network can capture both local geometric features and global semantic features for depth estimation. We also propose a Densely Cascaded Multi-scale Network (DCMNet) that connects every feature map directly with another from a different scale via a top-down cascade pathway. This densely cascaded connectivity reinforces the interconnection between decoding layers and produces high-quality multi-scale depth outputs. Experiments on two datasets, KITTI and Make3D, demonstrate that the proposed method outperforms existing state-of-the-art unsupervised algorithms.
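The top-down dense cascade mentioned in the abstract can be illustrated with a small sketch. This is not the authors' DCMNet: the abstract does not specify the fusion operator or the exact wiring, so the code below assumes that each scale receives its own encoder feature plus every coarser fused map, upsampled by nearest neighbour and combined by a simple element-wise mean. The names `upsample_nearest` and `dense_cascade` are hypothetical, and single-channel nested lists stand in for real feature tensors.

```python
from typing import List

Grid = List[List[float]]  # a single-channel feature map stored as rows of floats


def upsample_nearest(fmap: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling by an integer factor along both axes."""
    out: Grid = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]  # repeat each column
        out.extend([list(wide) for _ in range(factor)])  # repeat each row
    return out


def dense_cascade(features: List[Grid]) -> List[Grid]:
    """Fuse features top-down so every scale sees all coarser fused maps.

    `features` is ordered fine to coarse; each level is assumed to be half
    the height and width of the previous one (an assumption of this sketch).
    """
    n = len(features)
    fused: List[Grid] = [[] for _ in range(n)]
    for i in range(n - 1, -1, -1):  # start at the coarsest scale
        h, w = len(features[i]), len(features[i][0])
        # Sources: this scale's own feature plus all coarser fused maps,
        # each upsampled to the current resolution.
        sources = [features[i]]
        for j in range(i + 1, n):
            sources.append(upsample_nearest(fused[j], 2 ** (j - i)))
        # Placeholder fusion (assumption): element-wise mean of all sources.
        fused[i] = [
            [sum(s[r][c] for s in sources) / len(sources) for c in range(w)]
            for r in range(h)
        ]
    return fused


if __name__ == "__main__":
    feats = [
        [[1.0] * 4 for _ in range(4)],  # fine scale, 4x4
        [[2.0] * 2 for _ in range(2)],  # middle scale, 2x2
        [[4.0]],                        # coarse scale, 1x1
    ]
    for level, fmap in enumerate(dense_cascade(feats)):
        print(level, len(fmap), "x", len(fmap[0]), fmap[0][0])
```

In a real decoder the mean would typically be replaced by learned layers, and each fused map would feed a depth prediction head at its own scale; those details are not given in the abstract and are left out here.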

updated: Tue Jan 17 2023 06:01:46 GMT+0000 (UTC)

published: Tue Jan 17 2023 06:01:46 GMT+0000 (UTC)

arXiv

