High-Resolution Synthetic RGB-D Datasets for Monocular Depth Estimation

Aakash Rajpal; Noshaba Cheema; Klaus Illgner-Fehns; Philipp Slusallek; Sunil Jaiswal

Accurate depth maps are essential in applications such as autonomous driving, scene reconstruction, and point-cloud creation. However, monocular depth estimation (MDE) algorithms often fail to recover sufficient texture and sharpness, and they are inconsistent on homogeneous scenes. These algorithms mostly use CNN- or vision-transformer-based architectures, which require large datasets for supervised training. MDE algorithms trained on the available depth datasets do not generalize well and hence fail to perform accurately in diverse real-world scenes. Moreover, the ground-truth depth maps in these datasets are either low-resolution or sparse, which leads to relatively inconsistent depth predictions. In general, acquiring a high-resolution ground-truth dataset with pixel-level precision is expensive and time-consuming. In this paper, we generate a high-resolution synthetic depth dataset (HRSD) at 1920 × 1080 resolution from Grand Theft Auto V (GTA-V), containing 100,000 color images and corresponding dense ground-truth depth maps. The dataset is diverse, with scenes ranging from indoor to outdoor and from homogeneous surfaces to textured ones. For experiments and analysis, we train DPT, a state-of-the-art transformer-based MDE algorithm, on the proposed synthetic dataset, which increases the accuracy of depth maps on different scenes by 9%. Since the synthetic dataset is of higher resolution, we propose adding a feature extraction module to the transformer encoder and incorporating an attention-based loss, which further improves accuracy by 15%.
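The abstract reports accuracy gains without naming the metric behind them. As a rough illustration only, the sketch below computes three measures commonly used to compare a predicted depth map against a dense ground-truth map: absolute relative error, RMSE, and the δ < 1.25 threshold accuracy. The function name `depth_metrics`, the `min_depth` cutoff, and the choice of these particular metrics are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3):
    """Compare a predicted depth map with a ground-truth depth map.

    Hypothetical helper: the metric set and the min_depth cutoff are
    assumptions for illustration, not the paper's evaluation protocol.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Ignore pixels without a usable ground-truth depth.
    valid = gt > min_depth
    pred, gt = pred[valid], gt[valid]

    # Clamp predictions so the ratio below never divides by zero.
    pred = np.maximum(pred, min_depth)

    abs_rel = np.mean(np.abs(pred - gt) / gt)   # mean absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))   # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)              # fraction of pixels within 25%

    return {"abs_rel": float(abs_rel), "rmse": float(rmse), "delta1": float(delta1)}
```

Called on two arrays of shape (1080, 1920), this returns one score per metric for that image; averaging the scores over a held-out set would give dataset-level numbers of the kind usually quoted for MDE models.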

updated: Tue May 02 2023 19:03:08 GMT+0000 (UTC)

published: Tue May 02 2023 19:03:08 GMT+0000 (UTC)

arXiv
