Deep Digging into the Generalization of Self-supervised Monocular Depth Estimation

Jinwoo Bae; Sungho Moon; Sunghoon Im

Self-supervised monocular depth estimation has been widely studied recently. Most work has focused on improving performance on benchmark datasets such as KITTI, and has offered few experiments on generalization performance. In this paper, we investigate how the choice of backbone network (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) affects the generalization of monocular depth estimation. We first evaluate state-of-the-art models on diverse public datasets that were never seen during network training. Next, we investigate the effects of texture-biased and shape-biased representations using various texture-shifted datasets that we generated. We observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that shape-biased models generalize better than texture-biased models for monocular depth estimation. Based on these observations, we design a new CNN-Transformer hybrid network with a multi-level adaptive feature fusion module, called MonoFormer. The design intuition behind MonoFormer is to increase shape bias by employing Transformers while compensating for the weak locality bias of Transformers by adaptively fusing multi-level representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on various public datasets. Our method also shows the best generalization ability among the competing methods.
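The abstract does not specify how the multi-level representations are fused, so the following is only a minimal sketch of one common way to fuse features adaptively: a softmax-weighted sum over levels. It is not MonoFormer's actual module. The names `adaptive_fuse` and `softmax`, the flat list representation of feature maps, and the per-level scalar scores are all assumptions made for illustration; in a real network the scores would typically be predicted from the features themselves.

```python
import math


def softmax(scores):
    """Convert per-level scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(features, scores):
    """Fuse feature vectors from several levels with softmax weights.

    features: list of equally sized flat feature vectors, one per level
              (hypothetical stand-ins for multi-level feature maps).
    scores:   one scalar relevance score per level (hypothetical; a real
              model would learn to predict these).
    Returns the fused vector and the weights that were applied.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature level")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature levels must have the same size")

    weights = softmax(scores)
    fused = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(length)
    ]
    return fused, weights


if __name__ == "__main__":
    shallow = [1.0, 2.0]  # e.g. a local, texture-rich level
    deep = [3.0, 4.0]     # e.g. a global, shape-oriented level
    fused, weights = adaptive_fuse([shallow, deep], scores=[0.0, 0.0])
    print(fused, weights)
```

With equal scores the fusion reduces to a plain average of the levels; raising one level's score shifts the fused representation toward that level, which is the sense in which such a fusion is "adaptive".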

updated: Sat Nov 19 2022 16:28:05 GMT+0000 (UTC)

published: Mon May 23 2022 06:56:25 GMT+0000 (UTC)

arXiv
