Geometry Uncertainty Projection Network for Monocular 3D Object Detection

Yan Lu; Xinzhu Ma; Lei Yang; Tianzhu Zhang; Yating Liu; Qi Chu; Junjie Yan; Wanli Ouyang

Geometry projection is a powerful depth estimation method in monocular 3D object detection. It estimates depth from object heights, which introduces mathematical priors into the deep model. However, the projection process also introduces an error amplification problem: an error in the estimated height is amplified and strongly reflected in the output depth. This property leads to uncontrollable depth inferences and also harms training efficiency. In this paper, we propose a Geometry Uncertainty Projection Network (GUP Net) to tackle the error amplification problem at both the inference and training stages. Specifically, a GUP module is proposed to obtain the geometry-guided uncertainty of the inferred depth, which not only provides a highly reliable confidence for each depth but also benefits depth learning. Furthermore, at the training stage, we propose a Hierarchical Task Learning strategy to reduce the instability caused by error amplification. This learning algorithm monitors the learning status of each task with a proposed indicator and adaptively assigns proper loss weights to different tasks according to the status of their pre-tasks. As a result, each task starts learning only when its pre-tasks are learned well, which can significantly improve the stability and efficiency of the training process. Extensive experiments demonstrate the effectiveness of the proposed method. The overall model can infer more reliable object depth than existing methods and outperforms the state-of-the-art image-based monocular 3D detectors by 3.74% and 4.7% AP40 on the car and pedestrian categories, respectively, of the KITTI benchmark.
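
The error amplification described in the abstract can be illustrated with a small sketch. It assumes the standard pinhole-camera relation between depth, 3D height, and 2D (pixel) height, d = f * H / h; the abstract itself only states that depth is estimated from heights. The function names and all numeric values below are hypothetical and chosen for illustration only; they are not taken from the paper or from KITTI.

```python
def project_depth(focal_px, height_3d_m, height_2d_px):
    """Depth from the pinhole relation d = f * H / h (assumed, not quoted from the paper)."""
    return focal_px * height_3d_m / height_2d_px


def depth_std_from_height_std(focal_px, height_3d_std_m, height_2d_px):
    """Depth is linear in the 3D height, so a height spread scales by f / h."""
    return focal_px * height_3d_std_m / height_2d_px


# Hypothetical values for a distant object (illustration only).
focal_px = 700.0        # focal length in pixels
height_2d_px = 25.0     # projected height of the object in the image
true_height_m = 1.5     # true 3D height in metres
height_error_m = 0.1    # error in the estimated 3D height

depth_true = project_depth(focal_px, true_height_m, height_2d_px)
depth_est = project_depth(focal_px, true_height_m + height_error_m, height_2d_px)
depth_error_m = depth_est - depth_true

# The amplification factor f / h grows as the object gets smaller in the image.
amplification = focal_px / height_2d_px
depth_std_m = depth_std_from_height_std(focal_px, height_error_m, height_2d_px)

print(f"true depth: {depth_true:.1f} m, estimated depth: {depth_est:.1f} m")
print(f"height error {height_error_m:.2f} m -> depth error {depth_error_m:.2f} m "
      f"(factor {amplification:.0f})")
```

Under these assumed numbers, a 0.1 m height error becomes a depth error of roughly 2.8 m. The same f / h factor carries a spread in the estimated height over to the depth, which is one simple way to read the idea of a geometry-guided depth uncertainty; the paper's actual GUP module may differ in its details.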

updated: Thu Jul 29 2021 06:59:07 GMT+0000 (UTC)

published: Thu Jul 29 2021 06:59:07 GMT+0000 (UTC)

arXiv
