Learning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection

Xianpeng Liu; Nan Xue; Tianfu Wu

Monocular 3D object detection aims to localize 3D bounding boxes from a single 2D input image. It is a highly challenging problem and remains open, especially when no extra information (e.g., depth, lidar and/or multiple frames) can be leveraged in training and/or inference. This paper proposes a simple yet effective formulation for monocular 3D object detection that does not exploit any extra information. It presents the MonoCon method, which learns Monocular Contexts as auxiliary tasks in training to help monocular 3D object detection. The key idea is that, given the annotated 3D bounding boxes of objects in an image, a rich set of well-posed projected 2D supervision signals is available in training, such as the projected corner keypoints and their associated offset vectors with respect to the center of the 2D bounding box, and these signals should be exploited as auxiliary tasks in training. At a high level, the proposed MonoCon is motivated by the Cramér-Wold theorem in measure theory. In implementation, it uses a very simple end-to-end design to justify the effectiveness of learning auxiliary monocular contexts. The design consists of three components: a Deep Neural Network (DNN) based feature backbone, a number of regression head branches for learning the essential parameters used in the 3D bounding box prediction, and a number of regression head branches for learning auxiliary contexts. After training, the auxiliary context regression branches are discarded for better inference efficiency. In experiments, the proposed MonoCon is tested on the KITTI benchmark (car, pedestrian and cyclist). It outperforms all prior art on the leaderboard in the car category and obtains comparable accuracy on pedestrian and cyclist. Thanks to its simple design, the proposed MonoCon method obtains the fastest inference speed among the compared methods, at 38.7 fps.
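
The auxiliary 2D supervision signals mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a KITTI-style box parameterization (bottom-face center, dimensions (h, w, l), yaw about the camera Y axis), a plain pinhole camera with made-up intrinsics, and a 2D box taken as the tight box around the projected corners. All function names are hypothetical.

```python
import math

def box3d_corners(center, dims, yaw):
    """Return the 8 corners of a 3D box in camera coordinates.

    Assumed convention (KITTI-style): center is the bottom-face center
    (x, y, z), dims is (h, w, l), yaw rotates about the camera Y axis,
    and Y points down, so the top face sits at y - h.
    """
    x, y, z = center
    h, w, l = dims
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dy in (0.0, -h):  # bottom face, then top face
        for dx, dz in ((l / 2, w / 2), (l / 2, -w / 2),
                       (-l / 2, -w / 2), (-l / 2, w / 2)):
            # Rotate the local offset about Y, then translate.
            corners.append((x + c * dx + s * dz, y + dy, z - s * dx + c * dz))
    return corners

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)

def auxiliary_targets(center, dims, yaw, intrinsics):
    """Projected corner keypoints and their offsets from the 2D box center."""
    keypoints = [project(p, *intrinsics) for p in box3d_corners(center, dims, yaw)]
    us = [u for u, _ in keypoints]
    vs = [v for _, v in keypoints]
    # Assumption: the 2D box is the tight box around the projected corners.
    bbox2d = (min(us), min(vs), max(us), max(vs))
    center2d = ((bbox2d[0] + bbox2d[2]) / 2, (bbox2d[1] + bbox2d[3]) / 2)
    offsets = [(u - center2d[0], v - center2d[1]) for u, v in keypoints]
    return {"keypoints": keypoints, "bbox2d": bbox2d,
            "center2d": center2d, "offsets": offsets}

# Hypothetical car 10 m ahead of the camera; intrinsics are (fx, fy, cx, cy).
targets = auxiliary_targets(center=(0.0, 1.5, 10.0), dims=(1.5, 1.6, 4.0),
                            yaw=0.0, intrinsics=(700.0, 700.0, 600.0, 180.0))
```

Each entry of `targets` depends only on the 3D annotation and the camera parameters, which is why such signals come at no extra labeling cost and can be regressed by heads that are dropped after training.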

updated: Thu Dec 09 2021 00:05:34 GMT+0000 (UTC)

published: Thu Dec 09 2021 00:05:34 GMT+0000 (UTC)

arXiv
