OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving

Varun Ravi Kumar; Senthil Yogamani; Hazem Rashed; Ganesh Sistu; Christian Witt; Isabelle Leang; Stefan Milz; Patrick Mäder

Surround-view fisheye cameras are commonly deployed in automated driving for 360° near-field sensing around the vehicle. This work presents a multi-task visual perception network on unrectified fisheye images to enable the vehicle to sense its surrounding environment. It consists of six primary tasks necessary for an autonomous driving system: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection. We demonstrate that the jointly trained model performs better than the respective single-task versions. Our multi-task model has a shared encoder, which provides a significant computational advantage, and synergized decoders in which tasks support each other. We propose a novel camera-geometry-based adaptation mechanism to encode the fisheye distortion model at both training and inference. This was crucial to enable training on the WoodScape dataset, which comprises data from different parts of the world collected by 12 different cameras mounted on three different cars with different intrinsics and viewpoints. Given that bounding boxes are not a good representation for distorted fisheye images, we also extend object detection to use a polygon with non-uniformly sampled vertices. We additionally evaluate our model on standard automotive datasets, namely KITTI and Cityscapes. We obtain state-of-the-art results on KITTI for the depth estimation and pose estimation tasks and competitive performance on the other tasks. We perform extensive ablation studies on various architecture choices and task weighting methodologies. A short video at https://youtu.be/xbSjZ5OfPes provides qualitative results.
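
The computational advantage of a shared encoder can be illustrated with a toy sketch. This is not the OmniDet implementation: the class names `SharedEncoder` and `TaskHead`, the linear layers, and all dimensions are hypothetical stand-ins chosen for illustration, and the sketch models neither the synergy between decoders nor the frame pairs that visual odometry would need. It only shows that the encoder runs once per image while six task heads reuse its features.

```python
import random

# Task names follow the six tasks listed in the abstract.
TASKS = ["depth", "odometry", "semantic_seg", "motion_seg", "detection", "soiling"]


class SharedEncoder:
    """Hypothetical stand-in for an image encoder; counts how often it runs."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(in_dim)]
                        for _ in range(feat_dim)]
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        # Linear projection followed by ReLU.
        return [max(0.0, sum(w * x for w, x in zip(row, image)))
                for row in self.weights]


class TaskHead:
    """Hypothetical per-task decoder: one linear layer producing a scalar."""

    def __init__(self, feat_dim, seed):
        rng = random.Random(seed)
        self.weights = [rng.uniform(-1, 1) for _ in range(feat_dim)]

    def __call__(self, features):
        return sum(w * f for w, f in zip(self.weights, features))


def multi_task_forward(encoder, heads, image):
    """Run the encoder once and feed the same features to every head."""
    features = encoder(image)
    return {name: head(features) for name, head in heads.items()}


encoder = SharedEncoder(in_dim=8, feat_dim=4)
heads = {name: TaskHead(feat_dim=4, seed=i + 1) for i, name in enumerate(TASKS)}
image = [0.1 * i for i in range(8)]  # toy "image" as a flat vector

outputs = multi_task_forward(encoder, heads, image)
print(sorted(outputs))
print("encoder calls:", encoder.calls)
```

With six independent single-task models the encoder would run six times per image; here it runs once, which is where the saving comes from when the encoder dominates the cost.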

updated: Tue Aug 24 2021 14:45:16 GMT+0000 (UTC)

published: Mon Feb 15 2021 10:46:24 GMT+0000 (UTC)

arXiv
