BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Yunpeng Zhang; Zheng Zhu; Wenzhao Zheng; Junjie Huang; Guan Huang; Jie Zhou; Jiwen Lu

In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies that focus on improving single-task approaches, BEVerse produces spatio-temporal bird's-eye-view (BEV) representations from multi-camera videos and jointly reasons about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images. After ego-motion alignment, a spatio-temporal encoder performs further feature extraction in BEV. Finally, multiple task decoders are attached for joint reasoning and prediction. Within the decoders, we propose a grid sampler to generate BEV features with different ranges and granularities for different tasks. We also design an iterative flow method for memory-efficient future prediction. We show that temporal information improves 3D object detection and semantic map construction, while multi-task learning can implicitly benefit motion prediction. Extensive experiments on the nuScenes dataset show that the multi-task BEVerse outperforms existing single-task methods on 3D object detection, semantic map construction, and motion prediction. Compared with the sequential paradigm, BEVerse is also significantly more efficient. The code and trained models will be released at https://github.com/zhangyp15/BEVerse.
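The abstract describes the grid sampler only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: a shared BEV feature map is bilinearly resampled onto a task-specific grid with its own metric range and cell count. The function name `grid_sample_bev`, the range layout `(x_min, x_max, y_min, y_max)`, the border clamping, and the example ranges are all assumptions made for illustration.

```python
import numpy as np

def grid_sample_bev(feat, src_range, dst_range, dst_size):
    """Bilinearly resample a BEV feature map onto a task-specific grid.

    feat:      array of shape (C, H, W); rows index y, columns index x.
    src_range: (x_min, x_max, y_min, y_max) in metres covered by feat.
    dst_range: same layout, the region wanted by the task.
    dst_size:  (H_out, W_out), number of output cells.
    Samples that fall outside the source map are clamped to its border
    (an assumption of this sketch).
    """
    _, H, W = feat.shape
    sx0, sx1, sy0, sy1 = src_range
    dx0, dx1, dy0, dy1 = dst_range
    Ho, Wo = dst_size

    # Metric coordinates of the output cell centres.
    xs = dx0 + (np.arange(Wo) + 0.5) * (dx1 - dx0) / Wo
    ys = dy0 + (np.arange(Ho) + 0.5) * (dy1 - dy0) / Ho

    # Continuous source indices (source cell centre i sits at index i).
    u = np.clip((xs - sx0) / (sx1 - sx0) * W - 0.5, 0, W - 1)
    v = np.clip((ys - sy0) / (sy1 - sy0) * H - 0.5, 0, H - 1)

    # Neighbouring integer indices and interpolation weights.
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    wu = (u - u0)[None, None, :]
    wv = (v - v0)[None, :, None]

    # Interpolate along x on the two bracketing rows, then along y.
    top = feat[:, v0][:, :, u0] * (1 - wu) + feat[:, v0][:, :, u1] * wu
    bot = feat[:, v1][:, :, u0] * (1 - wu) + feat[:, v1][:, :, u1] * wu
    return top * (1 - wv) + bot * wv

# Hypothetical usage: one shared map, two task-specific views.
shared = np.random.rand(64, 128, 128)        # covers 100 m x 100 m
full = (-50.0, 50.0, -50.0, 50.0)
wide_coarse = grid_sample_bev(shared, full, full, (64, 64))
near_fine = grid_sample_bev(shared, full, (-30.0, 30.0, -15.0, 15.0), (100, 200))
```

In this sketch a task that needs a wide field of view can request the full range at a coarse resolution, while a task that needs fine detail near the ego vehicle can request a smaller range with more cells, both from the same shared features.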

updated: Thu May 19 2022 17:55:35 GMT+0000 (UTC)

published: Thu May 19 2022 17:55:35 GMT+0000 (UTC)

arXiv
