Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark

Xiaofeng Wang; Zheng Zhu; Yunpeng Zhang; Guan Huang; Yun Ye; Wenbo Xu; Ziwei Chen; Xingang Wang

In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap between idealized research settings and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference time delay. To mitigate this problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. Building on the nuScenes dataset, which is annotated at 2Hz, we first propose an annotation-extending pipeline that generates high-frame-rate labels for the 12Hz raw images. With practical deployment in mind, we further construct the Streaming Perception Under constRained-computation (SPUR) evaluation protocol, in which the 12Hz inputs are used for streaming evaluation under different computational resource constraints. On the ASAP benchmark, comprehensive experimental results reveal that the ranking of models changes under different constraints, suggesting that model latency and computation budget should be treated as design choices when optimizing for practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently improve streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.
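To make the difference between offline and streaming evaluation concrete, here is a minimal sketch of how a fixed inference latency leaves predictions stale at evaluation time. It is not the SPUR protocol or the ASAP codebase: the function names `simulate_stream` and `pair_for_evaluation` are hypothetical, and the scheduling policy (a single model that, whenever idle, processes the newest available frame) is an assumption made for illustration. Only the 12Hz input rate and the 300ms runtime come from the abstract.

```python
from bisect import bisect_right
from fractions import Fraction


def simulate_stream(frame_times, latency):
    """Simulate one model that, whenever idle, processes the newest frame.

    Returns the indices of the frames that were processed and the times at
    which their predictions became available. (Assumed policy, see lead-in.)
    """
    processed, finish_times = [], []
    t = frame_times[0]
    while True:
        idx = bisect_right(frame_times, t) - 1  # newest frame at time t
        if processed and idx == processed[-1]:
            # Nothing new has arrived yet: wait for the next frame, if any.
            if idx + 1 == len(frame_times):
                break
            t = frame_times[idx + 1]
            continue
        processed.append(idx)
        t = t + latency  # the model is busy until the prediction is ready
        finish_times.append(t)
    return processed, finish_times


def pair_for_evaluation(eval_times, processed, finish_times):
    """For each evaluation time, pick the latest prediction already finished.

    Returns the index of the source frame, or None if no output exists yet.
    """
    pairs = []
    for e in eval_times:
        k = bisect_right(finish_times, e) - 1
        pairs.append(processed[k] if k >= 0 else None)
    return pairs


# One second of 12Hz input; exact fractions avoid floating-point ties.
frame_times = [Fraction(i, 12) for i in range(12)]

# Offline-style evaluation: zero latency, every frame is paired with itself.
offline_pairs = pair_for_evaluation(
    frame_times, *simulate_stream(frame_times, Fraction(0))
)

# Streaming evaluation with a 300ms detector.
streaming_pairs = pair_for_evaluation(
    frame_times, *simulate_stream(frame_times, Fraction(3, 10))
)
```

Under these assumptions, `offline_pairs` pairs every frame with its own prediction, whereas `streaming_pairs` has no output at all for the first four evaluation times and afterwards scores each ground-truth frame against a prediction made from an input several frames older. That staleness is the cost that an offline benchmark does not measure.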

updated: Sat Dec 17 2022 16:32:15 GMT+0000 (UTC)

published: Sat Dec 17 2022 16:32:15 GMT+0000 (UTC)

arXiv
