CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Runsheng Xu; Zhengzhong Tu; Hao Xiang; Wei Shao; Bolei Zhou; Jiaqi Ma

Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial sensing for autonomous driving. Although recent literature has made significant progress on BEV map understanding, existing methods are all based on single-agent camera-based systems, which struggle to handle occlusions and to detect distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V) communication technologies enable autonomous vehicles to share sensing information, which can dramatically improve perception performance and range compared to single-agent systems. In this paper, we propose CoBEVT, the first generic multi-agent multi-camera perception framework that can cooperatively generate BEV map predictions. To efficiently fuse camera features from multi-view and multi-agent data in an underlying Transformer architecture, we design a fused axial attention (FAX) module, which sparsely captures local and global spatial interactions across views and agents. Extensive experiments on the V2V perception dataset OPV2V demonstrate that CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation. Moreover, CoBEVT is shown to generalize to other tasks, including 1) BEV segmentation with single-agent multi-camera systems and 2) 3D object detection with multi-agent LiDAR systems, achieving state-of-the-art performance with real-time inference speed.
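The abstract does not spell out how the FAX module selects which tokens attend to each other. As a rough illustration only, the sketch below assumes one common way to get sparse local and global interactions on a BEV grid: non-overlapping local windows for nearby cells, and a strided grid for cells spread across the whole map, with each group shared across agents. The function names, the partitioning scheme, and the parameters `window`, `grid`, and `num_agents` are all hypothetical and are not taken from the paper.

```python
def local_window_groups(height, width, window):
    """Group BEV cells into non-overlapping window x window blocks.

    Assumption: height and width are divisible by window.
    Cells in the same block would attend to each other (local interaction).
    """
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault((y // window, x // window), []).append((y, x))
    return list(groups.values())


def sparse_grid_groups(height, width, grid):
    """Group BEV cells sampled with a uniform stride across the map.

    Assumption: height and width are divisible by grid.
    Each group holds grid x grid cells spread over the whole map,
    so attention inside a group is global but sparse.
    """
    stride_y, stride_x = height // grid, width // grid
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault((y % stride_y, x % stride_x), []).append((y, x))
    return list(groups.values())


def across_agents(group, num_agents):
    """Replicate one spatial group over all agents as (agent, y, x) tokens."""
    return [(agent, y, x) for agent in range(num_agents) for (y, x) in group]


# Toy example: a 4 x 4 BEV map shared by 2 agents.
local = local_window_groups(4, 4, window=2)
sparse = sparse_grid_groups(4, 4, grid=2)
print(local[0])   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sparse[0])  # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(len(across_agents(local[0], num_agents=2)))  # 8
```

Under these assumptions, each token attends to `num_agents * window * window` tokens (8 in the toy case) rather than to all `num_agents * height * width` tokens (32), which is where the efficiency of a sparse scheme would come from.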

updated: Tue Jul 05 2022 17:59:28 GMT+0000 (UTC)

published: Tue Jul 05 2022 17:59:28 GMT+0000 (UTC)

arXiv
