BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout

Kairui Yang; Enhui Ma; Jibin Peng; Qing Guo; Di Lin; Kaicheng Yu

Using synthesized images to boost the performance of perception models is a long-standing research challenge in computer vision. The challenge is more pressing in vision-centric autonomous driving systems with multi-view cameras, because some long-tail scenarios can never be collected. Guided by BEV segmentation layouts, existing generative networks appear to synthesize photo-realistic street-view images when evaluated solely on scene-level metrics. On closer inspection, however, they usually fail to produce accurate foreground and background details such as heading. To this end, we propose a two-stage generative method, dubbed BEVControl, that generates accurate foreground and background content. Beyond segmentation-like input, it also supports sketch-style input, which is more flexible for humans to edit. In addition, we propose a comprehensive multi-level evaluation protocol to fairly compare the quality of the generated scene, foreground objects, and background geometry. Our extensive experiments show that BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin, raising foreground segmentation mIoU from 5.89 to 26.80. In addition, we show that training a downstream perception model with images generated by BEVControl improves its NDS score by 1.29 on average.
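For readers unfamiliar with the mIoU figure quoted in the abstract, the following is a minimal sketch of mean intersection-over-union as it is commonly defined for segmentation. It is not the paper's evaluation code: the abstract does not specify the protocol, and the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both inputs are all assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, given two flat lists of integer class labels.

    Hypothetical helper for illustration only; classes that appear in
    neither list are skipped rather than counted as IoU 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Intersection: positions where both prediction and target are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Union: positions where either prediction or target is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: 0 = background, 1 = foreground.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy example each class has an intersection of 2 positions and a union of 4, so both per-class IoUs, and therefore the mean, are 0.5.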

updated: Thu Aug 03 2023 09:56:31 GMT+0000 (UTC)

published: Thu Aug 03 2023 09:56:31 GMT+0000 (UTC)

arXiv
