An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection

Ziye Chen; Kate Smith-Miles; Bo Du; Guoqi Qian; Mingming Gong

Accurately detecting lane lines in 3D space is crucial for autonomous driving. Existing methods usually first transform image-view features into bird's-eye view (BEV) with the aid of inverse perspective mapping (IPM), and then detect lane lines from the BEV features. However, IPM ignores changes in road height, which leads to inaccurate view transformations. In addition, the two separate stages of this process can accumulate errors and increase complexity. To address these limitations, we propose an efficient transformer for 3D lane detection. Unlike the vanilla transformer, our model contains a decomposed cross-attention mechanism that learns lane and BEV representations simultaneously. The mechanism decomposes the cross-attention between image-view and BEV features into two parts: one between image-view and lane features, and one between lane and BEV features, both of which are supervised with ground-truth lane lines. Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively. This yields a more accurate view transformation than IPM-based methods, because the view transformation is learned from data through supervised cross-attention. Furthermore, the cross-attention between lane and BEV features lets the two adjust to each other, resulting in more accurate lane detection than a pipeline of two separate stages. Finally, the decomposed cross-attention is more efficient than the original cross-attention. Experimental results on OpenLane and ONCE-3DLanes demonstrate the state-of-the-art performance of our method.
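To make the decomposition described above concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes standard scaled dot-product attention, and it assumes a particular direction for each step (lane features query image-view features, then BEV features query the updated lane features); the abstract does not specify these details. The function names and the feature counts in the usage note are hypothetical. The helper `attention_pairs` counts query-key pairs to illustrate why routing attention through a small set of lane features can be cheaper than attending directly between BEV and image-view features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim_out)])
    return out


def decomposed_cross_attention(bev, lane, image):
    """Two small attentions instead of one large BEV-to-image attention.

    Assumption (not stated in the abstract): the direction of each step.
    """
    # Step 1 (assumed): lane features gather evidence from image-view features.
    lane_updated = cross_attention(lane, image, image)
    # Step 2 (assumed): BEV features read from the updated lane features.
    bev_updated = cross_attention(bev, lane_updated, lane_updated)
    return bev_updated, lane_updated


def attention_pairs(n_bev, n_lane, n_image):
    """Count query-key pairs for direct versus decomposed cross-attention."""
    full = n_bev * n_image
    decomposed = n_lane * n_image + n_bev * n_lane
    return full, decomposed
```

As an illustration with made-up sizes, `attention_pairs(1000, 20, 2000)` compares 1000 BEV features, 20 lane features, and 2000 image-view features: the direct attention scores 2,000,000 pairs, while the decomposed form scores 60,000. The saving depends on the number of lane features being much smaller than the number of BEV and image-view features.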

updated: Thu Jun 08 2023 04:18:31 GMT+0000 (UTC)

published: Thu Jun 08 2023 04:18:31 GMT+0000 (UTC)

arXiv
