Scaling Vision-based End-to-End Driving with Multi-View Attention Learning

Yi Xiao; Felipe Codevilla; Diego Porres; Antonio M. Lopez

In end-to-end driving, human driving demonstrations are used to train perception-based driving models by imitation learning. This process is supervised by vehicle signals (e.g., steering angle, acceleration) and requires no additional costly supervision such as human labeling of sensor data. As a representative of such vision-based end-to-end driving models, CILRS is commonly used as a baseline for comparison with new driving models. So far, some recent models outperform CILRS by using expensive sensor suites and/or large amounts of human-labeled training data. Given this performance gap, one may think that vision-based pure end-to-end driving is not worth pursuing. However, we argue that this approach still has great value and potential when cost and maintenance are considered. In this paper, we present CIL++, which improves on CILRS both by processing higher-resolution images, using a human-inspired horizontal field of view (HFOV) as an inductive bias, and by incorporating a proper attention mechanism. CIL++ achieves performance competitive with models that are more costly to develop. We propose replacing CILRS with CIL++ as a strong vision-based pure end-to-end driving baseline, supervised only by vehicle signals and trained by conditional imitation learning.
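
To make the supervision described in the abstract concrete, the following is a minimal sketch of an imitation loss computed from vehicle signals alone. It is not the loss used by CILRS or CIL++; the function names, the choice of an L1 distance, and the equal weighting of steering and acceleration are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

Action = Tuple[float, float]  # (steering, acceleration); hypothetical layout


def imitation_l1_loss(
    predicted: Action,
    expert: Action,
    weights: Sequence[float] = (0.5, 0.5),  # assumed equal weighting
) -> float:
    """Weighted L1 distance between a predicted action and the demonstrated one."""
    if not (len(predicted) == len(expert) == len(weights)):
        raise ValueError("predicted, expert and weights must have equal length")
    return sum(w * abs(p - e) for w, p, e in zip(weights, predicted, expert))


def batch_loss(predictions: Sequence[Action], demonstrations: Sequence[Action]) -> float:
    """Mean per-sample loss over a batch of (prediction, demonstration) pairs."""
    if len(predictions) != len(demonstrations) or not predictions:
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(imitation_l1_loss(p, d) for p, d in zip(predictions, demonstrations))
    return total / len(predictions)
```

The only targets are the recorded steering and acceleration values from the human demonstration, so no labeling of the camera images is needed.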

updated: Sat Jul 22 2023 14:01:00 GMT+0000 (UTC)

published: Tue Feb 07 2023 02:14:45 GMT+0000 (UTC)

arXiv
