Transformer-based stereo-aware 3D object detection from binocular images

Hanqing Sun; Yanwei Pang; Jiale Cao; Jin Xie; Xuelong Li

Vision Transformers have shown promising progress in various object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. However, when applied to classic stereo 3D object detection, directly adopting those surround-view Transformers leads to slow convergence and significant precision drops. We argue that one cause of this defect is that surround-view Transformers do not consider the stereo-specific image correspondence information. In a surround-view system, the overlapping areas between views are small, so correspondence is not a primary issue. In this paper, we explore the model design of vision Transformers for stereo 3D object detection, focusing on extracting and encoding the task-specific image correspondence information. To this end, we present TS3D, a Transformer-based Stereo-aware 3D object detector. In TS3D, a Disparity-Aware Positional Encoding (DAPE) model is proposed to embed the image correspondence information into stereo features. The correspondence is encoded as normalized disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the location information of the 3D scene. To extract enriched multi-scale stereo features, we propose a Stereo Reserving Feature Pyramid Network (SRFPN). The SRFPN is designed to preserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features. Our proposed TS3D achieves 41.29% average precision for Moderate Car detection on the KITTI test set and takes 88 ms to detect objects from each binocular image pair. It is competitive with advanced counterparts in terms of both precision and inference speed.
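
The abstract does not give the exact DAPE formulation, so the following is only an illustrative sketch of the general idea: a per-pixel disparity is normalized and passed through the same sinusoidal encoding as the 2D pixel coordinates, and the three encodings are concatenated. The function names, the `max_disparity` and `num_feats` parameters, the 2π scaling, and the choice of concatenation are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sinusoidal_encoding(values, num_feats, temperature=10000.0):
    """Map values of any shape to shape (..., num_feats) with interleaved sin/cos."""
    values = np.asarray(values, dtype=np.float64)
    # Each sin/cos pair shares one frequency, as in the standard Transformer encoding.
    exponents = 2.0 * (np.arange(num_feats) // 2) / num_feats
    dim_t = temperature ** exponents
    angles = values[..., None] / dim_t
    enc = np.empty_like(angles)
    enc[..., 0::2] = np.sin(angles[..., 0::2])
    enc[..., 1::2] = np.cos(angles[..., 1::2])
    return enc

def disparity_aware_encoding(disparity, max_disparity, num_feats=32):
    """Hypothetical helper: disparity is an (H, W) array in pixels.

    Returns an (H, W, 3 * num_feats) array holding the y, x, and
    normalized-disparity encodings, in that order (an assumed layout).
    """
    h, w = disparity.shape
    scale = 2.0 * np.pi
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Normalize pixel coordinates and disparity to [0, 2*pi].
    y_norm = (ys + 0.5) / h * scale
    x_norm = (xs + 0.5) / w * scale
    d_norm = np.clip(disparity / max_disparity, 0.0, 1.0) * scale
    return np.concatenate(
        [
            sinusoidal_encoding(y_norm, num_feats),
            sinusoidal_encoding(x_norm, num_feats),
            sinusoidal_encoding(d_norm, num_feats),
        ],
        axis=-1,
    )

# Usage: a synthetic 4x6 disparity map with values in [0, 64) pixels.
disp = np.random.default_rng(0).uniform(0.0, 64.0, size=(4, 6))
pe = disparity_aware_encoding(disp, max_disparity=64.0)
```

For a rectified stereo pair with focal length f and baseline B, depth and disparity d are related by z = fB/d, which is why a disparity channel can stand in for the third spatial coordinate alongside the 2D position terms.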

updated: Thu Jun 15 2023 01:56:53 GMT+0000 (UTC)

published: Mon Apr 24 2023 08:29:45 GMT+0000 (UTC)

arXiv
