Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For Autonomous Driving

Kinjal Dasgupta; Arindam Das; Sudip Das; Ujjwal Bhattacharya; Senthil Yogamani

Pedestrian detection is the most critical module of an autonomous driving system. Although an RGB camera is commonly used for this purpose, its image quality degrades severely in low-light nighttime driving scenarios. The image quality of a thermal camera, on the other hand, remains unaffected under similar conditions. This paper proposes an end-to-end multimodal fusion model for pedestrian detection using RGB and thermal images. Its novel spatio-contextual deep network architecture is capable of exploiting the multimodal input efficiently. It consists of two distinct deformable ResNeXt-50 encoders that extract features from the two modalities. These two sets of encoded features are fused inside a multimodal feature embedding module (MuFEm), which consists of several groups, each pairing a Graph Attention Network with a feature fusion unit. The output of the last feature fusion unit of MuFEm is then passed to two conditional random fields (CRFs) for spatial refinement. The features are further enhanced by applying channel-wise attention and by extracting contextual information with four recurrent neural networks (RNNs) that traverse the feature map in four different directions. Finally, a single-stage decoder uses these feature maps to generate the bounding box of each pedestrian and the score map. We have performed extensive experiments with the proposed framework on three publicly available multimodal pedestrian detection benchmark datasets, namely KAIST, CVC-14, and UTokyo. On each of them, the results improve on the respective state-of-the-art performance. A short video giving an overview of this work along with its qualitative results can be seen at https://youtu.be/FDJdSifuuCs. Our source code will be released upon publication of the paper.
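
The idea of extracting context with four RNNs traversing in four different directions can be illustrated with a minimal sketch. The abstract does not state the recurrent cell type, the weights, or how the four outputs are combined, so everything below is an assumption made for illustration: a scalar tanh recurrence, the hypothetical names sweep and four_direction_context, the fixed weights w_in and w_rec, and per-cell stacking of the four results. It is not the authors' implementation.

```python
import math


def sweep(grid, direction, w_in=1.0, w_rec=0.5):
    """Run a scalar tanh recurrence over a 2D grid in one direction.

    Illustrative assumption: h = tanh(w_in * x + w_rec * h_prev), with the
    hidden state reset to 0 at the start of every row (or column).
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    if direction in ("right", "left"):
        order = range(cols) if direction == "right" else range(cols - 1, -1, -1)
        for r in range(rows):
            h = 0.0
            for c in order:
                h = math.tanh(w_in * grid[r][c] + w_rec * h)
                out[r][c] = h
    elif direction in ("down", "up"):
        order = range(rows) if direction == "down" else range(rows - 1, -1, -1)
        for c in range(cols):
            h = 0.0
            for r in order:
                h = math.tanh(w_in * grid[r][c] + w_rec * h)
                out[r][c] = h
    else:
        raise ValueError(f"unknown direction: {direction}")
    return out


def four_direction_context(grid):
    """Stack the four directional sweeps so each cell holds a 4-tuple
    ordered as (right, left, down, up). Stacking is an assumed way of
    combining the sweeps, chosen only to keep the sketch simple."""
    rows, cols = len(grid), len(grid[0])
    sweeps = [sweep(grid, d) for d in ("right", "left", "down", "up")]
    return [[tuple(s[r][c] for s in sweeps) for c in range(cols)]
            for r in range(rows)]


# A single active cell at the top-left corner of a 2x2 map.
feature_map = [[1.0, 0.0],
               [0.0, 0.0]]
context = four_direction_context(feature_map)
```

In this toy example the rightward and downward sweeps carry the top-left activation to its neighbours, while the leftward and upward sweeps at those neighbours see nothing, so each cell ends up with a direction-dependent summary of its surroundings.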

updated: Sun Nov 21 2021 14:01:35 GMT+0000 (UTC)

published: Wed May 26 2021 17:50:36 GMT+0000 (UTC)

arXiv
