MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization

Yinghui Xing; Song Wang; Shizhou Zhang; Guoqiang Liang; Xiuwei Zhang; Yanning Zhang

Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities provide complementary information, especially under low-light conditions. Most existing multispectral pedestrian detectors are built on non-end-to-end detectors. In this paper, we propose the MultiSpectral pedestrian DEtection TRansformer (MS-DETR), an end-to-end multispectral pedestrian detector that extends DETR to multi-modal detection. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder in which the visible and thermal features are fused. To make the detector robust to misalignment between the multi-modal images, we design a loosely coupled fusion strategy that sparsely samples keypoints from the features of each modality independently and fuses them with adaptively learned attention weights. Moreover, based on the insight that not only different modalities but also different pedestrian instances tend to have different confidence scores for the final detection, we further propose an instance-aware modality-balanced optimization strategy, which preserves the visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14, and LLVIP benchmark datasets. The source code is available at https://github.com/YinghuiXing/MS-DETR.
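The loosely coupled fusion described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a deformable-attention-style scheme in which each modality is sampled at its own keypoints around a shared reference point, and a single softmax over all sampled points supplies the fusion weights. The function names, the nested-list feature layout (H x W x C), and the idea of passing offsets and attention logits as plain arguments are all hypothetical; in a trained model these values would presumably be predicted from the decoder query.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly sample an H x W x C feature map (nested lists) at pixel (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp the sampling location to the feature map borders.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    channels = len(feat[0][0])
    return [
        (1 - dx) * (1 - dy) * feat[y0][x0][i]
        + dx * (1 - dy) * feat[y0][x1][i]
        + (1 - dx) * dy * feat[y1][x0][i]
        + dx * dy * feat[y1][x1][i]
        for i in range(channels)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loosely_coupled_fusion(vis_feat, thr_feat, ref_xy, vis_offsets, thr_offsets, logits):
    """Fuse keypoints sampled independently from visible and thermal features.

    Assumption: `logits` holds one attention logit per sampled point,
    visible points first, then thermal points.
    """
    rx, ry = ref_xy
    # Each modality uses its own offsets, so the two images need not be aligned.
    samples = [bilinear_sample(vis_feat, rx + ox, ry + oy) for ox, oy in vis_offsets]
    samples += [bilinear_sample(thr_feat, rx + ox, ry + oy) for ox, oy in thr_offsets]
    weights = softmax(logits)  # normalized jointly across both modalities
    channels = len(samples[0])
    return [sum(w * s[i] for w, s in zip(weights, samples)) for i in range(channels)]


# Toy usage: constant 2 x 2 single-channel maps, one keypoint per modality.
vis = [[[1.0], [1.0]], [[1.0], [1.0]]]
thr = [[[3.0], [3.0]], [[3.0], [3.0]]]
fused = loosely_coupled_fusion(vis, thr, (0.5, 0.5), [(0.2, 0.0)], [(-0.3, 0.1)], [0.0, 0.0])
print(fused)  # equal logits average the two modalities
```

Because the thermal keypoints are offset independently of the visible ones, a pedestrian that appears shifted between the two images can still be sampled at the right location in each, which is the intuition behind calling the fusion loosely coupled.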

updated: Sat Nov 11 2023 12:27:50 GMT+0000 (UTC)

published: Wed Feb 01 2023 07:45:10 GMT+0000 (UTC)

arXiv
