CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Huayao Liu; Jiaming Zhang; Kailun Yang; Xinxin Hu; Rainer Stiefelhagen

Pixel-wise semantic segmentation of RGB images can be improved by exploiting informative features from supplementary modalities. In this work, we propose CMX, a vision-transformer-based cross-modal fusion framework for RGB-X semantic segmentation. To generalize across sensing modalities that differ in the supplementary cues they provide and in their uncertainties, we argue that comprehensive cross-modal interactions are needed. CMX uses two streams to extract features from RGB images and from the complementary modality (X-modality). At each feature extraction stage, a Cross-Modal Feature Rectification Module (CM-FRM) calibrates the features of the current modality by combining them with features from the other modality, in both the spatial and channel dimensions. Given the rectified feature pairs, a Feature Fusion Module (FFM) mixes them for the final semantic prediction. The FFM is built on a cross-attention mechanism, which enables the exchange of long-range context and enhances both modalities' features at a global level. Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performance on five RGB-Depth benchmarks as well as on RGB-Thermal and RGB-Polarization datasets. In addition, to investigate generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets a new state of the art. Code is available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.
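
The cross-attention exchange described for the FFM can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, uses the other modality's tokens directly as keys and values, and adds a residual connection, all of which are simplifications chosen for illustration. The function names (`cross_attention`, `exchange`) and the toy token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over ALL context tokens.

    Assumption: keys and values are the raw context tokens
    (no learned Q/K/V projections, single head).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def exchange(rgb_tokens, x_tokens):
    """Bidirectional exchange: RGB queries X, and X queries RGB.

    Assumption: the attended context is added back residually.
    """
    rgb_ctx = cross_attention(rgb_tokens, x_tokens)
    x_ctx = cross_attention(x_tokens, rgb_tokens)
    rgb_out = [[a + b for a, b in zip(t, c)] for t, c in zip(rgb_tokens, rgb_ctx)]
    x_out = [[a + b for a, b in zip(t, c)] for t, c in zip(x_tokens, x_ctx)]
    return rgb_out, x_out


# Toy example: three flattened spatial positions, two channels per modality.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
rgb_out, x_out = exchange(rgb, depth)
```

Because every token in one modality attends over every token in the other, each output position can draw on context from anywhere in the image, which is the sense in which the exchange is long-range and global.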

updated: Tue Apr 12 2022 13:37:24 GMT+0000 (UTC)

published: Wed Mar 09 2022 16:12:08 GMT+0000 (UTC)

arXiv
