Middle-level Fusion for Lightweight RGB-D Salient Object Detection

Nianchang Huang; Qiang Zhang; Jungong Han

Most existing lightweight RGB-D salient object detection (SOD) models are based on either a two-stream or a single-stream structure. The former first uses two sub-networks to extract unimodal features from the RGB and depth images, respectively, and then fuses them for SOD. The latter directly extracts multi-modal features from the input RGB-D images and then focuses on exploiting cross-level complementary information. However, two-stream models inevitably require more parameters, and single-stream models cannot fully exploit the cross-modal complementary information because they ignore the modality difference. To address these issues, we propose a middle-level fusion structure for designing lightweight RGB-D SOD models. It first employs two sub-networks to extract low- and middle-level unimodal features, respectively, and then fuses the extracted middle-level unimodal features so that a subsequent sub-network can extract the corresponding high-level multi-modal features. Unlike existing models, this structure can effectively exploit the cross-modal complementary information while significantly reducing the number of network parameters. Based on this structure, we design a novel lightweight SOD model that contains an information-aware multi-modal feature fusion (IMFF) module for effectively capturing the cross-modal complementary information and a lightweight feature-level and decision-level feature fusion (LFDF) module for aggregating feature-level and decision-level saliency information at different stages with fewer parameters. Our proposed model has only 3.9M parameters and runs at 33 FPS. Experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.
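
The parameter trade-off between the three structures can be illustrated with a rough count of convolution weights. The sketch below is not the authors' network: the stage widths (`LOW_MID`, `HIGH`), the plain 3x3 convolutions, and the parameter-free element-wise fusion are all hypothetical choices made only to show why sharing the high-level stages saves parameters compared with a two-stream design. The IMFF and LFDF modules are not modeled.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a single k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def stack_params(widths, c_in):
    """Total parameters of a chain of convolutions with the given output widths."""
    total = 0
    for c_out in widths:
        total += conv_params(c_in, c_out)
        c_in = c_out
    return total


# Hypothetical stage widths (assumptions, not taken from the paper).
LOW_MID = [16, 32, 64]   # low- and middle-level stages
HIGH = [128, 256]        # high-level stages


def two_stream():
    """Separate full backbones for RGB (3 channels) and depth (1 channel)."""
    return stack_params(LOW_MID + HIGH, 3) + stack_params(LOW_MID + HIGH, 1)


def single_stream():
    """One backbone on the concatenated 4-channel RGB-D input."""
    return stack_params(LOW_MID + HIGH, 4)


def middle_fusion():
    """Two unimodal low/middle-level branches, then one shared high-level branch.

    Fusion is assumed to be element-wise (no extra parameters), so the shared
    branch receives LOW_MID[-1] channels.
    """
    unimodal = stack_params(LOW_MID, 3) + stack_params(LOW_MID, 1)
    shared = stack_params(HIGH, LOW_MID[-1])
    return unimodal + shared


if __name__ == "__main__":
    for name, fn in [("two-stream", two_stream),
                     ("single-stream", single_stream),
                     ("middle-level fusion", middle_fusion)]:
        print(f"{name:>20}: {fn():,} parameters")
```

Under these assumed widths, the middle-level structure lands close to the single-stream count while still keeping separate unimodal branches for the early stages, because most parameters sit in the wide high-level layers that it shares.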

updated: Sat Jun 05 2021 09:50:02 GMT+0000 (UTC)

published: Fri Apr 23 2021 11:37:15 GMT+0000 (UTC)

arXiv
