Self-Supervised Representation Learning for RGB-D Salient Object Detection

Xiaoqi Zhao; Youwei Pang; Lihe Zhang; Huchuan Lu; Xiang Ruan

Existing CNN-based RGB-D salient object detection (SOD) networks all need to be pre-trained on ImageNet to learn hierarchical features that provide a good initialization. However, collecting and annotating large-scale datasets is time-consuming and expensive. In this paper, we use self-supervised representation learning (SSL) to design two pretext tasks: a cross-modal auto-encoder and depth-contour estimation. These pretext tasks require only a small amount of unlabeled RGB-D data for pre-training, which lets the network capture rich semantic context and reduce the gap between the two modalities, thereby providing an effective initialization for the downstream task. In addition, to address the inherent problem of cross-modal fusion in RGB-D SOD, we propose a consistency-difference aggregation (CDA) module that splits a single feature fusion into multi-path fusion so that both consistent and differential information are adequately perceived. The CDA module is general and suits both cross-modal and cross-level feature fusion. In extensive experiments on six benchmark RGB-D SOD datasets, our model pre-trained on the RGB-D dataset (6,392 without any annotations) performs favorably against most state-of-the-art RGB-D methods pre-trained on ImageNet (1,280,000 with image-level annotations).
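The abstract does not say how the CDA module computes its paths, so the following is only a toy sketch of the general idea of splitting one fusion into a "consistent" path and a "differential" path. The function names, the choice of element-wise product and absolute difference, and the summation at the end are all assumptions made for illustration, not the operators defined in the paper.

```python
# Toy illustration only: the element-wise operators below are assumptions,
# not the CDA module as defined in the paper.

def consistency_path(a, b):
    """Emphasize responses both inputs agree on (element-wise product)."""
    return [x * y for x, y in zip(a, b)]


def difference_path(a, b):
    """Emphasize responses where the inputs disagree (absolute difference)."""
    return [abs(x - y) for x, y in zip(a, b)]


def multi_path_fusion(a, b):
    """Fuse two equally sized feature vectors through both paths and sum them."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    common = consistency_path(a, b)
    distinct = difference_path(a, b)
    return [c + d for c, d in zip(common, distinct)]


rgb_feat = [1.0, 0.0, 0.5, 1.0]    # hypothetical RGB feature responses
depth_feat = [1.0, 1.0, 0.5, 0.0]  # hypothetical depth feature responses
fused = multi_path_fusion(rgb_feat, depth_feat)
```

In this sketch, positions where only one modality responds are preserved by the difference path, while positions where both respond are preserved by the consistency path. The same function could take features from two network levels instead of two modalities, which mirrors the abstract's claim that the module applies to both cross-modal and cross-level fusion.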

updated: Wed Apr 14 2021 10:16:15 GMT+0000 (UTC)

published: Fri Jan 29 2021 09:16:06 GMT+0000 (UTC)

arXiv
