Activation to Saliency: Forming High-Quality Labels for Completely Unsupervised Salient Object Detection

Huajun Zhou; Peijia Chen; Lingxiao Yang; Jianhuang Lai; Xiaohua Xie

Existing deep learning-based Unsupervised Salient Object Detection (USOD) methods rely on deep models pre-trained with supervision. Moreover, they generate pseudo labels from hand-crafted features, which lack high-level semantic information. To overcome these shortcomings, we propose a new two-stage Activation-to-Saliency (A2S) framework that effectively excavates high-quality saliency cues to train a robust saliency detector. Notably, our method requires no manual annotation, even in the pre-training phase. In the first stage, we transform a network pre-trained without supervision so that it aggregates multi-level features into a single activation map, and we propose an Adaptive Decision Boundary (ADB) to assist the training of the transformed network. We also propose a new loss function to facilitate the generation of high-quality pseudo labels. In the second stage, a self-rectification learning strategy is developed to train a saliency detector and refine the pseudo labels online. In addition, we construct a lightweight saliency detector using two Residual Attention Modules (RAMs), which substantially reduces the risk of overfitting. Extensive experiments on several SOD benchmarks show that our framework performs significantly better than existing USOD methods. Moreover, training our framework on 3,000 images takes about 1 hour, which is over 30× faster than previous state-of-the-art methods.
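The first-stage idea of collapsing multi-level features into one activation map and then turning that map into a pseudo label can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define the ADB or the loss, so the element-wise averaging and the per-image mean threshold below are stand-in assumptions, and the function names `aggregate_maps`, `adaptive_boundary`, and `pseudo_label` are hypothetical. The maps are assumed to be already resized to a common resolution.

```python
def aggregate_maps(feature_maps):
    """Average several equally sized 2D maps element-wise into one activation map.

    Assumption: plain averaging stands in for the learned aggregation in the paper.
    """
    n = len(feature_maps)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(fm[r][c] for fm in feature_maps) / n for c in range(cols)]
        for r in range(rows)
    ]


def adaptive_boundary(activation):
    """Return a per-image threshold.

    Assumption: the mean activation is used here only as an illustrative
    image-dependent boundary; the paper's ADB may be defined differently.
    """
    values = [v for row in activation for v in row]
    return sum(values) / len(values)


def pseudo_label(activation):
    """Binarize the activation map: 1 where it exceeds the boundary, else 0."""
    boundary = adaptive_boundary(activation)
    return [[1 if v > boundary else 0 for v in row] for row in activation]


# Two toy 2x2 "feature maps" from different levels of a network.
maps = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
activation = aggregate_maps(maps)
label = pseudo_label(activation)
```

Because the threshold is computed per image, a map with uniformly low activations still yields a foreground/background split, which a fixed global threshold would not.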

updated: Fri Dec 24 2021 01:53:24 GMT+0000 (UTC)

published: Tue Dec 07 2021 11:54:06 GMT+0000 (UTC)

arXiv
