Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net

Yu Qiu; Yun Liu; Le Zhang; Jing Xu

Existing salient object detection (SOD) methods mainly rely on CNN-based U-shaped structures with skip connections to combine the global contexts and local spatial details that are crucial for locating salient objects and refining object details, respectively. Despite their great success, CNNs have a limited ability to learn global contexts. Recently, the vision transformer has achieved revolutionary progress in computer vision owing to its powerful modeling of global dependencies. However, directly applying the transformer to SOD is suboptimal because the transformer lacks the ability to learn local spatial representations. To this end, this paper explores the combination of transformer and CNN to learn both global and local representations for SOD. We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net). The asymmetric bilateral encoder has a transformer path and a lightweight CNN path, where the two paths communicate at each encoder stage to learn complementary global contexts and local spatial details, respectively. The asymmetric bilateral decoder also consists of two paths to process features from the transformer and CNN encoder paths, with communication at each decoder stage for decoding coarse salient object locations and fine-grained object details, respectively. Such communication between the two encoder/decoder paths enables ABiU-Net to learn complementary global and local representations, taking advantage of the natural properties of transformer and CNN, respectively. Hence, ABiU-Net provides a new perspective for transformer-based SOD. Extensive experiments demonstrate that ABiU-Net performs favorably against previous state-of-the-art SOD methods. The code will be released.
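
The two-path encoder with per-stage communication described above can be illustrated with a toy, dependency-free sketch on a 1-D signal. This is not the authors' implementation: the abstract does not specify how the paths communicate, so the weighted additive exchange (`alpha`), the mean-based `global_mix` standing in for the transformer path, the 3-tap `local_mix` standing in for the CNN path, and all function names here are assumptions chosen only to show the data flow.

```python
from typing import List, Tuple

Signal = List[float]


def global_mix(x: Signal) -> Signal:
    """Hypothetical stand-in for the transformer path: each position
    is blended with the mean of the whole signal (global context)."""
    mean = sum(x) / len(x)
    return [0.5 * v + 0.5 * mean for v in x]


def local_mix(x: Signal) -> Signal:
    """Hypothetical stand-in for the CNN path: 3-tap average with
    edge replication (local spatial detail only)."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def downsample(x: Signal) -> Signal:
    """Halve the resolution by averaging adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def bilateral_encoder(x: Signal, num_stages: int = 3,
                      alpha: float = 0.5) -> Tuple[List[Signal], List[Signal]]:
    """Run both paths stage by stage, exchanging features at every stage.

    Assumes len(x) is divisible by 2 ** (num_stages - 1). Returns the
    per-stage features of each path, as a decoder with skip connections
    would consume them.
    """
    glob, loc = list(x), list(x)
    glob_feats: List[Signal] = []
    loc_feats: List[Signal] = []
    for _ in range(num_stages):
        g_out = global_mix(glob)
        l_out = local_mix(loc)
        # Communication step (assumed form): each path adds a scaled
        # copy of the other path's output to its own.
        glob = [g + alpha * l for g, l in zip(g_out, l_out)]
        loc = [l + alpha * g for g, l in zip(g_out, l_out)]
        glob_feats.append(glob)
        loc_feats.append(loc)
        glob, loc = downsample(glob), downsample(loc)
    return glob_feats, loc_feats


if __name__ == "__main__":
    step_edge = [0.0] * 4 + [1.0] * 4
    g_feats, l_feats = bilateral_encoder(step_edge)
    for stage, (g, l) in enumerate(zip(g_feats, l_feats), start=1):
        print(f"stage {stage}: global path len={len(g)}, local path len={len(l)}")
```

In this sketch both paths see the same input but keep separate states, so each stage's output mixes a globally informed signal with a locally smoothed one. A real model would replace the two mixing functions with transformer blocks and convolutional blocks operating on 2-D feature maps.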

updated: Mon Aug 30 2021 12:04:40 GMT+0000 (UTC)

published: Tue Aug 17 2021 19:45:28 GMT+0000 (UTC)

arXiv
