Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net

Yu Qiu; Yun Liu; Le Zhang; Jing Xu

Existing salient object detection (SOD) methods mainly rely on U-shaped convolutional neural networks (CNNs) with skip connections to combine the global contexts and local spatial details that are crucial for locating salient objects and refining object details, respectively. Despite great successes, the ability of CNNs to learn global contexts is limited. Recently, the vision transformer has achieved revolutionary progress in computer vision owing to its powerful modeling of global dependencies. However, directly applying the transformer to SOD is suboptimal because the transformer lacks the ability to learn local spatial representations. To this end, this paper explores the combination of transformers and CNNs to learn both global and local representations for SOD. We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net). The asymmetric bilateral encoder has a transformer path and a lightweight CNN path, where the two paths communicate at each encoder stage to learn complementary global contexts and local spatial details, respectively. The asymmetric bilateral decoder also consists of two paths to process features from the transformer and CNN encoder paths, with communication at each decoder stage for decoding coarse salient object locations and fine-grained object details, respectively. Such communication between the two encoder/decoder paths enables ABiU-Net to learn complementary global and local representations, taking advantage of the natural properties of transformers and CNNs, respectively. Hence, ABiU-Net provides a new perspective for transformer-based SOD. Extensive experiments demonstrate that ABiU-Net performs favorably against previous state-of-the-art SOD methods. The code is available at https://github.com/yuqiuyuqiu/ABiU-Net.
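The idea of a two-path encoder stage with cross-path communication can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It is a minimal NumPy illustration under stated assumptions: the global path is a single-head self-attention with identity projections, the local path is a 3x3 convolution shared across channels, and "communication" is assumed to be an element-wise sum of the two paths' features before each path applies its own operator. The function names `self_attention`, `local_conv3x3`, and `bilateral_encoder_stage` are hypothetical.

```python
import numpy as np


def self_attention(tokens):
    """Toy global path: single-head self-attention with identity projections.

    tokens has shape (N, C). Every output token is a softmax-weighted
    average of all N input tokens, so each position sees the whole image.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)         # (N, N) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all positions
    return weights @ tokens


def local_conv3x3(feat, kernel):
    """Toy local path: 3x3 convolution with zero padding, shared across channels.

    feat has shape (H, W, C); kernel has shape (3, 3). Each output pixel
    depends only on its 3x3 neighbourhood.
    """
    h, w, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feat)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w, :]
    return out


def bilateral_encoder_stage(global_feat, local_feat, kernel):
    """One hypothetical encoder stage with two communicating paths.

    Assumption (not from the paper): the paths communicate by summing
    their features; each path then applies its own operator and adds the
    result to its own features as a residual.
    """
    h, w, c = global_feat.shape
    shared = global_feat + local_feat               # cross-path communication
    attended = self_attention(shared.reshape(h * w, c)).reshape(h, w, c)
    global_out = global_feat + attended             # global-context path
    local_out = local_feat + local_conv3x3(shared, kernel)  # local-detail path
    return global_out, local_out


# Usage on random features: both paths keep the (H, W, C) shape.
rng = np.random.default_rng(0)
g_in = rng.standard_normal((4, 4, 8))
l_in = rng.standard_normal((4, 4, 8))
box_kernel = np.full((3, 3), 1.0 / 9.0)             # simple averaging kernel
g_out, l_out = bilateral_encoder_stage(g_in, l_in, box_kernel)
print(g_out.shape, l_out.shape)
```

In this sketch the two paths stay distinct because each applies a different operator to the shared input: attention mixes information across all positions, while the convolution only mixes a 3x3 neighbourhood. A real model would add learned projections, downsampling between stages, and a matching two-path decoder.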

updated: Tue Sep 06 2022 02:43:13 GMT+0000 (UTC)

published: Tue Aug 17 2021 19:45:28 GMT+0000 (UTC)

arXiv
