Visual Saliency Transformer

Nian Liu; Ni Zhang; Kaiyuan Wan; Junwei Han; Ling Shao

Recently, a large number of saliency detection methods have achieved promising results by relying on CNN-based architectures. As an alternative, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global context among them. Going beyond the conventional transformer architecture used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to obtain high-resolution detection results. We also develop a token-based multi-task decoder that simultaneously performs saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing state-of-the-art methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models.
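The core idea of the abstract above, treating image patches as a token sequence and letting every patch attend to every other patch, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `image_to_patches` and `self_attention` are hypothetical, the query/key/value projections are assumed to be the identity, and positional embeddings, multi-level token fusion, token upsampling, and the task-related tokens are all left out.

```python
import math


def image_to_patches(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: identity Q/K/V projections, so each
    token is used directly as its own query, key, and value.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this patch to every patch in the image.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all patch tokens: global context in one step.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return outputs, weights


# A toy 4 x 4 single-channel "image" split into four 2 x 2 patches.
image = [[float((r * 4 + c) % 3) for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, 2)
mixed, weights = self_attention(tokens)
```

Each row of `weights` is a distribution over all patches, so every output token mixes information from the whole image in a single layer, whereas a convolution only sees a local neighborhood.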

updated: Sun Apr 25 2021 08:24:06 GMT+0000 (UTC)

published: Sun Apr 25 2021 08:24:06 GMT+0000 (UTC)

arXiv
