Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Sixiao Zheng; Jiachen Lu; Hengshuang Zhao; Xiatian Zhu; Zekun Luo; Yabiao Wang; Yanwei Fu; Jianfeng Feng; Tao Xiang; Philip H. S. Torr; Li Zhang

Most recent semantic segmentation methods adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract, semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on enlarging the receptive field, either through dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR sets a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and achieves competitive results on Cityscapes. In particular, we achieved first place on the highly competitive ADE20K test server leaderboard on the day of submission.
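To make the phrase "encode an image as a sequence of patches" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `image_to_patch_sequence`, the single-channel nested-list image, and the patch size are hypothetical choices, since the abstract does not specify them. In an actual model, each flattened patch would then be linearly projected to an embedding and fed to the transformer; that step is omitted here.

```python
from typing import List

# Assumption: a single-channel H x W image stored as nested lists, for brevity.
Image = List[List[float]]


def image_to_patch_sequence(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row, and tiles are emitted in raster
    order (left to right, top to bottom), yielding a 1D token sequence
    of length (H / patch) * (W / patch).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")

    sequence: List[List[float]] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ]
            sequence.append(tile)
    return sequence


# Toy 4 x 4 image whose pixel values are 0..15 in raster order.
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = image_to_patch_sequence(toy_image, patch=2)
```

With the toy input above, the sequence has four tokens of four values each. Because the spatial layout is recoverable from the token order, a decoder can reshape the transformer's output sequence back into a 2D feature map before predicting per-pixel labels.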

updated: Sun Jul 25 2021 10:44:52 GMT+0000 (UTC)

published: Thu Dec 31 2020 18:55:57 GMT+0000 (UTC)

arXiv
