Segmenter: Transformer for Semantic Segmentation

Robin Strudel; Ricardo Garcia; Ivan Laptev; Cordelia Schmid

Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based methods, our approach models global context from the first layer onward and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that we can fine-tune them on the moderate-sized datasets available for semantic segmentation. The linear decoder already obtains excellent results, and performance can be further improved by a mask transformer that generates class masks. We conduct an extensive ablation study to show the impact of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation: it outperforms the state of the art on both the ADE20K and Pascal Context datasets and is competitive on Cityscapes.
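To make the point-wise linear decoder described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the image is split into non-overlapping square patches, that each patch embedding is mapped to class scores by one shared linear layer, and that the patch-level scores are upsampled to pixel resolution. The helper names (`patchify`, `linear_decoder`), the random projection standing in for the ViT encoder, the nearest-neighbour upsampling, and all sizes are illustrative assumptions that do not come from the source.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop to a multiple of the patch size, then group pixels by patch.
    cropped = image[: gh * patch_size, : gw * patch_size]
    patches = cropped.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c), (gh, gw)


def linear_decoder(patch_embeddings, weight, bias, grid, patch_size):
    """Score every patch embedding with one shared linear layer,
    then upsample the patch-level scores to a per-pixel label map."""
    gh, gw = grid
    logits = patch_embeddings @ weight + bias        # (num_patches, num_classes)
    logits = logits.reshape(gh, gw, -1)              # back to the patch grid
    # Assumption: nearest-neighbour upsampling keeps the sketch dependency-free.
    logits = np.repeat(np.repeat(logits, patch_size, axis=0), patch_size, axis=1)
    return logits.argmax(axis=-1)                    # (H, W) class indices


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patch_size, dim, num_classes = 16, 32, 5             # hypothetical sizes

patches, grid = patchify(image, patch_size)
# Stand-in for the transformer encoder: a random linear projection (assumption).
patch_embeddings = patches @ rng.standard_normal((patches.shape[1], dim))

weight = rng.standard_normal((dim, num_classes))
bias = np.zeros(num_classes)
seg = linear_decoder(patch_embeddings, weight, bias, grid, patch_size)
print(seg.shape)  # (64, 64)
```

In this sketch every pixel inside a patch receives the same label, which gives one intuition for the abstract's observation that smaller patch sizes perform better: the predicted map has a finer grid to work with.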

updated: Thu Sep 02 2021 10:36:48 GMT+0000 (UTC)

published: Wed May 12 2021 13:01:44 GMT+0000 (UTC)

arXiv
