Feature-Proxy Transformer for Few-Shot Segmentation

Jian-Wei Zhang; Yifan Sun; Yi Yang; Wei Chen

Few-shot segmentation (FSS) aims to perform semantic segmentation on novel classes given a few annotated support samples. Rethinking recent advances, we find that the current FSS framework has deviated far from the supervised segmentation framework: given the deep features, FSS methods typically use an intricate decoder to perform sophisticated pixel-wise matching, while supervised segmentation methods use a simple linear classification head. Because of the intricacy of the decoder and its matching pipeline, such an FSS framework is not easy to follow. This paper revives the straightforward framework of "feature extractor + linear classification head" and proposes a novel Feature-Proxy Transformer (FPTrans) method, in which the "proxy" is the vector representing a semantic class in the linear classification head. FPTrans has two key points for learning discriminative features and representative proxies: 1) To better utilize the limited support samples, the feature extractor makes the query interact with the support features from the bottom to the top layers using a novel prompting strategy. 2) FPTrans uses multiple local background proxies (instead of a single one) because the background is not homogeneous and may contain some novel foreground regions. These two key points are easily integrated into the vision transformer backbone with the prompting mechanism in the transformer. Given the learned features and proxies, FPTrans directly compares their cosine similarity for segmentation. Although the framework is straightforward, we show that FPTrans achieves competitive FSS accuracy on par with state-of-the-art decoder-based methods.
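
The final step described in the abstract, comparing features with proxies by cosine similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `masked_mean`, `segment`), the toy 2-D features, the masked-average construction of the foreground proxy, and the rule "foreground if the foreground proxy scores higher than every background proxy" are all assumptions made for illustration. The abstract does not specify how proxies are built or how the scores are combined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def masked_mean(features, mask):
    """Average the feature vectors selected by the mask.

    Hypothetical proxy construction, assumed here for illustration only.
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no features")
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]

def segment(query_features, fg_proxy, bg_proxies):
    """Label each query feature 1 (foreground) or 0 (background).

    Assumed decision rule: a location is foreground only if its similarity
    to the foreground proxy exceeds its best similarity to any of the
    local background proxies.
    """
    labels = []
    for f in query_features:
        fg_score = cosine(f, fg_proxy)
        bg_score = max(cosine(f, p) for p in bg_proxies)
        labels.append(1 if fg_score > bg_score else 0)
    return labels

# Toy support features (one vector per location) and a binary support mask.
support_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [-1.0, 0.2]]
support_mask = [1, 1, 0, 0]

fg_proxy = masked_mean(support_features, support_mask)
# Two local background proxies instead of one averaged background vector,
# since the two background locations point in very different directions.
bg_proxies = [[0.0, 1.0], [-1.0, 0.2]]

query_features = [[0.8, 0.1], [0.1, 0.9], [-0.7, 0.1]]
labels = segment(query_features, fg_proxy, bg_proxies)
print(labels)
```

In this toy setup each background location keeps its own proxy, which mirrors the abstract's motivation that the background is not homogeneous. A real system would operate on learned transformer features rather than hand-written vectors.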

updated: Thu Oct 13 2022 11:22:27 GMT+0000 (UTC)

published: Thu Oct 13 2022 11:22:27 GMT+0000 (UTC)

arXiv
