Few-Shot Segmentation via Cycle-Consistent Transformer

Gengwei Zhang; Guoliang Kang; Yi Yang; Yunchao Wei

Few-shot segmentation aims to train a segmentation model that can adapt quickly to novel classes given only a few exemplars. The conventional training paradigm is to learn to make predictions on query images conditioned on features from support images. Previous methods used only semantic-level prototypes of the support images as conditioning information. As a result, they cannot exploit all of the pixel-wise support information when predicting on the query, even though that information is critical for segmentation. In this paper, we focus on utilizing pixel-wise relationships between support and query images to facilitate the few-shot segmentation task. We design a novel Cycle-Consistent TRansformer (CyCTR) module to aggregate pixel-wise support features into query features. CyCTR performs cross-attention between features from different images, i.e., the support and query images. We observe that some pixel-level support features may be unexpectedly irrelevant. Performing cross-attention directly may aggregate these features from support to query and bias the query features. We therefore propose a novel cycle-consistent attention mechanism that filters out potentially harmful support features and encourages query features to attend to the most informative pixels in the support images. Experiments on all few-shot segmentation benchmarks demonstrate that the proposed CyCTR yields a remarkable improvement over previous state-of-the-art methods. Specifically, on the Pascal-5^i and COCO-20^i datasets, we achieve 66.6% and 45.6% mIoU for 5-shot segmentation, outperforming previous state-of-the-art methods by 4.6% and 7.1%, respectively.
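
The abstract does not spell out how the cycle-consistent filtering works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a single attention head with no learned projections, flattened pixel features, and this filtering rule: a support pixel is kept only if walking from it to its most similar query pixel and back to that query pixel's most similar support pixel lands on a support pixel with the same mask label. The function name `cycle_consistent_attention` and all argument names are hypothetical.

```python
import numpy as np

def cycle_consistent_attention(query, support, support_mask):
    """Aggregate support features into query features (illustrative sketch).

    query:        (Nq, d) flattened query-image pixel features
    support:      (Ns, d) flattened support-image pixel features
    support_mask: (Ns,)   integer label per support pixel (1 = foreground)
    Returns (aggregated (Nq, d), weights (Nq, Ns), consistent (Ns,)).
    """
    d = query.shape[1]
    logits = query @ support.T / np.sqrt(d)        # (Nq, Ns) scaled dot products

    # Forward step: for each support pixel, its most similar query pixel.
    best_query = logits.argmax(axis=0)             # (Ns,)
    # Backward step: from that query pixel, its most similar support pixel.
    back_support = logits.argmax(axis=1)[best_query]  # (Ns,)

    # Assumed rule: the cycle is consistent when both ends share a mask label.
    consistent = support_mask[back_support] == support_mask

    # Inconsistent support pixels get a -inf bias, i.e. zero attention weight.
    # Guard (an addition for numerical safety): if nothing survives, keep all.
    keep = consistent if consistent.any() else np.ones_like(consistent)
    masked = logits + np.where(keep, 0.0, -np.inf)[None, :]

    # Row-wise softmax over the surviving support pixels.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ support, weights, consistent

# Toy example: the third support pixel is background but resembles foreground.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
support = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
support_mask = np.array([1, 0, 0])
aggregated, weights, consistent = cycle_consistent_attention(
    query, support, support_mask
)
```

In the toy example, the third support pixel is labeled background but its nearest query pixel maps back to a foreground support pixel, so under the assumed rule it receives zero attention weight and cannot bias the query features.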

updated: Tue Dec 21 2021 07:24:53 GMT+0000 (UTC)

published: Fri Jun 04 2021 07:57:48 GMT+0000 (UTC)

arXiv
