Recursive Training for Zero-Shot Semantic Segmentation

Ce Wang; Moshiur Farazi; Nick Barnes

General-purpose semantic segmentation relies on a CNN backbone to extract discriminative features that help classify each image pixel into a 'seen' object class (i.e., an object class available during training) or a background class. Zero-shot semantic segmentation is a challenging task that requires a computer vision model to identify image pixels belonging to an object class it has never seen before. Equipping a general-purpose semantic segmentation model to separate image pixels of 'unseen' classes from the background remains an open challenge. Some recent models have approached this problem by fine-tuning the final pixel classification layer of a semantic segmentation model for a zero-shot setting, but they struggle to learn discriminative features due to the lack of supervision. We propose a recursive training scheme that supervises the retraining of a semantic segmentation model for a zero-shot setting using a pseudo-feature representation. To this end, we propose a Zero-Shot Maximum Mean Discrepancy (ZS-MMD) loss that weights high-confidence outputs of the pixel classification layer as a pseudo-feature representation and feeds it back to the generator. By closing the loop on the generator end, we provide supervision during retraining, which in turn helps the model learn a more discriminative feature representation for 'unseen' classes. We show that, using our recursive training and ZS-MMD loss, our proposed model achieves state-of-the-art performance on the Pascal-VOC 2012 and Pascal-Context datasets.
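The abstract does not give the exact form of the ZS-MMD loss, so the following is only a minimal sketch of the general idea: a squared Maximum Mean Discrepancy between generated features and pseudo-features, with the pseudo-features weighted by classifier confidence. The function names `gaussian_kernel` and `weighted_mmd2`, the Gaussian kernel, the bandwidth `sigma`, and the normalization of the confidence weights are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb rounding error.
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * (x @ y.T)
    )
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def weighted_mmd2(generated, pseudo, confidence, sigma=1.0):
    """Biased estimate of squared MMD with confidence-weighted pseudo-features.

    generated:  (n, d) features produced by the generator (hypothetical input).
    pseudo:     (m, d) pseudo-features taken from classifier outputs (hypothetical).
    confidence: (m,) non-negative confidence score per pseudo-feature (assumed).
    """
    n = generated.shape[0]
    a = np.full(n, 1.0 / n)            # uniform weights on generated features
    b = confidence / confidence.sum()  # assumed: weights proportional to confidence

    k_gg = gaussian_kernel(generated, generated, sigma)
    k_pp = gaussian_kernel(pseudo, pseudo, sigma)
    k_gp = gaussian_kernel(generated, pseudo, sigma)

    # MMD^2 = E[k(g, g')] + E[k(p, p')] - 2 E[k(g, p)] under the weights a and b.
    return float(a @ k_gg @ a + b @ k_pp @ b - 2.0 * (a @ k_gp @ b))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gen = rng.normal(size=(8, 4))
    pse = rng.normal(loc=1.0, size=(6, 4))
    conf = rng.uniform(0.5, 1.0, size=6)
    print(weighted_mmd2(gen, pse, conf))
```

In a recursive scheme such as the one described, a quantity of this kind would be minimized with respect to the generator so that its features move toward the confidently classified pseudo-features. How the paper selects high-confidence outputs and combines this term with other losses is not stated in the abstract.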

updated: Fri Feb 26 2021 23:44:16 GMT+0000 (UTC)

published: Fri Feb 26 2021 23:44:16 GMT+0000 (UTC)

arXiv
