Decoupling Zero-Shot Semantic Segmentation

Jian Ding; Nan Xue; Gui-Song Xia; Dengxin Dai

Zero-shot semantic segmentation (ZS3) aims to segment novel categories that were not seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, the pixel-level formulation has limited capability to integrate vision-language models, which are often pre-trained on image-text pairs and currently show great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter sub-task operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained on image-text pairs (e.g., CLIP) for ZS3. Based on this decoupled formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 35 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
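
The two sub-tasks described in the abstract can be illustrated with a toy sketch. This is not the ZegFormer implementation. It assumes the class-agnostic grouping stage has already produced binary segment masks with one embedding per segment, and that each class name has a text embedding in the same space (as a CLIP-style model would provide). The function names `classify_segments` and `assemble_label_map`, the two-dimensional embeddings, and the class names are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_segments(segment_embeddings, class_embeddings):
    """Sub-task 2 (sketch): give each segment the class whose text
    embedding is most similar to the segment embedding."""
    names = list(class_embeddings)
    labels = []
    for seg in segment_embeddings:
        scores = [cosine(seg, class_embeddings[name]) for name in names]
        labels.append(names[scores.index(max(scores))])
    return labels


def assemble_label_map(masks, labels, background="unlabeled"):
    """Paint each segment's label onto the pixels covered by its mask."""
    height, width = len(masks[0]), len(masks[0][0])
    label_map = [[background] * width for _ in range(height)]
    for mask, label in zip(masks, labels):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    label_map[y][x] = label
    return label_map


# Sub-task 1 (assumed done elsewhere): two class-agnostic masks on a 2x3 image.
masks = [
    [[1, 1, 0], [1, 1, 0]],
    [[0, 0, 1], [0, 0, 1]],
]
# Hypothetical segment embeddings and class text embeddings.
segment_embeddings = [[0.9, 0.1], [0.2, 0.8]]
class_embeddings = {"cat": [1.0, 0.0], "zebra": [0.0, 1.0]}

labels = classify_segments(segment_embeddings, class_embeddings)
label_map = assemble_label_map(masks, labels)
```

Because the grouping step never sees class names, only the `class_embeddings` dictionary has to change to handle an unseen class. That is the property the decoupled formulation relies on.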

updated: Wed Dec 15 2021 06:21:47 GMT+0000 (UTC)

published: Wed Dec 15 2021 06:21:47 GMT+0000 (UTC)

arXiv
