Towards Universal Vision-language Omni-supervised Segmentation

Bowen Dong; Jiaxi Gu; Jianhua Han; Hang Xu; Wangmeng Zuo

Existing open-world universal segmentation approaches usually combine CLIP with pre-computed proposal masks and treat open-world segmentation as proposal classification. However, 1) these approaches cannot perform universal segmentation end to end, and 2) the limited scale of panoptic datasets restricts their open-world segmentation ability on thing classes. In this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from the Mask2Former universal segmentation framework, paired with a CLIP text encoder. To strengthen open-world segmentation ability and improve segmentation accuracy, we train on omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pair data). To improve training efficiency and make full use of the omni-supervised data, we propose several techniques: an FPN-style encoder, a switchable training technique, and a positive classification loss. Because it is trained end to end with these techniques, VLOSS can be applied to various open-world segmentation tasks without further adaptation. Experimental results on several open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer parameters, VLOSS with a Swin-Tiny backbone surpasses MaskCLIP by ~2% in mask AP on the LVIS v1 dataset.
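The abstract does not spell out how a segmentation framework with a text encoder assigns open-world class labels, so the following is only a minimal sketch of one common approach, not the authors' implementation: each mask query embedding is compared against text embeddings of the class names by cosine similarity, and a softmax over the scaled similarities gives class probabilities. The function names, the toy 3-dimensional embeddings, and the `temperature` value are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_queries(query_embeds, text_embeds, temperature=0.07):
    """Return per-query class probabilities from cosine similarity.

    query_embeds: one embedding per predicted mask (hypothetical decoder output).
    text_embeds: one embedding per class name (hypothetical text-encoder output).
    temperature: assumed scaling factor, not taken from the paper.
    """
    text_unit = [l2_normalize(t) for t in text_embeds]
    all_probs = []
    for query in query_embeds:
        q_unit = l2_normalize(query)
        logits = [
            sum(a * b for a, b in zip(q_unit, t)) / temperature
            for t in text_unit
        ]
        all_probs.append(softmax(logits))
    return all_probs


# Toy data: three class names with made-up embeddings, two mask queries.
class_names = ["cat", "dog", "sky"]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_embeds = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

probs = classify_queries(query_embeds, text_embeds)
predicted = [
    class_names[max(range(len(p)), key=p.__getitem__)] for p in probs
]
print(predicted)
```

In a setup like this, the class set is just a list of text embeddings, so a new category can be added at inference time by embedding its name and appending a row, which is one way to read the claim that the model applies to new open-world tasks without further adaptation.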

updated: Sun Mar 12 2023 02:57:53 GMT+0000 (UTC)

published: Sun Mar 12 2023 02:57:53 GMT+0000 (UTC)

arXiv

