Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

Jilan Xu; Junlin Hou; Yuejie Zhang; Rui Feng; Yi Wang; Yu Qiao; Weidi Xie


In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes rather than a pre-defined, closed set of categories. The main contributions are as follows. First, we propose a transformer-based model for OVS, termed OVSegmentor, which uses only web-crawled image-text pairs for pre-training, without any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention-based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M for frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets: PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs. 134M) for pre-training. Code and pre-trained models will be released for future research.
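As a rough illustration of the dataset-construction step, the following minimal sketch filters image-caption samples by frequently appearing entities. The abstract does not say how entities are extracted or what frequency cut-off is used, so this is an assumption-laden sketch rather than the authors' pipeline: the function name `filter_by_frequent_entities`, the `top_k` parameter, and the pre-extracted entity lists are all hypothetical.

```python
from collections import Counter


def filter_by_frequent_entities(samples, top_k):
    """Keep samples whose caption mentions at least one frequent entity.

    samples: list of (caption, entities) pairs, where entities is a list of
             strings assumed to be extracted beforehand (hypothetical input).
    top_k:   number of most frequent entities to keep (hypothetical cut-off).
    """
    # Count the number of captions each entity appears in;
    # set() avoids counting an entity twice within one caption.
    counts = Counter()
    for _, entities in samples:
        counts.update(set(entities))

    # Rank by descending frequency, breaking ties alphabetically
    # so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    frequent = {entity for entity, _ in ranked[:top_k]}

    # Retain only samples that share at least one entity with the frequent set.
    kept = [
        (caption, entities)
        for caption, entities in samples
        if frequent.intersection(entities)
    ]
    return kept, frequent


# Toy data standing in for caption/entity pairs (not taken from CC12M).
samples = [
    ("a dog runs on the grass", ["dog", "grass"]),
    ("a dog next to a car", ["dog", "car"]),
    ("a car parked on the street", ["car", "street"]),
    ("an abstract painting", ["painting"]),
]
kept, frequent = filter_by_frequent_entities(samples, top_k=2)
```

In this toy run, the frequent set is `dog` and `car`, so the caption that mentions only `painting` is dropped. A real pipeline would additionally need an entity extractor and a cut-off chosen for the target vocabulary.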

updated: Fri Mar 03 2023 04:23:55 GMT+0000 (UTC)

published: Sun Jan 22 2023 13:10:05 GMT+0000 (UTC)

arXiv

