Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Junbum Cha; Jonghwan Mun; Byungseok Roh

We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they consider only image-text alignment during training, whereas segmentation requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with 8 widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.

updated: Sun Mar 26 2023 11:16:30 GMT+0000 (UTC)

published: Thu Dec 01 2022 18:59:03 GMT+0000 (UTC)

arXiv
