IFSeg: Image-free Semantic Segmentation via Vision-Language Model

Sukmin Yun; Seong Hyeon Park; Paul Hongsuck Seo; Jinwoo Shin

Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility with novel concepts (e.g., cross-modality transfer) across various visual tasks. However, VL-driven segmentation remains under-explored, and existing approaches still carry the burden of acquiring additional training images, or even segmentation annotations, to adapt a VL model to downstream segmentation tasks. In this paper, we introduce a novel image-free segmentation task, where the goal is to perform semantic segmentation given only a set of target semantic categories, without any task-specific images or annotations. To tackle this challenging task, our proposed method, coined IFSeg, generates VL-driven artificial image-segmentation pairs and uses them to adapt a pre-trained VL model to the segmentation task. We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens. Given that a pre-trained VL model projects visual and text tokens into a common space in which tokens that share semantics lie close together, this artificially generated word map can replace real image inputs for such a VL model. Through an extensive set of experiments, our model not only establishes an effective baseline for this novel task but also demonstrates strong performance compared to existing methods that rely on stronger supervision, such as task-specific images and segmentation masks. Code is available at https://github.com/alinlab/ifseg.
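The data construction described in the abstract (a 2D map of random semantic categories paired with a map of the corresponding word tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_artificial_pair`, the parameters `grid_size` and `upscale`, and the nearest-neighbour upsampling used to form contiguous regions are all assumptions made for illustration, and plain strings stand in for the word tokens that a real VL model's tokenizer would produce.

```python
import random


def make_artificial_pair(categories, grid_size=4, upscale=2, seed=None):
    """Build one artificial (word_map, label_map) training pair.

    Hypothetical helper: names and defaults are illustrative assumptions,
    not taken from the IFSeg code base.
    """
    if not categories:
        raise ValueError("categories must be non-empty")
    rng = random.Random(seed)

    # Coarse 2D grid of random category indices.
    coarse = [
        [rng.randrange(len(categories)) for _ in range(grid_size)]
        for _ in range(grid_size)
    ]

    # Assumption: nearest-neighbour upsampling, so that each coarse cell
    # becomes an upscale x upscale block of one category.
    size = grid_size * upscale
    label_map = [
        [coarse[r // upscale][c // upscale] for c in range(size)]
        for r in range(size)
    ]

    # Map of the corresponding words; this plays the role of the "image" input.
    word_map = [[categories[i] for i in row] for row in label_map]
    return word_map, label_map


if __name__ == "__main__":
    words, labels = make_artificial_pair(["cat", "dog", "grass"], seed=0)
    for row in words:
        print(" ".join(f"{w:5s}" for w in row))
```

In this sketch the word map would be fed to the model in place of image tokens and the label map would serve as the segmentation target; how the actual method embeds the word tokens and trains the model is not specified in the abstract and is left out here.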

updated: Sat Mar 25 2023 08:19:31 GMT+0000 (UTC)

published: Sat Mar 25 2023 08:19:31 GMT+0000 (UTC)

arXiv
