Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

Chaofan Ma; Yuhuan Yang; Yanfeng Wang; Ya Zhang; Weidi Xie

When trained at sufficient scale, self-supervised learning has shown a notable ability to solve a wide range of visual and language understanding tasks. In this paper, we investigate simple yet effective approaches for adapting pre-trained foundation models to a downstream task of interest, namely open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, which uses a lightweight, transformer-based fusion module to pair frozen visual representations with language concepts using only a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and a visual-language model (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of Fusioner, and on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite being trained only on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, where images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
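To make contribution (iv) concrete, the following is a minimal sketch of how a mosaicked sample could be assembled. The abstract states only that Mosaic-4 images are built by mosaicking FSS-1000 samples. The 2x2 layout, the tile order, the use of binary foreground masks, and the helper names `mosaic_2x2` and `mosaic_labels` are all assumptions made for illustration, not the authors' construction.

```python
from typing import List, Sequence

Grid = List[List[int]]  # a 2-D array stored as a list of rows


def mosaic_2x2(tiles: Sequence[Grid]) -> Grid:
    """Stitch four equally sized 2-D arrays into one 2x2 mosaic.

    Assumed tile order: top-left, top-right, bottom-left, bottom-right.
    """
    if len(tiles) != 4:
        raise ValueError("expected exactly four tiles")
    height = len(tiles[0])
    width = len(tiles[0][0])
    for tile in tiles:
        if len(tile) != height or any(len(row) != width for row in tile):
            raise ValueError("all tiles must share the same shape")
    # Concatenate rows side by side, then stack the two halves vertically.
    top = [tiles[0][r] + tiles[1][r] for r in range(height)]
    bottom = [tiles[2][r] + tiles[3][r] for r in range(height)]
    return top + bottom


def mosaic_labels(masks: Sequence[Grid], class_ids: Sequence[int]) -> Grid:
    """Turn four binary foreground masks into one multi-class label map.

    Foreground pixels of tile k receive class_ids[k]; background stays 0.
    """
    relabelled = [
        [[class_id if pixel else 0 for pixel in row] for row in mask]
        for mask, class_id in zip(masks, class_ids)
    ]
    return mosaic_2x2(relabelled)


if __name__ == "__main__":
    # Four toy 2x2 masks, each with a single foreground pixel.
    masks = [
        [[1, 0], [0, 0]],
        [[0, 1], [0, 0]],
        [[0, 0], [1, 0]],
        [[0, 0], [0, 1]],
    ]
    for row in mosaic_labels(masks, class_ids=[5, 6, 7, 8]):
        print(row)
```

Under these assumptions, a single mosaicked image contains several categories at once, so a model has to match each category name to the correct region rather than segment the one salient object.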

updated: Thu Oct 27 2022 02:57:26 GMT+0000 (UTC)

published: Thu Oct 27 2022 02:57:26 GMT+0000 (UTC)

arXiv
