CRIS: CLIP-Driven Referring Image Segmentation

Zhaoqing Wang; Yu Lu; Qiang Li; Xunqiang Tao; Yandong Guo; Mingming Gong; Tongliang Liu


Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and images, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet they transfer the language and vision knowledge from these models separately, ignoring the multi-modal correspondence information. Inspired by recent advances in Contrastive Language-Image Pretraining (CLIP), we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder that propagates fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning, which explicitly enforces the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. Experimental results on three benchmark datasets demonstrate that the proposed framework significantly outperforms previous state-of-the-art methods without any post-processing. The code will be released.
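The text-to-pixel contrastive objective described in the abstract can be illustrated with a small sketch. This is only one plausible reading, not the authors' implementation: it assumes the similarity between the text feature and each pixel feature is a dot product passed through a sigmoid, and that the loss is a per-pixel binary cross-entropy against the ground-truth mask. The function name `text_to_pixel_contrastive_loss` and the toy feature values are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask):
    """Hypothetical text-to-pixel contrastive loss (assumed form).

    text_feat:   list of D floats, one sentence-level text feature
    pixel_feats: list of N pixel features, each a list of D floats
    mask:        list of N labels, 1 if the pixel belongs to the referent
    """
    if not mask or len(pixel_feats) != len(mask):
        raise ValueError("pixel_feats and mask must be non-empty and equal in length")
    total = 0.0
    for feat, label in zip(pixel_feats, mask):
        # Assumed similarity: dot product between text and pixel feature.
        score = sum(t * p for t, p in zip(text_feat, feat))
        # Clamp the probability so the logarithm stays finite.
        prob = min(max(sigmoid(score), 1e-7), 1.0 - 1e-7)
        # Pull related pixels toward the text, push irrelevant ones away.
        total += -math.log(prob) if label else -math.log(1.0 - prob)
    return total / len(mask)


# Toy example: pixel 0 agrees with the text feature, pixel 1 opposes it.
text = [1.0, 0.0]
pixels = [[4.0, 0.0], [-4.0, 0.0], [0.0, 3.0]]
aligned = text_to_pixel_contrastive_loss(text, pixels, [1, 0, 0])
misaligned = text_to_pixel_contrastive_loss(text, pixels, [0, 1, 0])
print(aligned, misaligned)
```

Under these assumptions, the loss is lower when the mask marks the pixel whose feature agrees with the text feature, which is the behaviour the abstract describes as enforcing similarity to related pixels and dissimilarity to irrelevant ones.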

updated: Tue Nov 30 2021 07:29:08 GMT+0000 (UTC)

published: Tue Nov 30 2021 07:29:08 GMT+0000 (UTC)

arXiv
