CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Wenxuan Wang; Jing Liu; Xingjian He; Yisi Zhang; Chen Chen; Jiachen Shen; Yan Zhang; Jiangyun Li

Referring image segmentation (RIS) is a fundamental vision-language task that aims to segment a target object from an image based on a given natural language expression. Because image and text data have inherently different properties, most existing methods either introduce complex designs for fine-grained vision-language alignment or lack the required dense alignment, resulting in scalability issues or mis-segmentation problems such as over- or under-segmentation. To achieve effective and efficient fine-grained feature alignment in the RIS task, we explore the potential of masked multimodal modeling coupled with self-distillation and propose a novel cross-modality masked self-distillation framework named CM-MaskSD. Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment for better segmentation accuracy. Moreover, CM-MaskSD can considerably boost model performance in a nearly parameter-free manner, since it shares weights between the main segmentation branch and the introduced masked self-distillation branches and introduces only negligible parameters for coordinating the multimodal features. Comprehensive experiments on three benchmark datasets for the RIS task (RefCOCO, RefCOCO+, and G-Ref) demonstrate the superiority of the proposed framework over previous state-of-the-art methods.
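
To make the idea of a weight-sharing masked self-distillation branch more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the linear "encoders", the zero-vector masking, the cosine patch-word similarity map, the mean-squared-error distillation loss, and every name and size in it are assumptions chosen purely for illustration. It only shows how a masked branch might reuse the main branch's weights and be pulled toward the main branch's patch-word alignment.

```python
import math
import random


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of token vectors with zero vectors (assumed masking scheme)."""
    n_mask = int(len(tokens) * ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    dim = len(tokens[0])
    masked = [[0.0] * dim if i in masked_idx else list(t) for i, t in enumerate(tokens)]
    return masked, masked_idx


def encode(tokens, weight):
    """Toy linear encoder; the same weight matrix is reused by every branch."""
    return [[sum(w * x for w, x in zip(row, t)) for row in weight] for t in tokens]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def patch_word_similarity(patches, words):
    """Similarity map with one row per image patch and one column per word."""
    return [[cosine(p, w) for w in words] for p in patches]


def distillation_loss(student, teacher):
    """Mean squared error between two equally shaped similarity maps."""
    diffs = [(s - t) ** 2 for s_row, t_row in zip(student, teacher)
             for s, t in zip(s_row, t_row)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
dim, n_patches, n_words = 4, 6, 3  # hypothetical sizes
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
words = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_words)]
image_weight = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
text_weight = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

# Main branch: unmasked inputs; its similarity map serves as the distillation target.
word_features = encode(words, text_weight)
teacher_map = patch_word_similarity(encode(patches, image_weight), word_features)

# Masked branch: same weights, masked image patches; no new parameters are created.
masked_patches, masked_idx = mask_tokens(patches, 0.5, rng)
student_map = patch_word_similarity(encode(masked_patches, image_weight), word_features)

loss = distillation_loss(student_map, teacher_map)
print(f"masked patches: {sorted(masked_idx)}, distillation loss: {loss:.4f}")
```

In this sketch the masked branch adds no parameters of its own, which mirrors the abstract's "nearly parameter-free" claim only loosely; a text-masked branch could be sketched the same way by masking `words` instead of `patches`.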

updated: Wed Feb 14 2024 15:41:53 GMT+0000 (UTC)

published: Fri May 19 2023 07:17:27 GMT+0000 (UTC)

arXiv
