Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

Yang Jiao; Zequn Jie; Weixin Luo; Jingjing Chen; Yu-Gang Jiang; Xiaolin Wei; Lin Ma

Referring Image Segmentation (RIS) aims to segment the target object in an image referred to by a given natural language expression. The diverse and flexible expressions, together with the complex visual content of images, place high demands on RIS models to capture fine-grained matching between the words in an expression and the objects in an image. However, such matching is hard to learn when the visual cues of referents (i.e., referred objects) are insufficient, because referents with weak visual cues are easily confused with cluttered background at their boundaries or even overwhelmed by salient objects in the image. This insufficiency cannot be handled by the cross-modal fusion mechanisms used in previous work. In this paper, we tackle the problem from a novel perspective, enhancing the visual information of the referents with a Two-stage Visual cues enhancement Network (TV-Net), for which we propose a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module. Through the two-stage enhancement, TV-Net learns fine-grained matching between the natural language expression and the image more effectively, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.

updated: Sat Oct 09 2021 02:53:39 GMT+0000 (UTC)

published: Sat Oct 09 2021 02:53:39 GMT+0000 (UTC)

arXiv
