Locate then Segment: A Strong Pipeline for Referring Image Segmentation

Ya Jing; Tao Kong; Wei Wang; Liang Wang; Lei Li; Tieniu Tan

Referring image segmentation aims to segment the objects referred to by a natural language expression. Previous methods usually focus on designing an implicit, recurrent feature interaction mechanism that fuses visual-linguistic features and directly generates the final segmentation mask, without explicitly modeling the localization information of the referent instances. To address this limitation, we view the task from another perspective by decoupling it into a "Locate-Then-Segment" (LTS) scheme. Given a language expression, people generally first attend to the corresponding target image regions, then generate a fine segmentation mask of the object based on its context. The LTS first extracts and fuses both visual and textual features to get a cross-modal representation, then applies a cross-modal interaction on the visual-textual features to locate the referred object with a position prior, and finally generates the segmentation result with a lightweight segmentation network. Our LTS is simple but surprisingly effective. On three popular benchmark datasets, the LTS outperforms all previous state-of-the-art methods by a large margin (e.g., +3.2% on RefCOCO+ and +3.4% on RefCOCOg). In addition, explicitly locating the object makes our model more interpretable, which is also supported by visualization experiments. We believe this framework is a promising strong baseline for referring image segmentation.
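The three stages named in the abstract (fuse, locate, segment) can be illustrated with a toy sketch. This is not the authors' architecture: the element-wise fusion, the sigmoid scoring, the hand-picked head weights, and every function name here (`fuse`, `locate`, `segment`, `locate_then_segment`) are assumptions chosen only to show how a coarse position prior can be computed first and then passed to a separate, small segmentation step.

```python
import math


def fuse(visual, text):
    """Fuse each spatial feature vector with the text vector (element-wise product)."""
    return [[[v * t for v, t in zip(cell, text)] for cell in row] for row in visual]


def locate(fused):
    """Coarse position prior: sigmoid of the summed fused response at each cell."""
    return [[1.0 / (1.0 + math.exp(-sum(cell))) for cell in row] for row in fused]


def segment(fused, prior, w_prior=4.0, w_feat=0.5):
    """Toy stand-in for a lightweight head: a linear score over prior and features.

    The weights are hypothetical; a real head would be learned.
    """
    mask = []
    for fused_row, prior_row in zip(fused, prior):
        mask_row = []
        for cell, p in zip(fused_row, prior_row):
            logit = w_prior * (p - 0.5) + w_feat * sum(cell)
            mask_row.append(1 if logit > 0 else 0)
        mask.append(mask_row)
    return mask


def locate_then_segment(visual, text):
    """Run the three stages in order and return (mask, prior)."""
    fused = fuse(visual, text)
    prior = locate(fused)
    return segment(fused, prior), prior


# Toy input: a 2x3 feature map with 2 channels, and a 2-dim text embedding.
visual = [
    [[2, 0], [0, 2], [0, 0]],
    [[3, 1], [1, 3], [1, 1]],
]
text = [1.0, -1.0]
mask, prior = locate_then_segment(visual, text)
```

Returning the prior together with the mask mirrors the interpretability point in the abstract: the intermediate localization map can be inspected on its own before the mask is produced.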

updated: Tue Mar 30 2021 12:25:27 GMT+0000 (UTC)

published: Tue Mar 30 2021 12:25:27 GMT+0000 (UTC)

arXiv
