Linguistic Query-Guided Mask Generation for Referring Image Segmentation

Zhichao Wei; Xiaohao Chen; Mingqiang Chen; Siyu Zhu

Referring image segmentation aims to segment the image region of interest described by a given language expression, and is a typical multi-modal task. Existing methods adopt either a pixel classification-based or a learnable query-based framework for mask generation. Both rely on a fixed number of parametric prototypes, which is insufficient for handling diverse text-image pairs. In this work, we propose LGFormer, an end-to-end framework built on a transformer that performs Linguistic query-Guided mask generation. It treats the linguistic features as queries to generate a specialized prototype for each input image-text pair, and thus produces more consistent segmentation results. Moreover, we design several cross-modal interaction modules (e.g., a vision-language bidirectional attention module, VLBA) in both the encoder and the decoder to achieve better cross-modal alignment.
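The idea of using linguistic features as a query to produce an input-specific prototype can be illustrated with a minimal NumPy sketch. This is not the LGFormer implementation: the function name `language_guided_mask`, the mean pooling of word features, the single attention step, and the dot-product mask head are all assumptions chosen to keep the example small, and no learned weights are involved.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_guided_mask(pixel_feats, word_feats):
    """Hypothetical sketch of language-query-guided mask generation.

    pixel_feats: (H, W, C) visual features.
    word_feats:  (L, C) linguistic features, one row per token.
    Returns an (H, W) array of mask probabilities.
    """
    H, W, C = pixel_feats.shape
    pixels = pixel_feats.reshape(H * W, C)

    # Assumption: pool the tokens into a single sentence-level query.
    query = word_feats.mean(axis=0)                        # (C,)

    # Cross-attention: the linguistic query attends over all pixels.
    attn = softmax(pixels @ query / np.sqrt(C), axis=0)    # (H*W,)

    # The attended visual feature acts as a prototype specific to this
    # image-text pair, instead of a fixed learned parameter.
    prototype = attn @ pixels                              # (C,)

    # Score every pixel against the prototype and squash to (0, 1).
    logits = pixels @ prototype / np.sqrt(C)
    return (1.0 / (1.0 + np.exp(-logits))).reshape(H, W)


# Toy image: the top row carries one feature direction, the bottom row another.
image = np.zeros((2, 2, 2))
image[0, :, 0] = 4.0
image[1, :, 1] = 4.0

mask_top = language_guided_mask(image, np.array([[4.0, 0.0]]))
mask_bottom = language_guided_mask(image, np.array([[0.0, 4.0]]))
```

In this toy setup the same image yields a different prototype, and therefore a different mask, for each expression, which is the contrast the abstract draws with a fixed set of parametric prototypes.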

updated: Wed Mar 22 2023 12:01:42 GMT+0000 (UTC)

published: Mon Jan 16 2023 13:38:22 GMT+0000 (UTC)

arXiv
