Comprehensive Multi-Modal Interactions for Referring Image Segmentation

Kanishk Jain; Vineet Gandhi

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a natural language description. Addressing RIS efficiently requires considering the interactions across the visual and linguistic modalities as well as the interactions within each modality. Existing methods are limited because they either compute the different forms of interaction sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions (within the visual modality, within the linguistic modality, and across the two) simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), in which linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach on four benchmark datasets, showing considerable performance gains over existing state-of-the-art (SOTA) methods.
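To illustrate what computing all three interactions simultaneously can mean in practice, the following is a minimal sketch of joint self-attention over concatenated visual and word tokens. It is not the SFM as defined in the paper: the function name `joint_attention`, the token counts, the feature size, and the absence of learned query/key/value projections are all assumptions made for brevity. The point is only that a single attention matrix over the combined token set contains visual-visual, linguistic-linguistic, and cross-modal blocks, so no interaction has to wait on the output of another.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(visual, words):
    """Hypothetical helper: one attention step over both modalities.

    visual: (N_v, d) flattened visual features
    words:  (N_w, d) word features, assumed already projected to size d
    Returns the fused tokens (N_v + N_w, d) and the attention matrix.
    """
    tokens = np.concatenate([visual, words], axis=0)
    d = tokens.shape[1]
    # Scaled dot-product scores between every pair of tokens,
    # regardless of which modality each token comes from.
    scores = tokens @ tokens.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    fused = attn @ tokens
    return fused, attn


# Toy inputs (assumed sizes): 6 spatial locations, 4 words, 8 channels.
rng = np.random.default_rng(0)
visual = rng.standard_normal((6, 8))
words = rng.standard_normal((4, 8))

fused, attn = joint_attention(visual, words)

# The single attention matrix splits into the three kinds of interaction.
n_v = visual.shape[0]
vis_to_vis = attn[:n_v, :n_v]    # intra-modal (visual)
word_to_word = attn[n_v:, n_v:]  # intra-modal (linguistic)
vis_to_word = attn[:n_v, n_v:]   # cross-modal (visual attends to words)
word_to_vis = attn[n_v:, :n_v]   # cross-modal (words attend to visual)
```

A sequential design would instead run, for example, word-to-word attention first and feed its output into a cross-modal step, so errors in the first stage carry into the second. In the joint form above, every row of `attn` is normalized over both modalities at once.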

updated: Sun Aug 14 2022 17:17:05 GMT+0000 (UTC)

published: Wed Apr 21 2021 08:45:09 GMT+0000 (UTC)

arXiv
