Comprehensive Multi-Modal Interactions for Referring Image Segmentation

Kanishk Jain; Vineet Gandhi

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a given natural language description. To solve RIS efficiently, we need to understand each word's relationship to the other words, each image region's relationship to the other regions, and the cross-modal alignment between the linguistic and visual domains. We argue that one limiting factor in recent methods is that they do not handle these interactions simultaneously. To this end, we propose a novel architecture called JRNet, which uses a Joint Reasoning Module (JRM) to concurrently capture the inter-modal and intra-modal interactions. The output of the JRM is passed through a novel Cross-Modal Multi-Level Fusion (CMMLF) module, which further refines the segmentation masks by exchanging contextual information across the visual hierarchy, with linguistic features acting as a bridge. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, showing considerable performance gains over existing state-of-the-art methods.
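
To make the idea of handling intra-modal and inter-modal interactions at the same time more concrete, here is a minimal NumPy sketch. It is not the authors' JRM: it assumes the interactions can be illustrated by plain scaled dot-product self-attention over the concatenation of region features and word features, with no learned projections. The function name `joint_attention`, the feature shapes, and the random inputs are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(visual, words):
    """Illustrative joint self-attention (an assumption, not the paper's JRM).

    visual: (n_regions, d) array of region features
    words:  (n_words, d) array of word features
    Returns updated region features, updated word features, and the
    (n_regions + n_words) x (n_regions + n_words) attention matrix.
    """
    # Stack both modalities into one sequence so that every token can
    # attend to every other token, regardless of modality.
    joint = np.concatenate([visual, words], axis=0)
    d = joint.shape[1]

    # One attention matrix holds all interaction types at once:
    # region-region and word-word (intra-modal) in the diagonal blocks,
    # region-word and word-region (inter-modal) in the off-diagonal blocks.
    scores = joint @ joint.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)

    out = attn @ joint
    n_v = visual.shape[0]
    return out[:n_v], out[n_v:], attn


# Hypothetical toy inputs: 6 image regions and 4 words, 8-dim features.
rng = np.random.default_rng(0)
visual = rng.standard_normal((6, 8))
words = rng.standard_normal((4, 8))
v_out, w_out, attn = joint_attention(visual, words)
```

In this sketch, `attn[:6, :6]` and `attn[6:, 6:]` hold the intra-modal weights and `attn[:6, 6:]` and `attn[6:, :6]` hold the cross-modal weights, so a single attention step covers all three kinds of interaction named in the abstract.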

updated: Wed Aug 25 2021 10:01:43 GMT+0000 (UTC)

published: Wed Apr 21 2021 08:45:09 GMT+0000 (UTC)

arXiv
