Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension

Peihan Miao; Wei Su; Gaoang Wang; Xuewei Li; Xi Li

As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) generally requires a large amount of multi-grained information from the visual and linguistic modalities to achieve accurate reasoning. In addition, due to the diversity of visual scenes and the variation of linguistic expressions, some hard examples contain much more abundant multi-grained information than others. How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task. To address the aforementioned challenges, we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability through innovations in network structure and learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention, which effectively utilizes the inherent multi-grained information in visual and linguistic encoders. Furthermore, considering the large variance among samples, we propose a self-paced sample informativeness learning scheme to adaptively enhance network learning for samples containing abundant multi-grained information. The proposed framework significantly outperforms state-of-the-art methods on widely used datasets, including RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, demonstrating the effectiveness of our method.

updated: Tue Mar 12 2024 08:13:27 GMT+0000 (UTC)

published: Thu Apr 21 2022 08:32:47 GMT+0000 (UTC)

arXiv

