SLAN: Self-Locator Aided Network for Cross-Modal Understanding

Jiang-Tian Zhai; Qi Zhang; Tong Wu; Xing-Yu Chen; Jiang-Jiang Liu; Bo Ren; Ming-Ming Cheng

Learning the fine-grained interplay between vision and language allows a more accurate understanding of vision-language tasks. However, it remains challenging to extract key image regions according to the texts for semantic alignment. Most existing works are either limited by the text-agnostic and redundant regions obtained with frozen detectors, or fail to scale further because they rely heavily on scarce grounding (gold) data to pre-train detectors. To solve these problems, we propose the Self-Locator Aided Network (SLAN) for cross-modal understanding tasks without any extra gold data. SLAN consists of a region filter and a region adaptor that localize regions of interest conditioned on different texts. By aggregating cross-modal information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance. With detailed region-word alignments, SLAN can be easily generalized to many downstream tasks. It achieves fairly competitive results on five cross-modal understanding tasks (e.g., 85.7% and 69.2% on COCO image-to-text and text-to-image retrieval, surpassing previous SOTA methods). SLAN also demonstrates strong zero-shot and fine-tuned transferability to two localization tasks.
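The two-stage idea in the abstract (select key regions given the text, then adjust their coordinates given the text) can be illustrated with a small toy sketch. This is not the authors' architecture: the cosine-similarity scoring, the top-k selection, the names `region_filter`, `region_adaptor`, and `toy_offset`, and the hand-written offset rule are all assumptions made for illustration, standing in for components that would be learned in a real model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def region_filter(region_feats: List[Sequence[float]],
                  text_feat: Sequence[float], k: int) -> List[int]:
    """Return indices of the k regions most similar to the text.

    Assumption: similarity to a single text vector is used as the score.
    """
    scores = [cosine(f, text_feat) for f in region_feats]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def region_adaptor(box: Box, region_feat: Sequence[float],
                   text_feat: Sequence[float],
                   offset_fn: Callable[[Sequence[float], Sequence[float]], Box]) -> Box:
    """Shift a box by a text-conditioned offset and clamp it to the unit square."""
    deltas = offset_fn(region_feat, text_feat)
    return tuple(min(1.0, max(0.0, c + d)) for c, d in zip(box, deltas))


def toy_offset(region_feat: Sequence[float], text_feat: Sequence[float]) -> Box:
    """Hypothetical stand-in for a learned head: grow the box with similarity."""
    s = 0.05 * max(0.0, cosine(region_feat, text_feat))
    return (-s, -s, s, s)


# Three candidate regions with 2-d features, and one text feature.
boxes = [(0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9), (0.0, 0.6, 0.3, 1.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [1.0, 0.0]

kept = region_filter(feats, text, k=2)
adapted = [region_adaptor(boxes[i], feats[i], text, toy_offset) for i in kept]
```

With a different text vector, the same candidate regions would be ranked and shifted differently, which is the sense in which the selected regions are conditioned on the text rather than fixed by a frozen detector.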

updated: Thu Dec 08 2022 14:17:46 GMT+0000 (UTC)

published: Mon Nov 28 2022 11:42:23 GMT+0000 (UTC)

arXiv
