Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Shuanglin Yan; Hao Tang; Liyan Zhang; Jinhui Tang

Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images of the same identity from an image gallery given a query text. In recent years, TBPS has made remarkable progress, and state-of-the-art methods achieve superior performance by learning local fine-grained correspondences between images and texts. However, most existing methods rely on explicitly generated local parts to model fine-grained correspondences between modalities, which is unreliable because such parts may lack contextual information or introduce noise. Moreover, existing methods seldom consider the information inequality problem between modalities caused by image-specific information. To address these limitations, we propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which learns image/text feature representations aligned across modalities at multiple levels and enables fast and effective person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors through relation-guided localization and channel attention filtration, respectively. This module effectively alleviates the information inequality problem and aligns the information volume between images and texts. Second, we propose an implicit local alignment module that adaptively aggregates all pixel/word features of an image/text into a set of modality-shared semantic topic centers and implicitly learns the local fine-grained correspondence between modalities without additional supervision or cross-modal interaction. A global alignment is also introduced as a supplement to the local perspective. The cooperation of the global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of MANet.
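
The implicit local alignment idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the aggregation rule, so the scaled dot-product attention, the cosine-based score, and every name below (`aggregate_to_topics`, `local_similarity`, `centers`) are assumptions chosen only for illustration. What the sketch does reflect is the stated structure: each modality is pooled independently onto the same shared set of topic centers, so the k-th image vector and the k-th text vector become comparable without explicit part detection or cross-modal interaction.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def aggregate_to_topics(features, topic_centers):
    """Pool N local features (pixels or words) into one vector per topic.

    features:      list of N vectors of length D
    topic_centers: list of K shared vectors of length D (hypothetical parameters)
    returns:       list of K pooled vectors of length D
    """
    dim = len(topic_centers[0])
    scale = math.sqrt(dim)
    pooled = []
    for center in topic_centers:
        # Attention weights of this topic center over all local features.
        weights = softmax([dot(center, f) / scale for f in features])
        pooled.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return pooled


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12)


def local_similarity(image_feats, text_feats, topic_centers):
    """Mean cosine similarity between same-topic image and text vectors.

    The two modalities never attend to each other; they only share the centers.
    """
    img_topics = aggregate_to_topics(image_feats, topic_centers)
    txt_topics = aggregate_to_topics(text_feats, topic_centers)
    sims = [cosine(i, t) for i, t in zip(img_topics, txt_topics)]
    return sum(sims) / len(sims)


# Toy data: random vectors stand in for learned centers and encoder outputs.
rng = random.Random(0)
D, K = 8, 4
centers = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
image_feats = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(12)]  # 12 pixel features
text_feats = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(5)]    # 5 word features
score = local_similarity(image_feats, text_feats, centers)
```

Because the pooled representation has a fixed size of K vectors regardless of how many pixels or words went in, gallery images could in principle be encoded once and compared to any query text with a cheap vector comparison, which is consistent with the abstract's emphasis on fast retrieval.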

updated: Fri Jul 14 2023 03:07:59 GMT+0000 (UTC)

published: Tue Aug 30 2022 16:14:18 GMT+0000 (UTC)

arXiv
