See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Xiujun Shu; Wei Wen; Haoqian Wu; Keyu Chen; Yiran Song; Ruizhi Qiao; Bo Ren; Xiao Wang

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between the visual and textual modalities. To achieve this goal, existing works either employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To alleviate these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
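
To make the notion of a common latent space concrete, the following is a minimal sketch of how retrieval could proceed once text and images are embedded in the same space: gallery images are ranked by cosine similarity to the text query. It is not the IVT implementation. The function names (`l2_normalize`, `cosine_similarity`, `rank_gallery`) and the toy 3-dimensional embeddings are assumptions made for illustration; in practice the embeddings would come from a trained model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper; zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(text_embedding, gallery_embeddings):
    """Return (gallery_index, score) pairs sorted from most to least similar."""
    scores = [
        (idx, cosine_similarity(text_embedding, emb))
        for idx, emb in enumerate(gallery_embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for model outputs (assumed values, not real data).
query = [1.0, 0.0, 1.0]
gallery = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_gallery(query, gallery)
ranked_indices = [idx for idx, _ in ranking]
```

Under these assumptions, the gallery image whose embedding points in the same direction as the query is ranked first, which is the behaviour that Rank-k style evaluation on retrieval benchmarks measures.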

updated: Fri Aug 26 2022 03:11:08 GMT+0000 (UTC)

published: Thu Aug 18 2022 03:04:37 GMT+0000 (UTC)

arXiv
