Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold

Zijie Wang; Aichun Zhu; Jingyi Xue; Xili Wan; Chao Liu; Tian Wang; Yifeng Li

The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches attempt to learn a latent common manifold mapping paradigm in a cross-modal distribution consensus prediction (CDCP) manner. When features from the distribution of one modality are mapped into the common manifold, the feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus, so as to embed and align the multi-modal features in a constructed cross-modal common manifold, depends entirely on the experience of the model itself rather than on the actual situation. With such methods, it is inevitable that the multi-modal data cannot be well aligned in the common manifold, which ultimately leads to sub-optimal retrieval performance. To overcome this CDCP dilemma, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C^3M) for text-based person retrieval. The core idea of our method, as the Chinese saying goes, is to 'san si er hou xing', namely, to Look Before yoU Leap (LBUL). The common manifold mapping mechanism of LBUL contains a looking step and a leaping step. Compared to CDCP-based methods, LBUL considers the distribution characteristics of both the visual and textual modalities before embedding data from one modality into C^3M, which yields a more solid cross-modal distribution consensus and hence superior retrieval accuracy. We evaluate our proposed method on two text-based person retrieval datasets, CUHK-PEDES and RSTPReid. Experimental results demonstrate that the proposed LBUL outperforms previous methods and achieves state-of-the-art performance.
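
To make the CDCP setting criticized above concrete, the following is a minimal sketch of a generic common-space retrieval pipeline in which each modality is projected on its own, without seeing the other modality's features. It is not the LBUL algorithm and is not taken from the paper: the function names (`project`, `l2_normalize`, `rank_gallery`), the use of plain linear maps, and the random toy features are all assumptions made for illustration.

```python
import math
import random

def project(features, weights):
    """Map one modality's features into the shared space.

    `weights` is a (d_in x d_out) matrix given as a list of rows. Each
    modality has its own matrix and never sees the other modality's
    features, which is the CDCP-style setting described in the abstract.
    """
    columns = list(zip(*weights))
    return [[sum(f * w for f, w in zip(row, col)) for col in columns]
            for row in features]

def l2_normalize(vectors):
    """Scale each vector to unit length so dot products are cosine similarities."""
    normalized = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        normalized.append([x / norm for x in v])
    return normalized

def rank_gallery(text_emb, image_embs):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    sims = [sum(t * i for t, i in zip(text_emb, img)) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda k: sims[k], reverse=True)

# Toy usage with hypothetical dimensions: 5 gallery images (8-d features),
# 1 text query (6-d features), and a 4-d shared space.
rng = random.Random(0)
image_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
text_feats = [[rng.gauss(0, 1) for _ in range(6)]]
w_image = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
w_text = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]

image_embs = l2_normalize(project(image_feats, w_image))
text_embs = l2_normalize(project(text_feats, w_text))
print(rank_gallery(text_embs[0], image_embs))
```

In this sketch the two projections are computed independently, so any alignment between them has to come from training alone. According to the abstract, LBUL differs by taking the distribution characteristics of both modalities into account (the looking step) before embedding either one (the leaping step); the abstract does not give enough detail to sketch that mechanism here.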

updated: Tue Sep 13 2022 07:21:21 GMT+0000 (UTC)

published: Tue Sep 13 2022 07:21:21 GMT+0000 (UTC)

arXiv

