CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification

Tengfei Liang; Yi Jin; Yajun Gao; Wu Liu; Songhe Feng; Tao Wang; Yidong Li

Visible-infrared cross-modality person re-identification is a challenging ReID task that aims to retrieve and match images of the same identity across the heterogeneous visible and infrared modalities. The core of this task is therefore to bridge the large gap between these two modalities. Existing convolutional neural network (CNN)-based methods mainly suffer from insufficient perception of modality information and cannot learn good discriminative, modality-invariant embeddings for identities, which limits their performance. To solve these problems, we propose a cross-modality transformer-based method (CMTR) for the visible-infrared person re-identification task, which explicitly mines the information of each modality and generates more discriminative features based on it. Specifically, to capture the characteristics of each modality, we design novel modality embeddings, which are fused with token embeddings to encode modality information. Furthermore, to enhance the representation of the modality embeddings and adjust the distribution of the matching embeddings, we propose a modality-aware enhancement loss based on the learned modality information, which reduces intra-class distance and enlarges inter-class distance. To our knowledge, this is the first work to apply a transformer network to the cross-modality re-identification task. We conduct extensive experiments on the public SYSU-MM01 and RegDB datasets, and the proposed CMTR model significantly outperforms existing strong CNN-based methods.
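
As a rough illustration of the modality-embedding idea, the sketch below adds one vector per modality to every token embedding of an image from that modality. The abstract says only that the modality embeddings are "fused" with the token embeddings, so element-wise addition is an assumption here, as are all names (`MODALITIES`, `make_table`, `fuse_embeddings`), the dimensions, and the random values standing in for learned parameters. It is not the authors' implementation.

```python
import random

# Hypothetical modality indices; the paper's actual encoding is not given here.
MODALITIES = {"visible": 0, "infrared": 1}


def make_table(rows, dim, rng):
    """Stand-in for a learnable embedding table (random values, not trained)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def fuse_embeddings(tokens, modality, mod_table):
    """Add the modality's embedding vector to every token embedding.

    Element-wise addition is an assumption; the abstract does not specify
    how the fusion is performed.
    """
    mod_vec = mod_table[MODALITIES[modality]]
    return [[t + m for t, m in zip(tok, mod_vec)] for tok in tokens]


rng = random.Random(0)
dim, num_tokens = 8, 4
mod_table = make_table(len(MODALITIES), dim, rng)
tokens = make_table(num_tokens, dim, rng)  # hypothetical patch-token embeddings

visible = fuse_embeddings(tokens, "visible", mod_table)
infrared = fuse_embeddings(tokens, "infrared", mod_table)
```

Under this assumption, the same token sequence tagged as visible or infrared differs by a constant per-modality offset, which gives later layers an explicit signal about which modality an input came from.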

updated: Mon Oct 18 2021 03:12:59 GMT+0000 (UTC)

published: Mon Oct 18 2021 03:12:59 GMT+0000 (UTC)

arXiv
