Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification

Suncheng Xiang; Jingsheng Gao; Mengyuan Guan; Jiacheng Ruan; Chengfeng Zhou; Ting Liu; Dahong Qian; Yuzhuo Fu

Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision, and it plays a significant role in realistic scenarios because of its applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning and neglect the potential of semantic features during training, which easily leads to poor generalization when a model is adapted to a new domain. In this paper, we propose a Multi-Modal Equivalent Transformer, called MMET, for more robust visual-semantic embedding learning on visual, textual, and visual-textual tasks. To further enhance robust feature learning in the transformer setting, we introduce a dynamic masking mechanism, called the Masked Multimodal Modeling (MMM) strategy, that masks both image patches and text tokens; it can work jointly on multimodal or unimodal data and significantly boosts the performance of generalizable person Re-ID. Extensive experiments on benchmark datasets demonstrate the competitive performance of our method compared with previous approaches. We hope this method can advance research on visual-semantic representation learning. Our source code is publicly available at https://github.com/JeremyXSC/MMET.
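
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no masking ratios or placeholder values, so the function names, the ratios (0.5 for patches, 0.15 for tokens), and the placeholders below are all assumptions chosen for illustration. It only shows a mask that is re-sampled on every call and that accepts image patches, text tokens, or both.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a masked text token
MASK_PATCH = None      # hypothetical placeholder for a masked image patch


def random_mask(items, mask_ratio, mask_value, rng):
    """Return a copy of `items` with a random subset replaced by `mask_value`,
    together with the sorted indices that were masked."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_masked = int(round(len(items) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(items)), num_masked))
    masked = list(items)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


def mask_multimodal(patches=None, tokens=None,
                    patch_ratio=0.5, token_ratio=0.15, rng=None):
    """Mask whichever modalities are present (assumed ratios, not from the paper).

    A fresh mask is sampled on every call, so the same sample is masked
    differently each time it is seen during training.
    """
    rng = rng or random.Random()
    out = {}
    if patches is not None:
        out["patches"] = random_mask(patches, patch_ratio, MASK_PATCH, rng)
    if tokens is not None:
        out["tokens"] = random_mask(tokens, token_ratio, MASK_TOKEN, rng)
    return out


if __name__ == "__main__":
    sample = mask_multimodal(
        patches=list(range(8)),                        # stand-ins for patch embeddings
        tokens=["a", "man", "in", "a", "red", "coat"],
    )
    print(sample["patches"])
    print(sample["tokens"])
```

Passing only `patches` or only `tokens` gives the unimodal case; passing both gives the multimodal case. In a real model the masked positions would feed a reconstruction or prediction objective, which this sketch leaves out.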

updated: Wed Apr 19 2023 08:37:25 GMT+0000 (UTC)

published: Wed Apr 19 2023 08:37:25 GMT+0000 (UTC)

arXiv
