Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search

Chenyang Gao; Guanyu Cai; Xinyang Jiang; Feng Zheng; Jun Zhang; Yifei Gong; Pai Peng; Xiaowei Guo; Xing Sun

Text-based person search aims at retrieving a target person from an image gallery using a descriptive sentence of that person. It is very challenging because the modality gap makes it harder to extract discriminative features effectively. Moreover, the inter-class variance of both pedestrian images and descriptions is small, so comprehensive information is needed to align visual and textual clues across all scales. Most existing methods merely consider the local alignment between images and texts within a single scale (e.g., only the global scale or only the partial scale), or simply construct alignments at each scale separately. To address this problem, we propose a method that is able to adaptively align image and textual features across all scales, called NAFS (i.e., Non-local Alignment over Full-Scale representations). Firstly, a novel staircase network structure is proposed to extract full-scale image features with better locality. Secondly, a BERT with locality-constrained attention is proposed to obtain representations of descriptions at different scales. Then, instead of separately aligning features at each scale, a novel contextual non-local attention mechanism is applied to simultaneously discover latent alignments across all scales. The experimental results show that our method outperforms the state-of-the-art methods by 5.53% in terms of top-1 and 5.35% in terms of top-5 accuracy on a text-based person search dataset. The code is available at https://github.com/TencentYoutuResearch/PersonReID-NAFS
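
To make the idea of aligning across all scales at once more concrete, the following is a minimal sketch of generic cross-modal attention over pooled multi-scale features. It is not the NAFS implementation: the function names (`attend`, `full_scale_similarity`), the scale labels, the `temperature` parameter, and the use of cosine similarity with a softmax are all assumptions made for illustration. Each text feature, whatever its scale, attends over image features from every scale rather than only its own.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize; a zero vector is returned unchanged.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, temperature=0.1):
    """Attention weights and attended context of `query` over `keys`.

    `temperature` is a hypothetical hyperparameter, not taken from the paper.
    """
    q = normalize(query)
    ks = [normalize(k) for k in keys]
    weights = softmax([dot(q, k) / temperature for k in ks])
    context = [sum(w * k[i] for w, k in zip(weights, ks)) for i in range(len(q))]
    return weights, context

def full_scale_similarity(text_feats, image_feats, temperature=0.1):
    """Average similarity between each text feature and its attended image context.

    Both arguments map a scale name to a list of feature vectors. Features
    from all scales are pooled, so alignment is not restricted to one scale.
    """
    text_all = [f for scale in text_feats.values() for f in scale]
    image_all = [f for scale in image_feats.values() for f in scale]
    sims = []
    for t in text_all:
        _, ctx = attend(t, image_all, temperature)
        sims.append(dot(normalize(t), normalize(ctx)))
    return sum(sims) / len(sims)

# Toy features (made up for illustration only).
text = {"sentence": [[1.0, 0.0, 0.0]], "phrase": [[0.0, 1.0, 0.0]]}
matching_image = {"global": [[1.0, 0.0, 0.0]],
                  "region": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]}
other_image = {"global": [[0.0, 0.0, 1.0]],
               "region": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]}

matched = full_scale_similarity(text, matching_image)
mismatched = full_scale_similarity(text, other_image)
```

In this toy setup the phrase-level text feature finds its best match among the region-level image features, while the sentence-level feature matches the global image feature, so `matched` comes out higher than `mismatched`. A learned model would replace the hand-written vectors with features from the image and text encoders.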

updated: Fri Jan 08 2021 14:30:07 GMT+0000 (UTC)

published: Fri Jan 08 2021 14:30:07 GMT+0000 (UTC)

arXiv
