Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval

Xianghao Zang; Ge Li; Wei Gao

In video surveillance, pedestrian retrieval (also called person re-identification) is a critical task that aims to retrieve a pedestrian of interest across non-overlapping cameras. Recently, transformer-based models have made significant progress on this task. However, these models still tend to overlook fine-grained, part-informed information. This paper proposes a multi-direction and multi-scale Pyramid in Transformer (PiT) to address this problem. In a transformer-based architecture, each pedestrian image is split into many patches, which are then fed to transformer layers to obtain the feature representation of the image. To explore fine-grained information, this paper applies vertical division and horizontal division to these patches to generate human parts in different directions, and these parts provide more fine-grained information. To fuse multi-scale feature representations, this paper presents a pyramid structure containing global-level information and many pieces of local-level information at different scales. The feature pyramids of all the pedestrian images from the same video are fused to form the final multi-direction and multi-scale feature representation. Experimental results on two challenging video-based benchmarks, MARS and iLIDS-VID, show that the proposed PiT achieves state-of-the-art performance. Extensive ablation studies demonstrate the superiority of the proposed pyramid structure. The code is available at https://git.openi.org.cn/zangxh/PiT.git.
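
The division and pyramid steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the patch features of one frame are already available as a rows x cols grid of vectors, it uses plain mean pooling as a stand-in for whatever the transformer layers do with each part, it averages over frames as a stand-in for the video-level fusion, and the scales (2 and 4 parts) are chosen only for illustration. The abstract does not say which cut direction it calls "vertical" or "horizontal", so the hypothetical helpers below are named after what they cut (row bands and column bands).

```python
from typing import List, Sequence

Vector = List[float]
PatchGrid = List[List[Vector]]  # rows x cols x feature_dim (assumed layout)


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def split_rows(grid: PatchGrid, parts: int) -> List[List[Vector]]:
    """Cut the grid into `parts` bands of consecutive rows (top to bottom)."""
    rows = len(grid)
    assert rows % parts == 0, "row count must be divisible by the number of parts"
    step = rows // parts
    return [
        [patch for row in grid[i:i + step] for patch in row]
        for i in range(0, rows, step)
    ]


def split_cols(grid: PatchGrid, parts: int) -> List[List[Vector]]:
    """Cut the grid into `parts` bands of consecutive columns (left to right)."""
    transposed = [list(col) for col in zip(*grid)]
    return split_rows(transposed, parts)


def frame_pyramid(grid: PatchGrid, scales: Sequence[int] = (2, 4)) -> List[Vector]:
    """Global feature, then row-band and column-band features at each scale."""
    all_patches = [patch for row in grid for patch in row]
    features = [mean_vector(all_patches)]  # global level
    for parts in scales:  # illustrative scales, not taken from the paper
        features += [mean_vector(p) for p in split_rows(grid, parts)]
        features += [mean_vector(p) for p in split_cols(grid, parts)]
    return features


def video_pyramid(frames: Sequence[PatchGrid], scales: Sequence[int] = (2, 4)) -> List[Vector]:
    """Fuse per-frame pyramids by averaging each pyramid entry over frames."""
    pyramids = [frame_pyramid(frame, scales) for frame in frames]
    return [mean_vector(entries) for entries in zip(*pyramids)]


if __name__ == "__main__":
    # Toy 4x4 grid of 1-dimensional patch features; the value is the row index.
    toy_frame = [[[float(r)] for _ in range(4)] for r in range(4)]
    fused = video_pyramid([toy_frame, toy_frame])
    print(len(fused), fused[:3])
```

With the default scales, each pyramid holds 1 global entry plus 2 x (2 + 4) part entries, so 13 vectors per video. In a real model the part features would more plausibly be concatenated or passed to further layers rather than kept as a flat list, but that choice is outside what the abstract states.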

updated: Sat Feb 12 2022 08:22:47 GMT+0000 (UTC)

published: Sat Feb 12 2022 08:22:47 GMT+0000 (UTC)

arXiv
