POAR: Towards Open Vocabulary Pedestrian Attribute Recognition

Yue Zhang; Suchen Wang; Shichao Kan; Zhenyu Weng; Yigang Cen; Yap-peng Tan

Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian in a surveillance system. Existing methods address the PAR problem by training a multi-label classifier with predefined attribute classes. However, it is impossible to exhaust all pedestrian attributes in the real world. To tackle this problem, we develop a novel pedestrian open-attribute recognition (POAR) framework. Our key idea is to formulate the POAR problem as an image-text search problem. We design a Transformer-based image encoder with a masking strategy. A set of attribute tokens is introduced to focus on specific pedestrian parts (e.g., head, upper body, lower body, feet) and to encode the corresponding attributes into visual embeddings. Each attribute category is described as a natural language sentence and encoded by the text encoder. We then compute the similarity between the visual and text embeddings of attributes to find the best attribute descriptions for the input images. Unlike existing methods that learn a specific classifier for each attribute category, we model the pedestrian at the part level and explore a search-based approach to handle unseen attributes. Finally, because a pedestrian image can comprise multiple attributes, we propose a many-to-many contrastive (MTMC) loss with masked tokens to train the network. Extensive experiments have been conducted on benchmark PAR datasets in an open-attribute setting. The results verify the effectiveness of the proposed POAR method, which can form a strong baseline for the POAR task. Our code is available at https://github.com/IvyYZ/POAR.
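The retrieval formulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes each attribute token yields one visual embedding per body part, that each attribute sentence yields one text embedding, and that similarity is cosine similarity. The function names, the temperature value, and the reading of "masked tokens" as "parts excluded from the loss" are all assumptions made for illustration; the abstract does not give the exact form of the MTMC loss, so the loss below is a generic multi-positive contrastive loss rather than the paper's formula.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def retrieve_attributes(visual_emb, text_emb):
    """Hypothetical helper: match part-level visual embeddings to attribute sentences.

    visual_emb: (P, D) array, one embedding per attribute token (body part).
    text_emb:   (A, D) array, one embedding per attribute sentence.
    Returns the index of the best sentence for each part and the (P, A)
    cosine-similarity matrix.
    """
    sim = l2_normalize(visual_emb) @ l2_normalize(text_emb).T
    return sim.argmax(axis=1), sim


def multi_positive_contrastive_loss(sim, positives, valid, temperature=0.07):
    """Hypothetical multi-positive contrastive loss (an assumption, not the paper's MTMC).

    sim:       (P, A) similarity matrix.
    positives: (P, A) boolean array, True where sentence a describes part p.
    valid:     (P,) boolean array, False for masked parts that are skipped.
    """
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos_count = positives.sum(axis=1)
    keep = valid & (pos_count > 0)
    if not keep.any():
        return 0.0
    # Average the negative log-probability over all positives of each kept part.
    per_part = -(log_prob * positives).sum(axis=1)[keep] / pos_count[keep]
    return float(per_part.mean())


if __name__ == "__main__":
    # Toy data: 4 attribute sentences and 2 body parts in a 4-dimensional space.
    text = np.eye(4)
    visual = np.array([[1.0, 0.1, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.1]])
    best, sim = retrieve_attributes(visual, text)
    labels = np.zeros((2, 4), dtype=bool)
    labels[0, 0] = labels[1, 2] = True
    loss = multi_positive_contrastive_loss(sim, labels, np.array([True, True]))
    print(best, loss)
```

Because attributes are retrieved by similarity rather than predicted by a fixed classifier head, an unseen attribute can in principle be handled by adding its sentence embedding as another row of `text_emb`, with no change to the image side.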

updated: Mon Aug 07 2023 14:08:44 GMT+0000 (UTC)

published: Sun Mar 26 2023 06:59:23 GMT+0000 (UTC)

arXiv
