POAR: Towards Open-World Pedestrian Attribute Recognition

Yue Zhang; Suchen Wang; Shichao Kan; Zhenyu Weng; Yigang Cen; Yap-Peng Tan

Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian in a surveillance system. Existing methods address the PAR problem by training a multi-label classifier on a predefined set of attribute classes. However, it is impossible to enumerate every pedestrian attribute that occurs in the real world. To tackle this problem, we develop a novel pedestrian open-attribute recognition (POAR) framework. Our key idea is to formulate the POAR problem as an image-text search problem. We design a Transformer-based image encoder with a masking strategy. A set of attribute tokens is introduced to focus on specific pedestrian parts (e.g., head, upper body, lower body, feet) and encode the corresponding attributes into visual embeddings. Each attribute category is described as a natural language sentence and encoded by the text encoder. We then compute the similarity between the visual and text embeddings of attributes to find the best attribute descriptions for the input images. Unlike existing methods that learn a specific classifier for each attribute category, we model the pedestrian at the part level and explore a search-based method to handle unseen attributes. Finally, because a pedestrian image can comprise multiple attributes, a many-to-many contrastive (MTMC) loss with masked tokens is proposed to train the network. Extensive experiments have been conducted on benchmark PAR datasets in an open-attribute setting. The results verify the effectiveness of the proposed POAR method, which can form a strong baseline for the POAR task.
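
The retrieval formulation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy embeddings, the attribute sentences, the function names, and the temperature value are all assumptions made for illustration, and the loss shown is a generic contrastive loss with several positives per token, not the exact MTMC loss from the paper. In a real system the embeddings would come from the image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(visual_embs, text_embs):
    """Rows: attribute (part) tokens. Columns: candidate attribute sentences."""
    return [[cosine(v, t) for t in text_embs] for v in visual_embs]

def retrieve(sim, descriptions, top_k=1):
    """For each part token, return the top_k best-matching descriptions."""
    results = []
    for row in sim:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        results.append([descriptions[j] for j in ranked[:top_k]])
    return results

def multi_positive_contrastive_loss(sim, positives, temperature=0.07):
    """Generic contrastive loss where each token may have several positives.

    positives[i] lists the column indices that are correct for token i.
    Tokens with no labelled attribute are skipped (an assumed form of masking).
    """
    losses = []
    for row, pos in zip(sim, positives):
        if not pos:
            continue
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(-sum(logits[j] - log_z for j in pos) / len(pos))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical attribute sentences and hand-made 4-d embeddings.
descriptions = [
    "a pedestrian wearing a hat",
    "a pedestrian with long hair",
    "a pedestrian wearing a skirt",
    "a pedestrian wearing boots",
]
text_embs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# One visual embedding per part token: head, lower body, feet.
visual_embs = [
    [0.9, 0.4, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.2, 0.9],
]

sim = similarity_matrix(visual_embs, text_embs)
retrieved = retrieve(sim, descriptions)
good_loss = multi_positive_contrastive_loss(sim, [[0, 1], [2], [3]])
bad_loss = multi_positive_contrastive_loss(sim, [[2, 3], [0], [1]])
print(retrieved)
print(good_loss, bad_loss)
```

Because recognition reduces to ranking sentences by similarity, an unseen attribute can be queried by adding a new sentence to `descriptions` and its embedding to `text_embs`, without retraining a per-attribute classifier.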

updated: Sun Mar 26 2023 06:59:23 GMT+0000 (UTC)

published: Sun Mar 26 2023 06:59:23 GMT+0000 (UTC)

arXiv
