PLIP: Language-Image Pre-training for Person Representation Learning

Jialong Zuo; Changqian Yu; Nong Sang; Changxin Gao

Pre-training has emerged as an effective technique for learning powerful person representations. Most existing methods show that pre-training on large-scale pure-vision datasets such as ImageNet and LUPerson achieves remarkable performance. However, because these methods rely solely on visual information, they lack robust explicit indicators, which makes it difficult for them to learn discriminative person representations. Drawing inspiration from the fine-grained attribute indicators inherent in person descriptions, we explore introducing the language modality into person representation learning. To this end, we propose a novel language-image pre-training framework for person representation learning, termed PLIP. To explicitly build fine-grained cross-modal associations, we design three pretext tasks: semantic-fused image colorization, visual-fused attributes prediction, and vision-language matching. In addition, because no appropriate dataset exists, we present a large-scale person dataset named SYNTH-PEDES, for which we propose the Stylish Pedestrian Attributes-union Captioning method to synthesize diverse textual descriptions. We pre-train PLIP on SYNTH-PEDES and evaluate our model on a range of downstream tasks, including text-based Re-ID, image-based Re-ID, and person attribute recognition. Extensive experiments demonstrate that our model not only significantly improves on existing methods across all these tasks, but also shows strong ability in the few-shot and domain generalization settings. The code, dataset and weights will be released at https://github.com/Zplusdragon/PLIP.

updated: Mon May 15 2023 06:49:00 GMT+0000 (UTC)

published: Mon May 15 2023 06:49:00 GMT+0000 (UTC)

arXiv
