Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Guanshuo Wang; Fufu Yu; Junjie Li; Qiong Jia; Shouhong Ding

Text-based Person Search (TPS) aims to retrieve pedestrians that match text descriptions rather than query images. Recent Vision-Language Pre-training (VLP) models bring transferable knowledge to downstream TPS tasks, yielding performance gains more efficiently. However, existing VLP-enhanced TPS methods use only the pre-trained visual encoder, neglecting the corresponding textual representation and breaking the modality alignment learned from large-scale pre-training. In this paper, we explore how to fully exploit the textual potential of VLP in TPS tasks. We build on our proposed VLP-TPS baseline model, the first TPS model in which both modalities are pre-trained. We propose Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the textual modality by incorporating different components of the fine-grained corpus during training. Inspired by the prompt approach to zero-shot classification with VLP models, we propose the Dynamic Attribute Prompt (DAP) to provide a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that the proposed TPS framework achieves state-of-the-art performance, exceeding the previous best method by a margin.
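
As a rough illustration of the retrieval task described in the abstract, the sketch below ranks a gallery of pedestrian images against one text query by cosine similarity in a shared embedding space. It is not the VLP-TPS model: the embeddings are hand-written toy vectors standing in for encoder outputs, and the names `cosine`, `rank_gallery`, and the `person_*` ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(text_embedding, gallery):
    """Return gallery ids sorted by descending similarity to the text embedding."""
    scored = [(cosine(text_embedding, emb), pid) for pid, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored]


# Toy vectors (assumed); a real system would obtain these from
# pre-trained text and image encoders.
text_embedding = [1.0, 0.0, 1.0]
gallery = {
    "person_a": [0.9, 0.1, 0.8],
    "person_b": [0.0, 1.0, 0.0],
    "person_c": [0.5, 0.5, 0.5],
}

ranking = rank_gallery(text_embedding, gallery)
print(ranking)
```

In this toy setup the image whose embedding points in nearly the same direction as the text embedding is ranked first; how well the two encoders are aligned determines how meaningful that ranking is, which is the alignment the abstract argues should be preserved.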

updated: Wed Mar 08 2023 10:41:22 GMT+0000 (UTC)

published: Wed Mar 08 2023 10:41:22 GMT+0000 (UTC)

arXiv
