P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Yanxin Long; Jianhua Han; Runhui Huang; Xu Hang; Yi Zhu; Chunjing Xu; Xiaodan Liang

Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works attempt to extend this line of work to object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since current VLMs are usually pre-trained by aligning sentence embeddings with global image embeddings, using them directly lacks fine-grained alignment for object instances, which is the core of detection. In this paper, we propose a simple but effective Pretrain-adaPt-Pseudo labeling paradigm for Open-Vocabulary Detection (P^3OVD) that introduces a fine-grained visual-text prompt adapting stage to enhance the current self-training paradigm with more powerful fine-grained alignment. During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task. Furthermore, we propose a visual prompt module that provides prior task information (i.e., the categories that need to be predicted) to the vision branch, to better adapt the pre-trained VLM to downstream tasks. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
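The abstract does not specify the architecture, losses, or prompt modules, so the following is only a minimal sketch of the general idea behind dense pixel-wise visual-text alignment followed by pseudo labeling, not the authors' implementation. It assumes a hypothetical VLM that yields one embedding per pixel and one text embedding per category prompt (standing in for the output of learnable text prompts). The function names, the temperature of 0.07, and the threshold of 0.5 are all illustrative assumptions.

```python
import numpy as np

def pixel_text_scores(pixel_emb, text_emb, temperature=0.07):
    """Per-pixel class probabilities from cosine similarity.

    pixel_emb: (H, W, D) dense visual embeddings (hypothetical VLM output)
    text_emb:  (C, D) one embedding per category prompt (hypothetical)
    """
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = p @ t.T / temperature                # shape (H, W, C)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def mask_to_box(mask):
    """Tightest (x_min, y_min, x_max, y_max) box around a boolean mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy data: 3 categories in a 4-dim embedding space (assumed values).
text_emb = np.eye(3, 4)
# A 4x4 "image": every pixel matches category 0 except one region.
pixel_emb = np.tile(text_emb[0], (4, 4, 1))
pixel_emb[1:3, 1:4] = text_emb[2]  # rows 1-2, cols 1-3 match category 2

probs = pixel_text_scores(pixel_emb, text_emb)
mask = probs[..., 2] > 0.5          # assumed confidence threshold
box = mask_to_box(mask)             # pseudo box for the "unseen" category 2
```

In a self-training setup, boxes obtained this way for unseen categories would serve as pseudo labels for training a detector; how P^3OVD actually derives and filters its pseudo labels is not stated in the abstract.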

updated: Wed Nov 02 2022 03:38:02 GMT+0000 (UTC)

published: Wed Nov 02 2022 03:38:02 GMT+0000 (UTC)

arXiv

