CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

Xiaoshi Wu; Feng Zhu; Rui Zhao; Hongsheng Li

Open-vocabulary detection (OVD) is an object detection task that aims to detect objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language (VL) pre-trained models, such as CLIP, to recognize novel objects. We identify two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that arises when a VL model trained on whole images is applied to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps the detector learn generalizable object localization through a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where it achieves 41.7 AP50 on novel classes, outperforming the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA^+ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA^+ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
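To make the two ideas in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that region prompting can be approximated by adding a learnable prompt tensor to each pooled region feature map before classification, that mean pooling stands in for CLIP's own pooling layer, and that anchor pre-matching restricts each ground-truth box to anchors carrying the same pre-assigned class label. The function names `classify_regions` and `pre_match`, the tensor shapes, and the random inputs are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_regions(region_feats, prompt, text_embeds):
    """Classify regions against class-name text embeddings.

    region_feats: (R, S, S, C) region feature maps (e.g. from RoI pooling).
    prompt:       (S, S, C) prompt tensor; assumed to be added element-wise.
    text_embeds:  (K, C) one text embedding per class name.
    Returns (labels, sims) with shapes (R,) and (R, K).
    """
    prompted = region_feats + prompt       # prompt broadcasts over all R regions
    pooled = prompted.mean(axis=(1, 2))    # (R, C); mean pooling is a stand-in
    sims = l2_normalize(pooled) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=1), sims


def pre_match(anchor_labels, gt_labels):
    """Class-aware candidate sets: for each ground-truth box index, list the
    anchors whose pre-assigned label equals that box's class."""
    anchor_labels = np.asarray(anchor_labels)
    return {
        g: np.flatnonzero(anchor_labels == c).tolist()
        for g, c in enumerate(gt_labels)
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R, S, C, K = 5, 7, 16, 3               # hypothetical sizes
    feats = rng.normal(size=(R, S, S, C))
    prompt = np.zeros((S, S, C))           # would be learned in practice
    text = rng.normal(size=(K, C))
    labels, _ = classify_regions(feats, prompt, text)
    print(pre_match(labels, gt_labels=[0, 2]))
```

In this sketch a ground-truth box is only ever compared with anchors that the region classifier has already labeled with its class, which is one way to read the "class-aware matching mechanism" described above. A full detector would still need a one-to-one assignment step (for example, bipartite matching) within each candidate set.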

updated: Thu Mar 23 2023 07:13:57 GMT+0000 (UTC)

published: Thu Mar 23 2023 07:13:57 GMT+0000 (UTC)

arXiv
