CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

Yanxin Long; Youpeng Wen; Jianhua Han; Hang Xu; Pengzhen Ren; Wei Zhang; Shen Zhao; Xiaodan Liang

Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under zero-shot and few-shot detection settings. However, existing methods still require a pre-defined category space at inference time, and only objects belonging to that space are predicted. To introduce a "real" open-world detector, we propose CapDet, a novel method that can either predict under a given category list or directly generate the category of each predicted bounding box. Specifically, we unify open-world detection and dense captioning in a single yet effective framework by introducing an additional dense captioning head that generates region-grounded captions. Adding the captioning task in turn benefits the generalization of detection, since the captioning dataset covers more concepts. Experimental results show that, by unifying the dense captioning task, CapDet obtains significant performance improvements over the baseline method on LVIS (1203 classes), e.g., +2.1% mAP on LVIS rare classes. CapDet also achieves state-of-the-art performance on dense captioning, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.
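The two inference modes described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of plain lists as region and text embeddings, and the choice of cosine similarity are all assumptions made for illustration, and the caption head is a stand-in callable rather than a trained model.

```python
import math
from typing import Callable, Mapping, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_region(
    region_feat: Vector,
    categories: Optional[Mapping[str, Vector]] = None,
    caption_head: Optional[Callable[[Vector], str]] = None,
) -> str:
    """Hypothetical helper: label one predicted box in either of two modes.

    Mode 1 (category list given): pick the category whose text embedding
    is most similar to the region feature.
    Mode 2 (no category list): let a captioning head generate the label.
    """
    if categories:
        # Assumption: similarity-based matching against category embeddings.
        return max(categories, key=lambda name: cosine(region_feat, categories[name]))
    if caption_head is None:
        raise ValueError("need either a category list or a caption head")
    return caption_head(region_feat)


# Toy usage: 2-D embeddings stand in for learned features.
toy_categories = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(label_region([0.9, 0.1], categories=toy_categories))
print(label_region([0.9, 0.1], caption_head=lambda feat: "a small animal"))
```

The sketch only shows the control flow: with a category list the label is constrained to that list, and without one the label comes from free-form text generation.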

updated: Sat Mar 04 2023 19:53:00 GMT+0000 (UTC)

published: Sat Mar 04 2023 19:53:00 GMT+0000 (UTC)

arXiv
