DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

Lewei Yao; Jianhua Han; Xiaodan Liang; Dan Xu; Wei Zhang; Zhenguo Li; Hang Xu

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks, which typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo-labeling process, DetCLIPv2 directly learns fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we use the maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with hybrid supervision from detection, grounding, and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: it uses 13× more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance. For example, DetCLIPv2 with a Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, outperforming the previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully supervised counterpart by a large margin.
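
The idea of letting a maximum word-region similarity guide a contrastive objective can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how per-word scores are aggregated, which contrastive loss is used, or what temperature applies. The sketch assumes cosine similarity, a mean over words, a one-directional (image-to-text) InfoNCE loss, and a temperature of 0.07. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Assumes a non-zero vector; a real pipeline would guard against zero norm.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def image_text_similarity(regions, words):
    """Score one image against one caption.

    For each word embedding, keep only its best-matching region (the maximum
    cosine similarity), then average those maxima over the words.
    The mean aggregation is an assumption of this sketch.
    """
    regions = [l2_normalize(r) for r in regions]
    words = [l2_normalize(w) for w in words]
    best_per_word = [max(dot(w, r) for r in regions) for w in words]
    return sum(best_per_word) / len(best_per_word)

def contrastive_loss(batch_regions, batch_words, temperature=0.07):
    """Image-to-text InfoNCE loss; image i is paired with caption i.

    The loss form and the temperature value are assumptions of this sketch.
    """
    n = len(batch_regions)
    loss = 0.0
    for i in range(n):
        logits = [
            image_text_similarity(batch_regions[i], batch_words[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all captions in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n

# Toy batch: two images with one region each, two captions with one word each.
regions = [[[1.0, 0.0]], [[0.0, 1.0]]]
words = [[[1.0, 0.0]], [[0.0, 1.0]]]
aligned = contrastive_loss(regions, words)
mismatched = contrastive_loss(regions, list(reversed(words)))
```

In this toy batch, `aligned` comes out much smaller than `mismatched`, because each caption's word finds a perfectly matching region only in its own image. Taking the maximum over regions means no region-level labels are needed to form the image-text score, which is what allows plain image-text pairs to supervise the alignment.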

updated: Mon Apr 10 2023 11:08:15 GMT+0000 (UTC)

published: Mon Apr 10 2023 11:08:15 GMT+0000 (UTC)

arXiv
