Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection

Hwanjun Song; Jihwan Bang

Prompt-OVD is an efficient and effective framework for open-vocabulary object detection that uses class embeddings from CLIP as prompts, guiding the Transformer decoder to detect objects in both base and novel classes. In addition, our novel RoI-based masked attention and RoI pruning techniques leverage the zero-shot classification ability of the Vision Transformer-based CLIP, improving detection performance at minimal computational cost. Our experiments on the OV-COCO and OV-LVIS datasets demonstrate that Prompt-OVD runs inference 21.2 times faster than the first end-to-end open-vocabulary detection method (OV-DETR), while also achieving higher APs than four two-stage methods operating within similar inference time ranges. Code will be made available soon.
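
The abstract does not spell out how the RoI-based masked attention is computed. As a rough illustration of the general idea only, and not the authors' implementation, the Python sketch below assumes that a predicted box is mapped onto the ViT patch grid and that attention is then restricted to the patches the box overlaps. The function names `roi_patch_mask` and `masked_softmax`, the overlap rule, and the toy sizes are all hypothetical.

```python
import math


def roi_patch_mask(box, image_size, patch_size):
    """Return a row-major list of booleans, one per ViT patch.

    An entry is True when the patch overlaps the box (assumed rule).
    box: (x0, y0, x1, y1) in pixels; image_size: (width, height).
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = width // patch_size
    rows = height // patch_size
    mask = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch_size, r * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            # Strict inequalities: touching only at an edge is not overlap.
            overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
            mask.append(overlaps)
    return mask


def masked_softmax(scores, mask):
    """Softmax over the scores where mask is True; masked entries get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask excludes every patch")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy setup: a 64x64 image split into a 4x4 grid of 16-pixel patches.
mask = roi_patch_mask(box=(16, 16, 48, 40), image_size=(64, 64), patch_size=16)
# Uniform scores, so the attention weight spreads evenly over the RoI patches.
weights = masked_softmax([0.0] * len(mask), mask)
```

In this toy setup the box covers four of the sixteen patches, so each of them receives a weight of 0.25 and every patch outside the box receives zero. A real detector would apply such a mask inside the attention layers of the image encoder rather than to a flat list of scores.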

updated: Sat Mar 25 2023 07:31:08 GMT+0000 (UTC)

published: Sat Mar 25 2023 07:31:08 GMT+0000 (UTC)

arXiv
