Learning to Detect and Segment for Open Vocabulary Object Detection

Tao Wang; Nan Li

Open vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained models, which help recognize novel objects given only their semantic categories. Prior work mainly focuses on transferring knowledge to object proposal classification and employs class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design that better generalizes box regression and mask segmentation to the open vocabulary setting. The core idea is to conditionally parameterize the network heads on the semantic embedding, so that the model is guided by class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads: the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to state-of-the-art open vocabulary object detection methods with very minor overhead; e.g., it surpasses a RegionCLIP model by 3.0 detection AP on novel categories with only 1.1% more computation.
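The two-stream idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name CondHeadSketch, the layer sizes, the softmax router, the linear weight generator, and the choice to sum the two streams' outputs are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a class semantic embedding could both mix a bank of static expert heads and generate a second set of head weights.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


class CondHeadSketch:
    """Toy box-regression head conditioned on a class semantic embedding.

    All shapes and the way the streams are combined are hypothetical.
    """

    def __init__(self, feat_dim=16, embed_dim=8, num_experts=4, out_dim=4):
        scale = 0.1
        # Stream 1 (dynamically aggregated head): a bank of static expert
        # heads plus a router that scores each expert from the embedding.
        self.experts = rng.normal(0, scale, (num_experts, out_dim, feat_dim))
        self.router = rng.normal(0, scale, (num_experts, embed_dim))
        # Stream 2 (dynamically generated head): a linear map that emits
        # the head's weights directly from the embedding.
        self.generator = rng.normal(0, scale, (out_dim * feat_dim, embed_dim))
        self.out_dim, self.feat_dim = out_dim, feat_dim

    def aggregated_weights(self, embedding):
        """Mix the expert heads with embedding-dependent coefficients."""
        coeffs = softmax(self.router @ embedding)  # shape (num_experts,)
        weights = np.tensordot(coeffs, self.experts, axes=1)
        return weights, coeffs

    def generated_weights(self, embedding):
        """Produce head weights as a function of the embedding."""
        flat = self.generator @ embedding
        return flat.reshape(self.out_dim, self.feat_dim)

    def predict(self, roi_feature, embedding):
        """Sum the outputs of both streams (an assumed combination rule)."""
        w_agg, _ = self.aggregated_weights(embedding)
        w_gen = self.generated_weights(embedding)
        return w_agg @ roi_feature + w_gen @ roi_feature


head = CondHeadSketch()
roi_feature = np.ones(16)               # stand-in for a pooled region feature
class_embedding = np.linspace(-1, 1, 8)  # stand-in for a text embedding
box_deltas = head.predict(roi_feature, class_embedding)
```

In this sketch the same region feature yields different box outputs for different class embeddings, which is the property that distinguishes a conditional head from the class-agnostic heads the abstract describes in prior work.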

updated: Wed Mar 29 2023 01:05:39 GMT+0000 (UTC)

published: Fri Dec 23 2022 03:54:59 GMT+0000 (UTC)

arXiv
