Open-Vocabulary DETR with Conditional Matching

Yuhang Zang; Wei Li; Kaiyang Zhou; Chen Huang; Chen Change Loy

Open-vocabulary object detection, the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector so that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and a better user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR, hence the name OV-DETR, which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge in turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as binary matching between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondences that generalize to unseen queries during testing. For training, we condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model such as CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR (the first end-to-end Transformer-based open-vocabulary detector) achieves non-trivial improvements over the current state of the art.
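The binary matching objective described above can be illustrated with a minimal sketch. It assumes the decoder, already conditioned on one query embedding, emits a single "matches the query" logit and a box per prediction, so the matching cost needs no per-class scores. The function names, the normalized (cx, cy, w, h) box format, and the box weight are hypothetical choices for illustration and are not taken from the OV-DETR implementation. A brute-force search stands in for the Hungarian algorithm to keep the example dependency-free.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l1_box_distance(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def binary_matching_cost(match_logits, pred_boxes, gt_boxes, box_weight=5.0):
    """Return cost[i][j] for prediction i and ground-truth box j.

    The classification term is class-agnostic: it only uses the probability
    that prediction i matches the conditioning query. box_weight is an
    illustrative value, not a setting confirmed by the source.
    """
    cost = []
    for logit, pred_box in zip(match_logits, pred_boxes):
        p_match = sigmoid(logit)
        cost.append(
            [-p_match + box_weight * l1_box_distance(pred_box, gt) for gt in gt_boxes]
        )
    return cost


def best_assignment(cost):
    """Brute-force one-to-one assignment; only suitable for tiny examples.

    Returns a tuple where entry j is the prediction index assigned to
    ground-truth box j.
    """
    num_preds = len(cost)
    num_gt = len(cost[0]) if cost else 0
    best, best_total = (), math.inf
    for perm in itertools.permutations(range(num_preds), num_gt):
        total = sum(cost[pred][gt] for gt, pred in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return best


# Three predictions conditioned on one query (e.g. a class name), two
# ground-truth boxes belonging to that query.
match_logits = [2.0, -1.0, 0.5]
pred_boxes = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1], [0.8, 0.8, 0.3, 0.3]]
gt_boxes = [[0.8, 0.8, 0.3, 0.3], [0.5, 0.5, 0.2, 0.2]]

cost = binary_matching_cost(match_logits, pred_boxes, gt_boxes)
print(best_assignment(cost))  # (2, 0)
```

Because the cost depends only on a matched-or-not probability, the same procedure applies unchanged to a query for a class never seen in training, which is the property the abstract relies on.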

updated: Tue Mar 22 2022 16:54:52 GMT+0000 (UTC)

published: Tue Mar 22 2022 16:54:52 GMT+0000 (UTC)

arXiv

