Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu; Tsung-Yi Lin; Weicheng Kuo; Yin Cui

We aim to advance open-vocabulary object detection, the task of detecting objects described by arbitrary text inputs. The fundamental challenge is the availability of training data: it is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and the image regions of object proposals. We then train a student detector whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP_r with a ResNet-50 backbone, outperforming even its supervised counterpart by 3.8. When trained with a stronger teacher model, ALIGN, ViLD achieves 26.3 AP_r. The model can transfer directly to other datasets without finetuning, achieving 72.2 AP_50 on PASCAL VOC, 36.6 AP on COCO, and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state of the art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
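The alignment idea in the abstract can be illustrated with a minimal sketch. The abstract only says that the student's region embeddings are aligned with the teacher's text and image embeddings; the specific loss forms used here (cross-entropy over temperature-scaled cosine similarities for text, an L1 distance for image embeddings), the temperature value, and all function names are assumptions made for illustration, not the released ViLD code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def text_alignment_loss(region_emb, text_embs, label, temperature=0.01):
    """Assumed form: cross-entropy over temperature-scaled cosine
    similarities between one region embedding and the teacher's
    category text embeddings."""
    logits = [cosine(region_emb, t) / temperature for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def image_alignment_loss(region_emb, teacher_image_emb):
    """Assumed form: L1 distance between the normalized student region
    embedding and the teacher's embedding of the same cropped region."""
    s = l2_normalize(region_emb)
    t = l2_normalize(teacher_image_emb)
    return sum(abs(x - y) for x, y in zip(s, t))

def classify_open_vocab(region_emb, text_embs):
    """At inference, pick the category whose text embedding is closest.
    New categories only require new text embeddings, not retraining."""
    sims = [cosine(region_emb, t) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy usage with made-up 3-d embeddings (real embeddings are much larger).
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region = [0.9, 0.1, 0.0]
predicted = classify_open_vocab(region, text_embs)
loss_text = text_alignment_loss(region, text_embs, label=0)
loss_image = image_alignment_loss(region, [1.0, 0.0, 0.0])
```

Because classification reduces to a similarity lookup against text embeddings, swapping in the text embeddings of unseen categories is what would let such a detector handle novel classes without finetuning, under the assumptions stated above.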

updated: Thu May 12 2022 01:27:40 GMT+0000 (UTC)

published: Wed Apr 28 2021 17:58:57 GMT+0000 (UTC)

arXiv
