Zero-shot Object Detection Through Vision-Language Embedding Alignment

Johnathan Xie; Shuai Zheng

Recent approaches have shown that training deep neural networks directly on large-scale image-text pair collections enables zero-shot transfer on various recognition tasks. One central issue is how this can be generalized to object detection, which involves the non-semantic task of localization as well as the semantic task of classification. To solve this problem, we introduce a vision-language embedding alignment method that transfers the generalization capabilities of a pretrained model such as CLIP to an object detector like YOLOv5. We formulate a loss function that aligns the detector's modified semantic prediction head with the image and text embeddings from the pretrained CLIP model. With this method, we train an object detector that achieves state-of-the-art performance on the COCO, ILSVRC, and Visual Genome zero-shot detection benchmarks. During inference, our model can be adapted to detect any number of object classes without additional training. We also find that standard object detection scaling transfers well to our method, with consistent improvements across various scales of YOLOv5 models and the YOLOv3 model. Lastly, we develop a self-labeling method that provides a significant score improvement without needing extra images or labels.
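The abstract does not spell out the loss or the inference step, so the following is only a minimal sketch of the general idea, under stated assumptions: the detector is taken to emit one embedding per predicted box, the alignment loss is stood in for by a mean cosine distance (the paper's actual loss may differ), and classes are assigned at inference by cosine similarity to per-class text embeddings. The names `alignment_loss` and `classify_boxes`, and the toy 3-dimensional vectors, are hypothetical and do not come from the paper or from the CLIP or YOLOv5 codebases.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def alignment_loss(pred_embeds, target_embeds):
    """Mean cosine distance between predicted and target embeddings.

    Assumption: this is a stand-in for the paper's alignment loss, whose
    exact form is not given in the abstract.
    """
    p = l2_normalize(pred_embeds)
    t = l2_normalize(target_embeds)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))


def classify_boxes(pred_embeds, class_text_embeds):
    """Assign each box the class whose text embedding is most similar.

    Changing the set of detectable classes only requires passing a
    different `class_text_embeds` matrix; no retraining is involved.
    """
    sims = l2_normalize(pred_embeds) @ l2_normalize(class_text_embeds).T
    return sims.argmax(axis=1), sims


# Toy data (hypothetical): three classes and two predicted boxes in a 3-D space.
class_text_embeds = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
pred_embeds = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8],
])

labels, sims = classify_boxes(pred_embeds, class_text_embeds)
loss = alignment_loss(pred_embeds, class_text_embeds[labels])
print(labels, loss)
```

In a real pipeline the rows of `class_text_embeds` would come from a pretrained text encoder applied to class-name prompts, which is what would let the class list be swapped at inference time without additional training.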

updated: Fri Aug 26 2022 03:54:26 GMT+0000 (UTC)

published: Fri Sep 24 2021 16:46:36 GMT+0000 (UTC)

arXiv

