Explore the Power of Synthetic Data on Few-shot Object Detection

Shaobo Lin; Kun Wang; Xingyu Zeng; Rui Zhao

Few-shot object detection (FSOD) aims to extend an object detector to novel categories given only a few training instances. The scarcity of training samples limits the performance of FSOD models. Recent text-to-image generation models have shown promising results in generating high-quality images, but how applicable these synthetic images are to FSOD tasks remains under-explored. This work extensively studies how synthetic images generated by state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two questions: (1) How can synthetic data be used for FSOD? (2) How can representative samples be found in a large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, salient object detection is applied to the original generated image, and the minimum enclosing box derived from the saliency map is used to crop the main object. The cropped object is then randomly pasted onto an image taken from the base dataset. We also study the influence of the input text given to the text-to-image generator and of the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and cluster-based method. However, using synthetic data cannot solve the severe problem of a high false positive (FP) ratio on novel categories in FSOD. We therefore propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter out 90% of FPs by applying a threshold to the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, with performance gains of up to 21.9% over the few-shot baseline.
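
To make the copy-paste step and the similarity-threshold filter described in the abstract more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a saliency map is already available as a 2-D array with values in [0, 1] and that image-text similarity scores have already been computed by some model such as CLIP; no saliency detector or CLIP call is made here. The function names, the saliency threshold of 0.5, the uniform choice of paste location, and the cutoff `tau` are all hypothetical choices made for illustration.

```python
import numpy as np


def min_enclosing_box(saliency, threshold=0.5):
    """Smallest box (x0, y0, x1, y1) covering pixels with saliency > threshold.

    x1 and y1 are exclusive. Returns None if no pixel passes the threshold.
    The threshold value is an assumption, not taken from the paper.
    """
    ys, xs = np.nonzero(saliency > threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


def paste_object(background, generated, saliency, rng, threshold=0.5):
    """Crop the salient object from `generated` and paste it onto `background`.

    The paste location is drawn uniformly at random (an assumption).
    Returns the new image and the pasted box, or (copy, None) if nothing fits.
    """
    box = min_enclosing_box(saliency, threshold)
    if box is None:
        return background.copy(), None
    x0, y0, x1, y1 = box
    crop = generated[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    bg_h, bg_w = background.shape[:2]
    if h > bg_h or w > bg_w:
        return background.copy(), None
    top = int(rng.integers(0, bg_h - h + 1))
    left = int(rng.integers(0, bg_w - w + 1))
    out = background.copy()
    out[top:top + h, left:left + w] = crop
    return out, (left, top, left + w, top + h)


def filter_false_positives(detections, similarities, tau):
    """Keep detections whose precomputed image-text similarity is at least tau."""
    return [d for d, s in zip(detections, similarities) if s >= tau]
```

In a sketch like this, the box returned by `paste_object` could serve as the bounding-box annotation for the pasted novel-category object, and `tau` would need to be tuned on held-out data.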

updated: Thu Mar 23 2023 12:34:52 GMT+0000 (UTC)

published: Thu Mar 23 2023 12:34:52 GMT+0000 (UTC)

arXiv
