FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training

Adrian Bulat; Ricardo Guerrero; Brais Martinez; Georgios Tzimiropoulos

This paper addresses Few-Shot Object Detection (FSOD): given a few templates (examples) depicting a novel class not seen during training, the goal is to detect all of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following desiderata: (a) it must be usable as is, without any fine-tuning at test time; (b) it must be able to process an arbitrary number of novel objects concurrently while supporting an arbitrary number of examples from each class; and (c) it must achieve accuracy comparable to a closed system. Towards satisfying (a)-(c), we introduce, for the first time, a simple yet powerful few-shot detection transformer (FS-DETR) based on visual prompting that addresses both desiderata (a) and (b). Our system builds upon the DETR framework and extends it with two key ideas: (1) feed the provided visual templates of the novel classes as visual prompts at test time, and (2) "stamp" these prompts with pseudo-class embeddings (akin to soft prompting), which are then predicted at the output of the decoder. Importantly, we show that our system is not only more flexible than existing methods but also makes a step towards satisfying desideratum (c). Specifically, it is significantly more accurate than all methods that do not require fine-tuning, and it matches or even outperforms the current state-of-the-art fine-tuning-based methods on the most well-established benchmarks (PASCAL VOC and MSCOCO).
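The two key ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_prompts` and `decode_labels`, the plain-list feature vectors, the additive form of the "stamp", and the fixed embedding values are all assumptions made for illustration (in the actual system the pseudo-class embeddings would be learned and the features would come from a backbone). The sketch only shows how templates from several novel classes might be stamped with a per-class pseudo-class embedding, and how a predicted pseudo-class index could be mapped back to a class name without any re-training.

```python
def build_prompts(templates_by_class, pseudo_embeddings):
    """Stamp every template feature with its class's pseudo-class embedding.

    templates_by_class: dict of class name -> list of feature vectors (hypothetical input)
    pseudo_embeddings:  list of vectors, one per available pseudo-class slot
    Returns (prompts, slot_to_class).
    """
    if len(templates_by_class) > len(pseudo_embeddings):
        raise ValueError("more novel classes than pseudo-class slots")
    prompts = []
    slot_to_class = {}
    for slot, (name, templates) in enumerate(templates_by_class.items()):
        slot_to_class[slot] = name
        stamp = pseudo_embeddings[slot]
        # Any number of templates per class is allowed; each gets the same stamp.
        for feat in templates:
            # Assumption: the stamp is applied by element-wise addition.
            prompts.append([f + s for f, s in zip(feat, stamp)])
    return prompts, slot_to_class


def decode_labels(predicted_slots, slot_to_class):
    """Map pseudo-class indices predicted by a decoder back to class names."""
    return [slot_to_class.get(s, "background") for s in predicted_slots]


# Toy data: two novel classes with a different number of templates each.
templates = {
    "cat": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.5, 0.5]],
    "dog": [[1.0, 0.0, 1.0, 0.0]],
}
# Stand-in for learned pseudo-class embeddings (three slots, dimension 4).
pseudo = [[1.0] * 4, [2.0] * 4, [3.0] * 4]

prompts, slot_to_class = build_prompts(templates, pseudo)
labels = decode_labels([1, 0, 2], slot_to_class)
```

Because the pseudo-class slots carry no fixed meaning, the same slots can be reassigned to different novel classes for each query, which is what would let such a design handle new classes at test time without fine-tuning.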

updated: Sun Aug 20 2023 12:23:49 GMT+0000 (UTC)

published: Mon Oct 10 2022 17:03:03 GMT+0000 (UTC)

arXiv
