Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting

Guangxing Han; Long Chen; Jiawei Ma; Shiyuan Huang; Rama Chellappa; Shih-Fu Chang

We study multi-modal few-shot object detection (FSOD), using both few-shot visual examples and class semantic information for detection; the two are complementary by definition. Most previous work on multi-modal FSOD is fine-tuning-based, which is inefficient for online applications. Moreover, these methods usually require expert knowledge such as class names to extract class semantic embeddings, and this knowledge is hard to obtain for rare classes. Our approach is motivated by the high-level conceptual similarity between (metric-based) meta-learning and prompt-based learning, which learn generalizable few-shot and zero-shot object detection models, respectively, without fine-tuning. Specifically, we combine the few-shot visual classifier and the text classifier, learned via meta-learning and prompt-based learning respectively, to build the multi-modal classifier and detection models. In addition, to fully exploit pre-trained language models, we propose meta-learning-based cross-modal prompting, which generates soft prompts for the novel classes present in the few-shot visual examples; these prompts are then used to learn the text classifier. Knowledge distillation is introduced to learn the soft prompt generator without using human prior knowledge of class names, which may not be available for rare classes. Our insight is that the few-shot support images naturally include related context information and semantics of the class. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks and achieve promising results.
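To make the idea of combining a few-shot visual classifier with a text classifier concrete, here is a minimal sketch, not the authors' implementation. It assumes that visual and text embeddings already live in one shared space, that the visual classifier is a prototype (mean of support features) scored by cosine similarity, and that the two scores are fused by a weighted sum. The function names, the `alpha` weight, and the `temperature` value are all hypothetical; the abstract does not state how the two classifiers are fused, and the soft prompt generator and language model are replaced here by given text vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def class_prototype(support_features):
    """Metric-based few-shot classifier weight: mean of the support features."""
    n = len(support_features)
    dim = len(support_features[0])
    return [sum(f[i] for f in support_features) / n for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multimodal_scores(query, support_by_class, text_emb_by_class,
                      alpha=0.5, temperature=0.1):
    """Fuse visual-prototype and text-embedding similarities per class.

    alpha and temperature are illustrative assumptions, not values
    taken from the paper.
    """
    classes = sorted(support_by_class)
    logits = []
    for c in classes:
        visual = cosine(query, class_prototype(support_by_class[c]))
        textual = cosine(query, text_emb_by_class[c])
        logits.append((alpha * visual + (1 - alpha) * textual) / temperature)
    return dict(zip(classes, softmax(logits)))


# Toy 2-D features standing in for region and text embeddings.
support = {"cat": [[1.0, 0.1], [0.9, 0.0]], "dog": [[0.0, 1.0], [0.1, 0.9]]}
text = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
probs = multimodal_scores([0.8, 0.2], support, text)
```

In this toy setup the query feature lies close to both the "cat" prototype and the "cat" text vector, so `probs` assigns "cat" the larger probability. No fine-tuning step is involved: new classes are added simply by supplying new support features and text vectors.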

updated: Mon Mar 27 2023 15:40:57 GMT+0000 (UTC)

published: Sat Apr 16 2022 16:45:06 GMT+0000 (UTC)

arXiv
