Multi-Modal Few-Shot Temporal Action Detection

Sauradip Nag; Mengmeng Xu; Xiatian Zhu; Juan-Manuel Perez-Rua; Bernard Ghanem; Yi-Zhe Song; Tao Xiang

Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples, relying instead on a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD, leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. It works by efficiently bridging pretrained vision and language models whilst maximally reusing their already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned, adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that MUPPET can be easily extended to the few-shot object detection problem, where it again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET
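
The multi-modal prompt idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the tokenizer or adapter architecture, so mean pooling over frames, a single bias-free linear adapter, the random weights, and all names (mean_pool, linear, build_prompt) are assumptions made purely for illustration. It only shows the shape of the idea: each support video becomes a token in the text embedding space and is concatenated with the class-name tokens.

```python
import random


def mean_pool(frames):
    """Average per-frame feature vectors into one video-level vector (assumed pooling)."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def linear(vec, weight):
    """Apply a bias-free linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def build_prompt(support_videos, class_name_tokens, adapter_weight):
    """Hypothetical prompt builder: one visual token per support video,
    followed by the class-name token embeddings."""
    visual_tokens = [linear(mean_pool(video), adapter_weight) for video in support_videos]
    return visual_tokens + class_name_tokens


# Toy dimensions and random values, chosen only to make the sketch runnable.
rng = random.Random(0)
VIS_DIM, TXT_DIM = 4, 3

# Stand-in for the meta-learned adapter (here just a fixed random matrix).
adapter_weight = [[rng.uniform(-1, 1) for _ in range(VIS_DIM)] for _ in range(TXT_DIM)]

# Two support videos (2-shot), each with five frame-level feature vectors.
support_videos = [
    [[rng.random() for _ in range(VIS_DIM)] for _ in range(5)] for _ in range(2)
]

# Stand-in embeddings for a class name split into two text tokens.
class_name_tokens = [[rng.random() for _ in range(TXT_DIM)] for _ in range(2)]

prompt = build_prompt(support_videos, class_name_tokens, adapter_weight)
print(len(prompt), "tokens of dimension", len(prompt[0]))
```

In a real system the resulting token sequence would be fed to the text encoder of a vision-language model, and the adapter would be trained episodically rather than sampled at random; both steps are outside the scope of this sketch.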

updated: Mon Mar 27 2023 08:39:13 GMT+0000 (UTC)

published: Sun Nov 27 2022 18:13:05 GMT+0000 (UTC)

arXiv
