CLIP-guided Prototype Modulating for Few-shot Action Recognition

Xiang Wang; Shiwei Zhang; Jun Cen; Changxin Gao; Yingya Zhang; Deli Zhao; Nong Sang

Learning from large-scale contrastive language-image pre-training such as CLIP has recently shown remarkable success in a wide range of downstream tasks, but it remains under-explored for the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate inaccurate prototype estimation caused by data scarcity, a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and prototype modulation. Specifically, the former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos with their corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with a temporal Transformer. In this way, CLIP-FSAR can take full advantage of the rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate few-shot classification. Extensive experiments on five commonly used benchmarks demonstrate the effectiveness of the proposed method, and CLIP-FSAR significantly outperforms existing state-of-the-art methods under various settings. The source code and models will be publicly available at https://github.com/alibaba-mmai-research/CLIP-FSAR.
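
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the function names, the temperature value, and the feature shapes below are all assumptions, and a parameter-free self-attention step stands in for the learned temporal Transformer. The first function contrasts one video feature against every class text feature; the second prepends a class text feature to the support frame features and lets attention mix them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def video_text_contrastive_loss(video_feat, text_feats, label, temperature=0.07):
    """Cross-entropy over cosine similarities between one video and all class texts.

    The temperature default is an assumption made for illustration.
    """
    v = normalize(video_feat)
    logits = [dot(v, normalize(t)) / temperature for t in text_feats]
    return -math.log(softmax(logits)[label])

def self_attention(tokens):
    """Parameter-free single-head self-attention.

    A simplified stand-in (assumption) for a learned temporal Transformer.
    """
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        out.append([sum(w * k[i] for w, k in zip(weights, tokens))
                    for i in range(len(q))])
    return out

def modulate_prototype(text_feat, frame_feats):
    """Prepend the class text feature to the support frames and mix them."""
    mixed = self_attention([text_feat] + frame_feats)
    return mixed[1:]  # drop the text token; keep the text-modulated frame features

# Hypothetical 2-D features for two classes and one two-frame support video.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
loss = video_text_contrastive_loss([0.9, 0.1], text_feats, label=0)
prototype = modulate_prototype(text_feats[0], [[0.0, 1.0], [0.0, 1.0]])
```

In this toy setting, each modulated frame feature is a weighted average of the text feature and the frame features, so the prototype is pulled toward the class text embedding. A real system would use CLIP encoders and learned attention weights in place of the hand-written vectors and the parameter-free attention.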

updated: Mon Mar 06 2023 09:17:47 GMT+0000 (UTC)

published: Mon Mar 06 2023 09:17:47 GMT+0000 (UTC)

arXiv
