Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Jiazheng Xing; Mengmeng Wang; Xiaojun Hou; Guang Dai; Jingdong Wang; Yong Liu

Applying large-scale pre-trained visual models such as CLIP to few-shot action recognition can improve both performance and efficiency. The "pre-training, fine-tuning" paradigm avoids training a network from scratch, which can be time-consuming and resource-intensive. However, this approach has two drawbacks. First, the limited labeled samples available in few-shot action recognition require keeping the number of tunable parameters small to mitigate over-fitting; naive fine-tuning is therefore ill-suited, as it increases resource consumption and may disrupt the generalized representation of the model. Second, pre-trained visual models are usually image models, so the extra temporal dimension of video makes effective temporal modeling in the few-shot setting challenging. This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP for few-shot action recognition by adding lightweight adapters, which minimize the number of learnable parameters and enable the model to transfer quickly across different tasks. The adapters we design combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and has low training costs. Additionally, based on the attention mechanism, we design a text-guided prototype construction module that fully utilizes video-text information to enhance the representation of video prototypes. Our MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.
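To make the idea of a lightweight adapter concrete, the following is a minimal sketch of a generic bottleneck adapter with a residual connection, as commonly used in parameter-efficient tuning. It is not the MA-CLIP adapter itself, whose design the abstract does not specify. The names `make_adapter`, `adapter_forward`, and `count_params`, the ReLU non-linearity, the omission of bias terms, and the zero-initialized up-projection are all assumptions made for illustration.

```python
import math
import random


def make_adapter(dim, bottleneck, seed=0):
    """Create hypothetical adapter weights: a down- and an up-projection."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return {
        # Down-projection (bottleneck x dim), small random initialization.
        "down": [[rng.gauss(0.0, scale) for _ in range(dim)]
                 for _ in range(bottleneck)],
        # Up-projection (dim x bottleneck), zero-initialized (an assumption)
        # so the adapter starts out as the identity mapping.
        "up": [[0.0] * bottleneck for _ in range(dim)],
    }


def adapter_forward(adapter, x):
    """Return x + up(relu(down(x))) for one feature vector x."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)))
              for row in adapter["down"]]
    delta = [sum(w * h for w, h in zip(row, hidden))
             for row in adapter["up"]]
    return [xi + di for xi, di in zip(x, delta)]


def count_params(adapter):
    """Count the trainable weights held by the adapter."""
    return sum(len(row) for matrix in adapter.values() for row in matrix)


adapter = make_adapter(dim=8, bottleneck=2)
features = [0.5] * 8
output = adapter_forward(adapter, features)
```

In a setup like this, the pre-trained backbone weights would stay frozen and only the adapter weights would be trained. For a hypothetical feature width of 768 and a bottleneck of 64, such an adapter holds 2 × 768 × 64 = 98,304 weights, compared with 768 × 768 = 589,824 for a single full-width projection.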

updated: Thu Aug 03 2023 04:17:25 GMT+0000 (UTC)

published: Thu Aug 03 2023 04:17:25 GMT+0000 (UTC)

arXiv
