Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

Wenhao Wu; Zhun Sun; Wanli Ouyang

Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. With the growth of computational capacity, open-source vision-language models pre-trained at large scale, in both model architecture and amount of data, are now available. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for visual classification and leave the use of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revise the role of the linear classifier and replace the classifier with different knowledge from the pre-trained model. We use the well-pretrained language model to generate good semantic targets for efficient transfer learning. Our empirical study shows that the method improves both the performance and the training speed of video classification, with a negligible change to the model. Our simple yet effective tuning paradigm achieves state-of-the-art performance and efficient training in various video recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In particular, our paradigm achieves state-of-the-art accuracy of 87.8% on Kinetics-400, and also surpasses previous methods by 20-50% absolute top-1 accuracy under zero-shot and few-shot settings on five popular video datasets. Code and models can be found at https://github.com/whwu95/Text4Vis .
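The core idea in the abstract, using text-encoder outputs in place of a randomly initialized linear classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_text_encoder`, `EMBED_DIM`, the class names, and the cosine-similarity scoring are all assumptions made for illustration, and the hash-based encoder only stands in for a real pre-trained language model.

```python
import hashlib
import math

EMBED_DIM = 8  # hypothetical feature size, chosen only for this sketch


def toy_text_encoder(text):
    """Stand-in for a pre-trained text encoder (an assumption, not the real model).

    Maps a string to a deterministic unit-length vector derived from a hash.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:EMBED_DIM]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def build_text_classifier(class_names):
    """Use class-name embeddings as the rows of the classifier weight matrix,
    instead of drawing the rows at random."""
    return [toy_text_encoder(name) for name in class_names]


def classify(video_feature, weights):
    """Score a video feature against each class embedding by cosine similarity."""
    norm = math.sqrt(sum(v * v for v in video_feature))
    unit = [v / norm for v in video_feature]
    return [sum(w * u for w, u in zip(row, unit)) for row in weights]


# Hypothetical label set; a real dataset would supply its own class names.
class_names = ["playing guitar", "riding a bike", "swimming"]
weights = build_text_classifier(class_names)

# Pretend a video encoder produced a feature aligned with the "swimming" embedding.
video_feature = [2.0 * w for w in weights[2]]
scores = classify(video_feature, weights)
predicted = class_names[max(range(len(scores)), key=scores.__getitem__)]
print(predicted)
```

Because the classifier rows come from text, new categories can be scored by encoding their names without retraining a head, which is one way to read the zero-shot setting mentioned above.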

updated: Sun Mar 26 2023 16:28:26 GMT+0000 (UTC)

published: Mon Jul 04 2022 10:00:47 GMT+0000 (UTC)

arXiv
