Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Zejia Weng; Xitong Yang; Ang Li; Zuxuan Wu; Yu-Gang Jiang

Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-shot learning abilities for image understanding, yet limited effort has been made to investigate CLIP for zero-shot video recognition. We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time. Our framework extends CLIP with minimal modifications to model spatio-temporal relationships in videos, making it a specialized video classifier while still striving for generalization. We formally show that training Open-VCLIP is equivalent to continual learning with zero historical data. To address this problem, we propose Interpolated Weight Optimization, which exploits the benefits of weight interpolation at both training and test time. We evaluate our method on three popular and challenging action recognition datasets following various zero-shot evaluation protocols, and we demonstrate that our approach outperforms state-of-the-art methods by clear margins. In particular, we achieve 87.9%, 58.3%, and 81.1% zero-shot accuracy on UCF, HMDB, and Kinetics-600, respectively, outperforming state-of-the-art methods by 8.3%, 7.8%, and 12.2%.
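The abstract names weight interpolation but does not give its exact form. As a rough illustration only, the sketch below shows generic linear interpolation between two checkpoints, assuming the merged weights are (1 - lam) * original + lam * fine-tuned. The function name `interpolate_weights`, the coefficient `lam`, and the toy parameter dictionaries are hypothetical and are not taken from the paper, which may use a different scheme.

```python
from typing import Dict, List

# A checkpoint is modelled here as a mapping from parameter name to a flat
# list of floats. This is an assumption made to keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def interpolate_weights(original: Weights, finetuned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * original + lam * finetuned, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if original.keys() != finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")

    merged: Weights = {}
    for name, orig_values in original.items():
        tuned_values = finetuned[name]
        if len(orig_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - lam) * o + lam * t for o, t in zip(orig_values, tuned_values)
        ]
    return merged


# Toy stand-ins for the original CLIP weights and video-fine-tuned weights
# (hypothetical values, for illustration only).
clip_weights = {"proj": [1.0, 2.0], "bias": [0.0]}
video_weights = {"proj": [3.0, 6.0], "bias": [1.0]}

merged = interpolate_weights(clip_weights, video_weights, lam=0.5)
```

With `lam=0` the result equals the original weights and with `lam=1` it equals the fine-tuned weights; intermediate values trade off between the two.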

updated: Wed Feb 01 2023 17:44:17 GMT+0000 (UTC)

published: Wed Feb 01 2023 17:44:17 GMT+0000 (UTC)

arXiv
