Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Zejia Weng; Xitong Yang; Ang Li; Zuxuan Wu; Yu-Gang Jiang

Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-shot learning abilities for image understanding, yet limited effort has been made to investigate CLIP for zero-shot video recognition. We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time. Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier while still striving for generalization. We formally show that training Open-VCLIP is equivalent to continual learning with zero historical data. To address this problem, we propose Interpolated Weight Optimization, which exploits the benefits of weight interpolation at both training and test time. We evaluate our method on three popular and challenging action recognition datasets following various zero-shot evaluation protocols, and demonstrate that our approach outperforms state-of-the-art methods by clear margins. In particular, we achieve 87.9%, 58.3%, and 81.1% zero-shot accuracy on UCF, HMDB, and Kinetics-600 respectively, outperforming state-of-the-art methods by 8.3%, 7.8%, and 12.2%. Code is released at https://github.com/wengzejia1/Open-VCLIP.
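The abstract names weight interpolation but does not spell out its form. As a rough illustration only, the sketch below shows generic linear interpolation between a pretrained set of weights and a fine-tuned one, parameter by parameter. This is an assumption about the general idea, not the authors' exact procedure: the function name `interpolate_weights`, the coefficient `lam`, and the toy values are all hypothetical, and plain Python lists stand in for real model tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * pretrained + lam * finetuned for every parameter.

    lam = 0 keeps the pretrained weights; lam = 1 keeps the fine-tuned ones.
    This is generic linear interpolation, assumed here for illustration.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both weight sets must have the same parameter names")

    merged: Weights = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [(1.0 - lam) * a + lam * b for a, b in zip(w_pre, w_ft)]
    return merged


# Toy values (hypothetical) standing in for CLIP and video-fine-tuned weights.
clip_weights = {"proj": [0.0, 2.0], "bias": [1.0]}
video_weights = {"proj": [4.0, 2.0], "bias": [3.0]}
merged = interpolate_weights(clip_weights, video_weights, lam=0.5)
```

With `lam` between 0 and 1, the merged weights sit on the straight line between the two models, which is one simple way to trade specialization on video data against the generality of the original weights.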

updated: Wed May 31 2023 02:54:28 GMT+0000 (UTC)

published: Wed Feb 01 2023 17:44:17 GMT+0000 (UTC)

arXiv
