VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval

Siteng Huang; Biao Gong; Yulin Pan; Jianwen Jiang; Yiliang Lv; Yuyuan Li; Donglin Wang


Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings huge computational burdens with many more parameters, but also leads to forgetting of knowledge from the upstream models. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve the performance with different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that, compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
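
For readers unfamiliar with the R@1 metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval. It is not taken from the VoP code base: the function name `recall_at_k`, the toy similarity values, and the assumption that text query i is paired with exactly one ground-truth video i are all illustrative.

```python
def recall_at_k(sim, k=1):
    """Fraction of text queries whose paired video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth match for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: the number of videos
        # that score strictly higher (ties are resolved optimistically).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix with made-up numbers: 3 text queries x 3 videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

In this toy case the second query ranks a wrong video first, so it counts as a miss at K=1 and as a hit at K=2. An "average R@1 gain" as reported in the abstract would then be a difference of such scores averaged over benchmarks.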

updated: Wed Mar 08 2023 06:31:05 GMT+0000 (UTC)

published: Wed Nov 23 2022 08:20:29 GMT+0000 (UTC)

arXiv

