UATVR: Uncertainty-Adaptive Text-Video Retrieval

Bo Fang; Wenhao Wu; Chang Liu; Yu Zhou; Yuxin Song; Weiping Wang; Xiangbo Shu; Xiangyang Ji; Jingdong Wang


With the explosive growth of web videos and the emergence of large-scale vision-language pre-training models such as CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to map text-video pairs into the same embedding space and craft cross-modal interactions between certain entities at specific granularities to establish semantic correspondence. Unfortunately, the intrinsic uncertainty about which combination of entities, at which granularity, is optimal for a cross-modal query is understudied. This uncertainty is especially critical for modalities with hierarchical semantics, such as video and text. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure. Concretely, we add learnable tokens to the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions from which prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks demonstrate the superiority of UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
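
The idea of matching distributions rather than single points can be illustrated with a small sketch. This is not the authors' implementation (see the repository above for that). It assumes, purely for illustration, that each text or video is summarized by a diagonal Gaussian (a mean vector and a log-variance vector), that prototypes are drawn as mean + std * noise, and that the matching score is the average cosine similarity over all sampled prototype pairs. The function names, the number of prototypes k, and the averaging rule are all hypothetical choices.

```python
import math
import random


def sample_prototypes(mean, log_var, k, rng):
    """Draw k samples from a diagonal Gaussian as mean + std * eps (assumed form)."""
    std = [math.exp(0.5 * lv) for lv in log_var]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        for _ in range(k)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distribution_similarity(text_dist, video_dist, k=8, seed=0):
    """Score a text-video pair by comparing prototypes sampled from each side.

    Each *_dist is a (mean, log_var) tuple. Averaging over all k * k prototype
    pairs is an illustrative aggregation, not necessarily the paper's.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    text_protos = sample_prototypes(*text_dist, k, rng)
    video_protos = sample_prototypes(*video_dist, k, rng)
    sims = [cosine(t, v) for t in text_protos for v in video_protos]
    return sum(sims) / len(sims)


# Toy 4-dimensional embeddings with small variance (log_var = -6).
text = ([1.0, 0.0, 0.0, 0.0], [-6.0] * 4)
matching_video = ([0.9, 0.1, 0.0, 0.0], [-6.0] * 4)
unrelated_video = ([0.0, 0.0, 1.0, 0.0], [-6.0] * 4)

score_match = distribution_similarity(text, matching_video)
score_other = distribution_similarity(text, unrelated_video)
```

In this toy setup the matching video scores higher than the unrelated one. Raising the log-variance widens the sampled prototypes, which is one simple way to picture how uncertainty in an embedding could soften a match.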

updated: Sat Aug 19 2023 02:28:10 GMT+0000 (UTC)

published: Mon Jan 16 2023 08:43:17 GMT+0000 (UTC)

arXiv

