DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Peng Jin; Hao Li; Zesen Cheng; Kehan Li; Xiangyang Ji; Chang Liu; Li Yuan; Jie Chen

Existing text-video retrieval solutions are, in essence, discriminative models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) show superior performance and justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet also performs well in out-of-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code is available at https://github.com/jpthu17/DiffusionRet.
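The abstract says the feature extractor is trained with a contrastive loss but does not give its form. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired text and video embeddings, a common choice in text-video retrieval. The function names, the temperature value, and the use of NumPy are assumptions made for this example and are not taken from the DiffusionRet code.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss; row i of each array is a matched pair.

    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # logits[i, j] is the scaled similarity between text i and video j.
    logits = (t @ v.T) / temperature
    idx = np.arange(len(t))

    # Text-to-video: each text should rank its own video first (softmax over rows).
    t2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Video-to-text: each video should rank its own text first (softmax over columns).
    v2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (t2v + v2t)
```

In this sketch, a batch whose text and video embeddings are correctly paired yields a lower loss than the same batch with the pairing shuffled, which is the signal that pulls matched pairs together during training. The generation loss for the diffusion-based generator is a separate term and is not covered here.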

updated: Sat Aug 19 2023 08:31:57 GMT+0000 (UTC)

published: Fri Mar 17 2023 10:07:19 GMT+0000 (UTC)

arXiv
