Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

Yimu Wang; Peng Shi


Recent progress in video-text retrieval has been driven largely by better representation learning. In this paper, we present SUMA, a novel multi-space, multi-grained supervised learning framework that learns an aligned representation space shared between video and text for video-text retrieval. The shared aligned space is initialized with a finite number of concept clusters, each of which refers to a number of basic concepts (words). Because text data is available, the shared aligned space can be updated in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to compute both fine-grained and coarse-grained similarity. Extensive experiments on several video-text retrieval benchmarks show that SUMA, benefiting from the learned shared aligned space and multi-grained similarity, outperforms existing methods.
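
The abstract does not give formulas for the fine-grained and coarse-grained similarities, so the following is only an illustrative sketch of the general idea, not SUMA's actual implementation. It assumes cosine similarity, mean pooling over frames for the coarse (video-level) score, a maximum over frames for the fine (frame-level) score, and a hypothetical mixing parameter `weight`; all function names are invented for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def coarse_similarity(frames, text):
    """Assumed coarse-grained score: cosine similarity between the
    mean-pooled frame embeddings (one video vector) and the text embedding."""
    video = l2_normalize(frames.mean(axis=0))
    return float(video @ l2_normalize(text))


def fine_similarity(frames, text):
    """Assumed fine-grained score: the best cosine similarity between
    any single frame embedding and the text embedding."""
    per_frame = l2_normalize(frames) @ l2_normalize(text)
    return float(per_frame.max())


def multi_grained_similarity(frames, text, weight=0.5):
    """Hypothetical combination: weighted sum of coarse and fine scores."""
    coarse = coarse_similarity(frames, text)
    fine = fine_similarity(frames, text)
    return weight * coarse + (1.0 - weight) * fine


# Toy example: two frame embeddings and one text embedding in 2-D.
frames = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
score = multi_grained_similarity(frames, text)
```

In this toy case the text matches the first frame exactly, so the frame-level score is higher than the video-level score, which is diluted by the unrelated second frame. That gap is the kind of signal a multi-grained similarity is meant to capture.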

updated: Sun Feb 19 2023 04:03:22 GMT+0000 (UTC)

published: Sun Feb 19 2023 04:03:22 GMT+0000 (UTC)

arXiv
