Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Damianos Galanopoulos; Vasileios Mezaris

In this paper, we tackle the cross-modal video retrieval problem, focusing specifically on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that generate multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, the proposed network architecture is trained with a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations that revise the inferred query-video similarities. Extensive experiments in several setups on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how best to combine textual and visual features, and document the performance of the proposed network. Source code is publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
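The two ideas in the abstract, one joint space per text-visual feature pair and a softmax-based revision of the similarities at retrieval time, can be illustrated with a small sketch. This is not the authors' implementation: the feature names (`t1`, `t2`, `v1`, `v2`), the use of cosine similarity, the summation across spaces, and the specific form of the softmax revision (each similarity reweighted by a softmax taken over all queries for the same video) are assumptions made purely for illustration. The embeddings are toy values standing in for the output of learned projections.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((x - peak) / temperature) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_space_similarity(query_embs, video_embs, pairs):
    """Sum per-space cosine similarities into one query-by-video matrix.

    query_embs[pair] and video_embs[pair] are lists of vectors that are
    assumed to be already projected into the joint space of that pair.
    """
    n_q = len(query_embs[pairs[0]])
    n_v = len(video_embs[pairs[0]])
    sim = [[0.0] * n_v for _ in range(n_q)]
    for pair in pairs:
        for q in range(n_q):
            for v in range(n_v):
                sim[q][v] += cosine(query_embs[pair][q], video_embs[pair][v])
    return sim


def revise_similarities(sim, temperature=1.0):
    """Assumed revision step: weight sim[q][v] by a softmax over queries."""
    n_q, n_v = len(sim), len(sim[0])
    revised = [[0.0] * n_v for _ in range(n_q)]
    for v in range(n_v):
        weights = softmax([sim[q][v] for q in range(n_q)], temperature)
        for q in range(n_q):
            revised[q][v] = sim[q][v] * weights[q]
    return revised


# Hypothetical feature names; every text feature is paired with every
# visual feature, giving one joint space per pair.
text_feats = ["t1", "t2"]
visual_feats = ["v1", "v2"]
pairs = list(product(text_feats, visual_feats))

# Toy embeddings: 2 queries and 3 videos, identical in every space.
query_embs = {pair: [[1.0, 0.0], [0.0, 1.0]] for pair in pairs}
video_embs = {pair: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] for pair in pairs}

sim = multi_space_similarity(query_embs, video_embs, pairs)
revised = revise_similarities(sim)

# Rank the videos for the first query by revised similarity.
ranking = sorted(range(len(revised[0])), key=lambda v: -revised[0][v])
```

In this toy setup, two textual and two visual features yield four joint spaces. The revision step lowers the score of a video that matches several queries equally well relative to a video that matches only one query, which is the intuition behind reweighting by a softmax over queries.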

updated: Mon Nov 21 2022 11:08:13 GMT+0000 (UTC)

published: Mon Nov 21 2022 11:08:13 GMT+0000 (UTC)

arXiv
