A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension

Weijia Wu; Yuzhong Zhao; Zhuang Li; Jiahong Li; Hong Zhou; Mike Zheng Shou; Xiang Bai

Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i.e., visual representation, even though text is omnipresent in human environments and is frequently critical to understanding video. To study how to retrieve video using both modalities, i.e., visual and text semantic representations, we first introduce TextVR, a large-scale cross-modal video retrieval dataset with text reading comprehension. It contains 42.2k sentence queries for 10.5k videos across 8 scenario domains: Street View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV Show, and Cooking. TextVR requires a single unified cross-modal model to recognize and comprehend text, relate it to the visual context, and decide which text semantic information is vital for the video retrieval task. In addition, we present a detailed analysis of TextVR compared with existing datasets and design a novel multimodal video retrieval baseline for the text-based video retrieval task. The dataset analysis and extensive experiments show that the TextVR benchmark offers the video-and-language community many new technical challenges and insights compared with previous datasets. The project website and GitHub repo can be found at https://sites.google.com/view/loveucvpr23/guest-track and https://github.com/callsys/TextVR, respectively.

updated: Fri May 05 2023 08:00:14 GMT+0000 (UTC)

published: Fri May 05 2023 08:00:14 GMT+0000 (UTC)

arXiv
