Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval

Jianfeng Dong; Yabing Wang; Xianke Chen; Xiaoye Qu; Xirong Li; Yuan He; Xun Wang

This paper addresses text-to-video retrieval: given a query in the form of a natural-language sentence, the task is to retrieve the videos that are semantically relevant to the query from a large collection of unlabeled videos. Success on this task depends on cross-modal representation learning that projects both videos and sentences into common spaces for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component of text-to-video retrieval. Inspired by the reading strategy of humans, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos, which consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch is designed to briefly capture the overview information of a video, while the intensive-reading branch is designed to obtain more in-depth information. Moreover, the intensive-reading branch is aware of the video overview captured by the previewing branch. We find that such holistic information helps the intensive-reading branch extract more fine-grained features. Extensive experiments on three datasets show that RIVRL achieves a new state of the art on TGIF and VATEX. Moreover, on MSR-VTT, our model using two video features performs comparably to the state of the art using seven video features and even outperforms models pre-trained on the large-scale HowTo100M dataset.
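The retrieval step described in the abstract, scoring videos against a sentence query in a common space, can be illustrated with a minimal sketch. This is not the RIVRL model: the embeddings below are hand-written toy vectors standing in for the outputs of learned video and text encoders, cosine similarity is assumed as the similarity measure, and the names `cosine`, `rank_videos`, and the video ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by descending similarity to the query.

    query_vec:  embedding of the sentence query in the common space.
    video_vecs: dict mapping video id -> embedding in the same space.
    """
    scores = {vid: cosine(query_vec, vec) for vid, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed, for illustration only). In a real system these
# would come from trained video and sentence encoders.
videos = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.1, 0.9, 0.2],
    "video_c": [0.5, 0.5, 0.5],
}
query = [1.0, 0.2, 0.0]

ranking = rank_videos(query, videos)
print(ranking)
```

In this sketch the quality of the ranking depends entirely on the embeddings, which is why the paper concentrates on how the video side is represented.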

updated: Mon Feb 14 2022 07:45:42 GMT+0000 (UTC)

published: Sun Jan 23 2022 03:38:37 GMT+0000 (UTC)

arXiv
