Self-supervised Video Retrieval Transformer Network

Xiangteng He; Yulin Pan; Mingqian Tang; Yiliang Lv

Content-based video retrieval aims to find videos in a large video database that are similar to, or even near-duplicates of, a given query video. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive an effective video representation, most video retrieval systems require a large amount of manually annotated training data, which is costly and inefficient. In addition, most retrieval systems rely on frame-level features for video similarity search, which is expensive in both storage and search. We propose a novel video retrieval system, termed SVRTN, that addresses both shortcomings. First, it applies self-supervised training to learn video representations from unlabeled data, avoiding the high cost of manual annotation. Second, it uses a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity. The transformer learns complementary and discriminative information from the interactions among the frames of a clip, and it acquires invariance to frame permutation and to missing frames, which supports more flexible retrieval modes. Comprehensive experiments on two challenging video retrieval datasets, FIVR-200K and SVD, verify the effectiveness of the proposed SVRTN method, which achieves the best video retrieval performance in both accuracy and efficiency.
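
The aggregation step described above can be illustrated with a minimal sketch. It is not the SVRTN architecture, which the abstract does not specify. It assumes a single self-attention layer with identity projections (no learned weights, no positional encoding) followed by mean pooling. All names (`self_attention`, `clip_feature`, `aggregate_video`, `clip_len`) are hypothetical. The sketch shows only two properties named in the abstract: fewer vectors to store and search, and a clip feature that does not depend on frame order.

```python
import math
import random


def self_attention(frames):
    """Single-head self-attention with identity projections (assumption:
    no learned weights and no positional encoding, purely illustrative)."""
    d = len(frames[0])
    out = []
    for q in frames:
        # Scaled dot-product scores of this frame against every frame in the clip.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # Attention output: weighted sum of all frame vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)])
    return out


def clip_feature(frames):
    """Aggregate the frame-level features of one clip into a single vector."""
    attended = self_attention(frames)
    n, d = len(attended), len(attended[0])
    return [sum(f[j] for f in attended) / n for j in range(d)]


def aggregate_video(frame_features, clip_len):
    """Split a video into consecutive clips and keep one vector per clip."""
    return [clip_feature(frame_features[i:i + clip_len])
            for i in range(0, len(frame_features), clip_len)]


random.seed(0)
# Hypothetical video: 32 frames, each with an 8-dimensional feature.
video = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
clips = aggregate_video(video, clip_len=4)
print(len(video), "frame vectors ->", len(clips), "clip vectors")

# Reversing the frames of a clip leaves its feature unchanged (up to rounding).
original = clip_feature(video[:4])
permuted = clip_feature(list(reversed(video[:4])))
print(all(abs(a - b) < 1e-9 for a, b in zip(original, permuted)))
```

With a clip length of 4, the index holds one vector for every four frames, so storage and the number of similarity comparisons both fall by roughly that factor. The order invariance in this sketch comes from omitting positional information; how SVRTN obtains its invariance to frame permutation and missing frames is not stated in the abstract.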

updated: Fri Apr 16 2021 09:43:45 GMT+0000 (UTC)

published: Fri Apr 16 2021 09:43:45 GMT+0000 (UTC)

arXiv
