QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Jie Lei; Tamara L. Berg; Mohit Bansal

Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr
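To make the three-part annotation described in the abstract more concrete, the following is a minimal Python sketch of how one annotated query might be represented. It is not the dataset's actual schema: the class name `QueryAnnotation`, the field names, the helper `top_highlights`, the encoding of the five-point scale as integers 1 to 5, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueryAnnotation:
    """Hypothetical record layout; names are illustrative, not the dataset's schema."""
    query: str                          # human-written free-form NL query
    moments: List[Tuple[float, float]]  # (start, end) in seconds for each relevant moment
    clip_scores: List[int]              # one saliency score per query-relevant clip

    def __post_init__(self) -> None:
        for start, end in self.moments:
            if not 0 <= start < end:
                raise ValueError(f"invalid moment: ({start}, {end})")
        for score in self.clip_scores:
            # Assumption: the five-point scale is encoded as integers 1..5.
            if not 1 <= score <= 5:
                raise ValueError(f"saliency score out of range: {score}")


def top_highlights(ann: QueryAnnotation, k: int = 1) -> List[int]:
    """Return indices of the k highest-scoring clips (ties go to the earlier clip)."""
    order = sorted(range(len(ann.clip_scores)),
                   key=lambda i: (-ann.clip_scores[i], i))
    return order[:k]


# Made-up example values, not taken from the dataset.
example = QueryAnnotation(
    query="A person cooks pasta in a kitchen",
    moments=[(10.0, 24.0), (40.0, 52.0)],
    clip_scores=[2, 4, 5, 3, 4],
)
print(top_highlights(example, k=2))
```

Keeping moments and per-clip saliency scores in the same record mirrors the abstract's point that a single annotation supports both moment detection and highlight detection; the released data at the repository above may organize these fields differently.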

updated: Tue Jul 20 2021 16:42:58 GMT+0000 (UTC)

published: Tue Jul 20 2021 16:42:58 GMT+0000 (UTC)

arXiv
