Learning Pixel-Level Distinctions for Video Highlight Detection

Fanyue Wei; Biao Wang; Tiezheng Ge; Yuning Jiang; Wen Li; Lixin Duan

The goal of video highlight detection is to select the most attractive segments from a long video, so as to depict its most interesting parts. Existing methods typically focus on modeling the relationships between different video segments in order to learn a model that can assign highlight scores to these segments; however, these approaches do not explicitly consider the contextual dependencies within individual segments. To this end, we propose to learn pixel-level distinctions to improve video highlight detection. A pixel-level distinction indicates whether or not each pixel in a video belongs to an interesting section. The advantages of modeling such fine-grained distinctions are twofold. First, it allows us to exploit the temporal and spatial relations of the content in a video, since the distinction of a pixel in a frame depends heavily on both the content preceding that frame and the content surrounding that pixel within the frame. Second, learning pixel-level distinctions also helps explain highlight predictions, since it indicates which content in a highlight segment is attractive to people. We design an encoder-decoder network to estimate the pixel-level distinctions, in which we leverage 3D convolutional neural networks to exploit temporal context information and further take advantage of visual saliency to model the spatial distinction. State-of-the-art performance on three public benchmarks validates the effectiveness of our framework for video highlight detection.
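To make the idea of a pixel-level distinction more concrete, the following is a toy sketch in plain Python, not the authors' implementation. It assumes, purely for illustration, that per-pixel targets are formed by keeping a saliency value inside highlight frames and zero elsewhere, and that a segment's highlight score is the mean of its per-pixel values. The function names `pixel_level_targets` and `segment_score`, the target construction, and the mean aggregation are all hypothetical choices that the abstract does not specify.

```python
from typing import List

# One frame is an H x W grid of values in [0, 1].
Frame = List[List[float]]


def pixel_level_targets(saliency: List[Frame], is_highlight: List[bool]) -> List[Frame]:
    """Hypothetical target construction (an assumption, not from the paper):
    keep the saliency value for frames marked as highlight, zero otherwise."""
    return [
        [[value if flag else 0.0 for value in row] for row in frame]
        for frame, flag in zip(saliency, is_highlight)
    ]


def segment_score(distinctions: List[Frame]) -> float:
    """Hypothetical aggregation (an assumption): average the per-pixel
    distinctions of a segment to obtain one highlight score."""
    values = [v for frame in distinctions for row in frame for v in row]
    return sum(values) / len(values) if values else 0.0


# Two 2x2 frames: the first belongs to a highlight, the second does not.
saliency = [
    [[0.2, 0.8], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
targets = pixel_level_targets(saliency, [True, False])
score = segment_score(targets)
```

In this toy setup the non-highlight frame contributes only zeros, so the segment score reflects how much salient content falls inside the highlight frames. In the network described in the abstract, such per-pixel values would instead be estimated by an encoder-decoder with 3D convolutions.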

updated: Sun Apr 10 2022 06:41:16 GMT+0000 (UTC)

published: Sun Apr 10 2022 06:41:16 GMT+0000 (UTC)

arXiv
