AudioVisual Video Summarization

Bin Zhao; Maoguo Gong; Xuelong Li

Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, existing approaches exploit only the visual information while neglecting the audio information. In this paper, we argue that the audio modality can assist the visual modality in better understanding the video content and structure, and can thereby benefit the summarization process. Motivated by this, we propose to jointly exploit audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream LSTM encodes the audio and visual features sequentially by capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, SumMe and TVsum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
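
The three parts described above can be illustrated with a small data-flow sketch. This is not the authors' implementation: the abstract gives no layer sizes, training objective, or details of how the parts are wired together, so everything below is an assumption made for illustration. In particular, the function names (`run_lstm`, `self_attention`, `avrn_sketch`), the random untrained weights, the bias-free LSTM, applying self-attention to the raw visual features with identity projections, concatenating the fused and global representations, and producing one importance score per frame are all hypothetical choices.

```python
import math
import random


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rng, rows, cols, scale=0.1):
    # Random, untrained weights: an assumption used only to show the data flow.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def run_lstm(seq, W, hidden):
    # Standard LSTM recurrence without biases. W maps the concatenation
    # [x; h] to the four gate pre-activations (input, forget, output, candidate).
    h = [0.0] * hidden
    c = [0.0] * hidden
    states = []
    for x in seq:
        z = matvec(W, x + h)
        i = [sigmoid(v) for v in z[:hidden]]
        f = [sigmoid(v) for v in z[hidden:2 * hidden]]
        o = [sigmoid(v) for v in z[2 * hidden:3 * hidden]]
        g = [math.tanh(v) for v in z[3 * hidden:]]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        states.append(h)
    return states


def self_attention(seq):
    # Scaled dot-product self-attention with identity query/key/value
    # projections (assumption): every frame attends to every other frame.
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def avrn_sketch(audio, visual, hidden=8, seed=0):
    rng = random.Random(seed)
    W_a = rand_matrix(rng, 4 * hidden, len(audio[0]) + hidden)
    W_v = rand_matrix(rng, 4 * hidden, len(visual[0]) + hidden)
    W_f = rand_matrix(rng, 4 * hidden, 2 * hidden + hidden)
    # 1) Two-stream LSTM: encode each modality separately over time.
    h_a = run_lstm(audio, W_a, hidden)
    h_v = run_lstm(visual, W_v, hidden)
    # 2) Fusion LSTM: consume both streams' hidden states at each step.
    fused = run_lstm([a + v for a, v in zip(h_a, h_v)], W_f, hidden)
    # 3) Self-attention encoder: global dependency across all frames.
    global_ctx = self_attention(visual)
    # Prediction: one score in (0, 1) per frame from both representations.
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden + len(visual[0]))]
    return [sigmoid(sum(w * x for w, x in zip(w_out, f + g)))
            for f, g in zip(fused, global_ctx)]


# Toy input: 6 frames with 4-dim audio and 5-dim visual features (made up).
rng = random.Random(1)
audio = [[rng.random() for _ in range(4)] for _ in range(6)]
visual = [[rng.random() for _ in range(5)] for _ in range(6)]
scores = avrn_sketch(audio, visual)
print(len(scores))
```

With untrained weights the scores carry no meaning; the sketch only shows how per-frame audio and visual sequences could pass through the two recurrent stages and the attention stage before a joint prediction. A practical version would use a deep learning framework, learned projections, and supervision from annotated summaries.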

updated: Mon May 17 2021 08:36:10 GMT+0000 (UTC)

published: Mon May 17 2021 08:36:10 GMT+0000 (UTC)

arXiv
