Reconstructive Sequence-Graph Network for Video Summarization

Bin Zhao; Haopeng Li; Xiaoqiang Lu; Xuelong Li

Exploiting inner-shot and inter-shot dependencies is essential for key-shot based video summarization. Current approaches are mainly devoted to modeling the video as a frame sequence with recurrent neural networks. However, one potential limitation of sequence models is that they focus on capturing local neighborhood dependencies, while high-order dependencies over long distances are not fully exploited. In general, the frames in each shot record a certain activity and vary smoothly over time, whereas multi-hop relationships occur frequently among shots. Both local and global dependencies are therefore important for understanding the video content. Motivated by this observation, we propose a Reconstructive Sequence-Graph Network (RSGN) that hierarchically encodes the frames as a sequence and the shots as a graph: the frame-level dependencies are encoded by Long Short-Term Memory (LSTM), and the shot-level dependencies are captured by a Graph Convolutional Network (GCN). The videos are then summarized by exploiting both the local and global dependencies among shots. In addition, a reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner, which avoids the shortage of annotated data in video summarization. Furthermore, under the guidance of the reconstruction loss, the predicted summary better preserves the main video content and the shot-level dependencies. Experimental results on three popular datasets (i.e., SumMe, TVsum and VTW) demonstrate the superiority of the proposed approach on the summarization task.
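To make the shot-level graph step concrete, the following is a minimal sketch of a single graph-convolution layer over shot features, written in plain Python. It is not the authors' implementation. The abstract does not say how the shot graph is built, so the cosine-similarity adjacency, the symmetric normalization, the ReLU activation, and all function names here (`shot_adjacency`, `normalize`, `gcn_layer`) are assumptions chosen for illustration. The shot features, which in RSGN would come from the frame-level LSTM, are simply taken as given vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def shot_adjacency(shots):
    """Dense shot graph (assumed construction): edge weight is the cosine
    similarity clipped at zero, with a self-loop of weight 1 on each shot."""
    n = len(shots)
    return [[1.0 if i == j else max(0.0, cosine(shots[i], shots[j]))
             for j in range(n)] for i in range(n)]


def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; degrees are >= 1
    because every node has a self-loop."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(shots, weight):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    a_hat = normalize(shot_adjacency(shots))
    out = matmul(matmul(a_hat, shots), weight)
    return [[max(0.0, x) for x in row] for row in out]


if __name__ == "__main__":
    # Three toy shot features; shots 0 and 2 are identical, shot 1 is orthogonal.
    shots = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gcn_layer(shots, identity))
```

In this sketch, shots that are far apart in time but similar in content exchange information in a single step, which is the kind of long-range, shot-level dependency that the abstract contrasts with purely sequential modeling.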

updated: Mon May 10 2021 01:47:55 GMT+0000 (UTC)

published: Mon May 10 2021 01:47:55 GMT+0000 (UTC)

arXiv
