Deep Unsupervised Key Frame Extraction for Efficient Video Classification

Hao Tang; Lei Ding; Songsong Wu; Bin Ren; Nicu Sebe; Paolo Rota

Video processing and analysis have become urgent tasks, as a huge number of videos are uploaded online every day (e.g., to YouTube and Hulu). Extracting representative key frames from videos is important in video processing and analysis because it greatly reduces the computing resources and time required. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods do not balance performance and efficiency well. To tackle this problem, this work presents an unsupervised method for retrieving key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a generic and powerful framework with two advantages over previous works: it calculates the number of key frames automatically, and it preserves the temporal information of the video. It thus improves the efficiency of video classification. Furthermore, a Long Short-Term Memory (LSTM) network is added on top of the CNN to further improve classification performance. Moreover, a weight fusion strategy for different input networks is presented to boost performance. By optimizing video classification and key frame extraction simultaneously, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets, HMDB51 and UCF101, and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
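The abstract does not spell out how TSDPC works, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each frame is already represented by a feature vector (for example, from a CNN), splits the video into equal-length temporal segments, and runs standard density peaks clustering in each segment: a local density rho, a distance delta to the nearest denser frame, and a score gamma = rho * delta. The function names, the cutoff quantile, and the "mean plus one standard deviation" rule for choosing the number of key frames are all hypothetical choices made for illustration.

```python
import math
import statistics


def density_peak_indices(features, dc_quantile=0.2):
    """Return local indices of density-peak frames within one segment."""
    n = len(features)
    if n == 1:
        return [0]
    dist = [[math.dist(a, b) for b in features] for a in features]

    # Cutoff distance d_c: a low quantile of all pairwise distances (assumed).
    pairs = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    dc = max(pairs[int(dc_quantile * (len(pairs) - 1))], 1e-12)

    def kernel(d):
        x = d / dc
        return math.exp(-x * x)

    # Local density rho_i with a Gaussian kernel, excluding the frame itself.
    rho = [sum(kernel(d) for d in row) - 1.0 for row in dist]

    # delta_i: distance to the nearest frame that ranks higher in density.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        delta[i] = min(dist[i][j] for j in order[:rank])

    # Frames that are both dense and far from denser frames score highest.
    gamma = [r * d for r, d in zip(rho, delta)]

    # Hypothetical rule for picking the number of key frames automatically:
    # keep frames whose score stands out from the rest of the segment.
    threshold = statistics.mean(gamma) + statistics.pstdev(gamma)
    peaks = [i for i in range(n) if gamma[i] > threshold]
    return peaks or [max(range(n), key=gamma.__getitem__)]


def extract_key_frames(features, n_segments=3):
    """Return key frame indices in temporal order, at least one per segment."""
    n = len(features)
    key_frames = []
    for s in range(n_segments):
        start, stop = s * n // n_segments, (s + 1) * n // n_segments
        if start == stop:
            continue  # more segments than frames
        local = density_peak_indices(features[start:stop])
        key_frames.extend(start + i for i in local)
    return key_frames
```

Because clustering runs independently inside each contiguous segment and the results are concatenated in segment order, the selected frames keep their temporal order, which is one plausible way to read the "preserves temporal information" claim. The number of key frames is not fixed in advance: it depends on how many frames pass the score threshold in each segment.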

updated: Sat Nov 12 2022 20:45:35 GMT+0000 (UTC)

published: Sat Nov 12 2022 20:45:35 GMT+0000 (UTC)

arXiv
