Multi-Stream Dynamic Video Summarization

Mohamed Elfeki; Liqiang Wang; Ali Borji

With vast amounts of video content being uploaded to the Internet every minute, video summarization becomes critical for efficient browsing, searching, and indexing of visual content. Nonetheless, the spread of social and egocentric cameras creates an abundance of sparse scenarios captured by several devices, which ultimately need to be summarized jointly. In this paper, we discuss the problem of summarizing videos recorded independently by several dynamic cameras that intermittently share the field of view. We present a robust framework that (a) identifies a diverse set of important events among moving cameras that often are not capturing the same scene, and (b) selects the most representative view(s) at each event to be included in a universal summary. Due to the lack of an applicable alternative, we collected a new multi-view egocentric dataset, Multi-Ego. Our dataset is recorded simultaneously by three cameras and covers a wide variety of real-life scenarios. The footage is annotated by multiple individuals under various summarization configurations, with a consensus analysis ensuring a reliable ground truth. We conduct extensive experiments on the compiled dataset and on three other standard benchmarks; the results show the robustness and advantage of our approach in both supervised and unsupervised settings. Additionally, we show that our approach learns collectively from data with varying numbers of views and is orthogonal to other summarization methods, which makes it scalable and generic.

updated: Thu Oct 14 2021 21:52:40 GMT+0000 (UTC)

published: Sat Dec 01 2018 00:44:12 GMT+0000 (UTC)

arXiv
