Learning to Summarize Videos by Contrasting Clips

Ivan Sosnovik; Artem Moskalev; Cees Kaandorp; Arnold Smeulders

Video summarization aims to choose the parts of a video that narrate a story as close as possible to the original one. Most existing video summarization approaches focus on hand-crafted labels. As the number of videos grows exponentially, there is an increasing need for methods that can learn meaningful summarizations without labeled annotations. In this paper, we aim to maximally exploit unsupervised video summarization while concentrating the supervision on a few personalized labels as an add-on. To do so, we formulate the key requirements for informative video summarization. Then, we propose contrastive learning as the answer to both questions. To further boost Contrastive video Summarization (CSUM), we propose to contrast top-k features, selected with a differentiable top-k feature selector, instead of the mean video feature employed by the existing method. Our experiments on several benchmarks demonstrate that our approach allows for meaningful and diverse summaries when no labeled data is provided.
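The abstract does not say how the differentiable top-k selector or the contrastive objective is implemented. The following is only a minimal sketch of the general idea, under stated assumptions: the sigmoid relaxation in `soft_top_k_weights`, the InfoNCE-style loss in `info_nce`, and all function names and temperature values are hypothetical illustrations, not the authors' method. It contrasts a mean-pooled video feature with one pooled from the k highest-scoring frames.

```python
import math


def mean_pool(features):
    """Average all frame features into a single video-level vector."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def soft_top_k_weights(scores, k, temperature=0.05):
    """Sigmoid relaxation of a hard top-k mask (an assumed stand-in,
    not the selector used in the paper).

    The threshold sits halfway between the k-th and (k+1)-th largest
    score, so weights approach 1 for the top-k frames and 0 for the
    rest as the temperature goes to 0.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ranked = sorted(scores, reverse=True)
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    # 0.5 * (1 + tanh(x / 2)) equals sigmoid(x) and cannot overflow.
    return [0.5 * (1.0 + math.tanh((s - threshold) / (2.0 * temperature)))
            for s in scores]


def weighted_pool(features, weights):
    """Weighted average of frame features using the soft top-k weights."""
    total = sum(weights)
    return [sum(w * f for w, f in zip(weights, col)) / total
            for col in zip(*features)]


def cosine(u, v):
    """Cosine similarity with a small floor to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the anchor is closer to the
    positive than to every negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy video: four frames with 2-d features and per-frame importance scores.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
scores = [0.1, 2.0, 0.3, 1.5]

weights = soft_top_k_weights(scores, k=2)
top_k_feature = weighted_pool(frames, weights)  # dominated by frames 1 and 3
mean_feature = mean_pool(frames)                # blends all four frames
```

In this toy case the mean feature mixes both kinds of frame equally, while the top-k feature stays close to the two highest-scoring frames, which is the distinction the abstract draws between the two pooling choices.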

updated: Wed Apr 19 2023 12:09:12 GMT+0000 (UTC)

published: Thu Jan 12 2023 18:55:30 GMT+0000 (UTC)

arXiv
