Self-Attention Based Generative Adversarial Networks For Unsupervised Video Summarization

Maria Nektaria Minaidi; Charilaos Papaioannou; Alexandros Potamianos

In this paper, we study the problem of producing a comprehensive video summary following an unsupervised approach that relies on adversarial learning. We build on a popular method in which a Generative Adversarial Network (GAN) is trained to create representative summaries that are indistinguishable from the originals. Introducing an attention mechanism into the architecture for the selection, encoding and decoding of video frames shows the efficacy of self-attention and transformers in modeling temporal relationships for video summarization. We propose the SUM-GAN-AED model, which uses a self-attention mechanism for frame selection combined with LSTMs for encoding and decoding. We evaluate the performance of SUM-GAN-AED on the SumMe, TVSum and COGNIMUSE datasets. Experimental results indicate that using self-attention as the frame selection mechanism outperforms the state of the art on SumMe and achieves performance comparable to the state of the art on TVSum and COGNIMUSE.
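To make "self-attention for frame selection" concrete, here is a minimal sketch of a generic single-head scaled dot-product self-attention layer that turns per-frame feature vectors into importance scores. This is not the authors' SUM-GAN-AED code. The single head, the sigmoid output layer, the random untrained weights, and every function name (`self_attention`, `frame_scores`, `rand_matrix`) are hypothetical assumptions made for illustration. It is written in pure Python to stay dependency-free. A real model would use a tensor library and learn the weights.

```python
import math
import random

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def self_attention(features, w_q, w_k, w_v):
    """Scaled dot-product self-attention over T frame feature vectors."""
    q, k, v = matmul(features, w_q), matmul(features, w_k), matmul(features, w_v)
    scale = math.sqrt(len(w_q[0]))
    logits = matmul(q, transpose(k))  # T x T frame-to-frame similarities
    attn = [softmax([s / scale for s in row]) for row in logits]
    return matmul(attn, v), attn

def frame_scores(features, w_q, w_k, w_v, w_out):
    """Hypothetical head: map each attended frame vector to a score in (0, 1)."""
    context, attn = self_attention(features, w_q, w_k, w_v)
    return [sigmoid(sum(c * w for c, w in zip(row, w_out))) for row in context], attn

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy demo: 6 frames with 4-dim features; all weights are random (untrained).
rng = random.Random(0)
T, D, DK = 6, 4, 3
features = rand_matrix(T, D, rng)
w_q, w_k, w_v = (rand_matrix(D, DK, rng) for _ in range(3))
w_out = [rng.gauss(0.0, 0.5) for _ in range(DK)]
scores, attn = frame_scores(features, w_q, w_k, w_v, w_out)

# Weight each frame's features by its importance score (assumed downstream input).
weighted = [[s * x for x in f] for s, f in zip(scores, features)]
```

Because every frame attends to every other frame, each score can depend on the whole sequence rather than only on nearby frames, which is the temporal-modeling property the abstract attributes to self-attention. In the setting the abstract describes, the scored features would then go to the LSTM encoder and decoder, and the whole model would be trained adversarially. Neither step is shown here.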

updated: Sun Jul 16 2023 19:56:13 GMT+0000 (UTC)

published: Sun Jul 16 2023 19:56:13 GMT+0000 (UTC)

arXiv