Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection

Xinyang Feng; Dongjin Song; Yuncong Chen; Zhengzhang Chen; Jingchao Ni; Haifeng Chen

Detecting abnormal activities in real-world surveillance videos is an important yet challenging task because prior knowledge about video anomalies is usually limited or unavailable. Although many approaches have been developed to address this problem, few of them can capture the normal spatio-temporal patterns effectively and efficiently. Moreover, existing work seldom explicitly considers the local consistency at the frame level and the global coherence of temporal dynamics in video sequences. To this end, we propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection. Specifically, we first present a convolutional transformer to perform future frame prediction. It contains three key components: a convolutional encoder to capture the spatial information of the input video clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame. Next, a dual discriminator based adversarial training procedure is employed to enhance the future frame prediction. It jointly considers an image discriminator that can maintain the local consistency at the frame level and a video discriminator that can enforce the global coherence of temporal dynamics. Finally, the prediction error is used to identify abnormal video frames. Thorough empirical studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus, demonstrate the effectiveness of the proposed adversarial spatio-temporal modeling framework.
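
The final step of the abstract, using the prediction error to identify abnormal frames, can be illustrated with a minimal sketch. The abstract does not say how the error is turned into a score, so everything here is an assumption made for illustration: the per-frame mean squared error, the per-video min-max normalization, the 0.5 threshold, and the hypothetical helper names `frame_mse`, `anomaly_scores`, and `flag_anomalies`. It is not the authors' implementation.

```python
from typing import List, Sequence

# Assumption: a grayscale frame is given as rows of pixel intensities.
Frame = Sequence[Sequence[float]]


def frame_mse(predicted: Frame, actual: Frame) -> float:
    """Mean squared error between a predicted frame and the observed frame."""
    total = 0.0
    count = 0
    for pred_row, true_row in zip(predicted, actual):
        for p, t in zip(pred_row, true_row):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one pixel")
    return total / count


def anomaly_scores(errors: Sequence[float]) -> List[float]:
    """Min-max normalize per-frame errors within one video to [0, 1].

    Assumption: a larger prediction error means a more anomalous frame.
    """
    lo, hi = min(errors), max(errors)
    if hi == lo:
        # All frames are predicted equally well; nothing stands out.
        return [0.0 for _ in errors]
    return [(e - lo) / (hi - lo) for e in errors]


def flag_anomalies(errors: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Mark frames whose normalized score exceeds an illustrative threshold."""
    return [score > threshold for score in anomaly_scores(errors)]
```

In this sketch, `frame_mse` would be called once per predicted future frame, and the resulting error sequence for a video passed to `flag_anomalies`. In practice the threshold would be tuned on held-out data, or the raw scores evaluated directly with a ranking metric.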

updated: Thu Jul 29 2021 03:07:25 GMT+0000 (UTC)

published: Thu Jul 29 2021 03:07:25 GMT+0000 (UTC)

arXiv
