Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests

Siqi Wang; Guang Yu; Zhiping Cai; Xinwang Liu; En Zhu; Jianping Yin

Although deep neural networks (DNNs) have enabled great progress in video abnormal event detection (VAD), existing solutions typically suffer from two issues: (1) The localization of video events cannot be both precise and comprehensive. (2) The semantics and temporal context are under-explored. To tackle these issues, we are motivated by the cloze test, which is prevalent in education, and propose a novel approach named Visual Cloze Completion (VCC), which conducts VAD by learning to complete "visual cloze tests" (VCTs). Specifically, VCC first localizes each video event and encloses it in a spatio-temporal cube (STC). To achieve both precise and comprehensive localization, appearance and motion are used as complementary cues to mark the object region associated with each event. For each marked region, a normalized patch sequence is extracted from the current and adjacent frames and stacked into an STC. Treating each patch of an STC as a visual "word" and the whole patch sequence as a visual "sentence", we deliberately erase a certain "word" (patch) to yield a VCT. The VCT is then completed by training DNNs to infer the erased patch and its optical flow via video semantics. Meanwhile, VCC fully exploits temporal context by erasing each patch in the temporal context in turn, creating multiple VCTs. Furthermore, we propose localization-level, event-level, model-level and decision-level solutions to enhance VCC, which further exploit VCC's potential and produce significant performance gains. Extensive experiments demonstrate that VCC achieves state-of-the-art VAD performance. Our code and results are available at https://github.com/yuguangnudt/VEC_VAD/tree/VCC.
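
The way multiple VCTs are built from one STC can be illustrated with a minimal sketch. This is not the authors' implementation: the function name make_visual_cloze_tests and the use of strings as stand-ins for image patches are assumptions made for illustration only, and the DNN that infers the erased patch and its optical flow is omitted. The sketch only shows the erasing step, in which each patch of the sequence is removed in turn so that the remaining patches serve as context and the removed patch serves as the completion target.

```python
from typing import List, Sequence, Tuple, TypeVar

Patch = TypeVar("Patch")  # placeholder for a normalized image patch


def make_visual_cloze_tests(
    stc: Sequence[Patch],
) -> List[Tuple[int, List[Patch], Patch]]:
    """Build one cloze test per temporal position of an STC (hypothetical helper).

    Each test is (erased_index, context_patches, erased_patch): the patch at
    erased_index is removed, the rest of the sequence is kept as context, and
    the removed patch is the target a model would be trained to infer.
    """
    tests = []
    for i, target in enumerate(stc):
        # Keep every patch except the erased one, preserving temporal order.
        context = [p for j, p in enumerate(stc) if j != i]
        tests.append((i, context, target))
    return tests


# Toy STC of five patches; strings stand in for image arrays.
stc = ["p0", "p1", "p2", "p3", "p4"]
vcts = make_visual_cloze_tests(stc)
```

With a five-patch STC this yields five VCTs, one per erased position. In a real pipeline each patch would be an image array and each VCT would be paired with an optical-flow target as well, as the abstract describes.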

updated: Fri Sep 17 2021 02:27:02 GMT+0000 (UTC)

published: Thu Aug 05 2021 04:05:36 GMT+0000 (UTC)

arXiv
