Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video

Jie Wu; Wei Zhang; Guanbin Li; Wenhao Wu; Xiao Tan; Yingying Li; Errui Ding; Liang Lin

In this paper, we introduce a novel task, referred to as Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically, given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event, with only coarse video-level annotations as supervision during training. To address this challenging task, we propose a dual-branch network that takes as input proposals at multiple granularities in both the spatial and temporal domains. Each branch employs a relationship reasoning module to capture the correlation between tubes/videolets, which provides rich contextual information and complex entity relationships for the concept learning of abnormal behaviors. A Mutually-guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches. It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework. Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA, consisting of videos with spatio-temporal anomaly annotations, to serve as benchmarks for WSSTAD. We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach and to analyze the key factors that contribute most to handling this task.
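To make the notion of a spatio-temporal tube concrete, the following is a minimal sketch of one way to represent a tube as a mapping from frame index to bounding box, together with a simple overlap score between two tubes. It is an illustration only: the names `Tube`, `box_iou`, and `tube_iou`, the `(x1, y1, x2, y2)` box format, and the particular overlap definition are assumptions made here, not the representation or evaluation protocol used by the authors.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical tube container: one bounding box per frame index."""
    boxes: Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(p: Tube, q: Tube) -> float:
    """Sum of per-frame box IoU over shared frames, divided by the
    number of frames covered by either tube (an assumed definition)."""
    all_frames = set(p.boxes) | set(q.boxes)
    if not all_frames:
        return 0.0
    shared = set(p.boxes) & set(q.boxes)
    total = sum(box_iou(p.boxes[t], q.boxes[t]) for t in shared)
    return total / len(all_frames)
```

Under this definition, two tubes score highly only if they overlap both in time (many shared frames) and in space (high per-frame box IoU), which mirrors the requirement that a localized tube enclose the abnormal event in both domains.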

updated: Mon Aug 09 2021 06:11:14 GMT+0000 (UTC)

published: Mon Aug 09 2021 06:11:14 GMT+0000 (UTC)

arXiv
