Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM

Zahidul Islam; Mohammad Rukonuzzaman; Raiyan Ahmed; Md. Hasanul Kabir; Moshiur Farazi

Automatically detecting violence in surveillance footage is a subset of activity recognition that deserves special attention because of its wide applicability in unmanned security monitoring systems, internet video filtering, and similar applications. In this work, we propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet, where one stream takes background-suppressed frames as input and the other stream processes the differences between adjacent frames. We employ simple and fast input pre-processing techniques that highlight the moving objects in the frames by suppressing non-moving backgrounds and that capture the motion between frames. Because violent actions are mostly characterized by body movements, these inputs help produce discriminative features. SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution, which enables it to produce robust long-range spatio-temporal features while using substantially fewer parameters. We experimented with three fusion methods to combine the output feature maps of the two streams. The proposed methods were evaluated on three standard public datasets. Our model exceeds the previous best accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2% while matching state-of-the-art results on the smaller datasets. Our experiments lead us to conclude that the proposed models are superior in terms of both computational efficiency and detection accuracy.
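To make the parameter saving from depthwise separable convolutions concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not the authors' implementation. It assumes a ConvLSTM with four gates, each using one input-to-state and one state-to-state convolution, and it ignores peephole connections. The function names and the sizes used (3x3 kernels, 256 input channels, 64 hidden channels) are hypothetical and chosen only for illustration.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def sep_conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a depthwise separable convolution:
    a k x k depthwise filter per input channel, then a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise + (c_out if bias else 0)


def lstm_params(param_fn, k, c_in, c_hidden):
    """Total parameters of the four gates (input, forget, output, candidate).
    Each gate has an input-to-state conv (with bias) and a
    state-to-state conv (without bias). Peephole terms are ignored."""
    per_gate = (param_fn(k, c_in, c_hidden, bias=True)
                + param_fn(k, c_hidden, c_hidden, bias=False))
    return 4 * per_gate


# Hypothetical sizes, not taken from the paper.
K, C_IN, C_HIDDEN = 3, 256, 64

conv_lstm_total = lstm_params(conv_params, K, C_IN, C_HIDDEN)
sep_conv_lstm_total = lstm_params(sep_conv_params, K, C_IN, C_HIDDEN)

print("ConvLSTM parameters:   ", conv_lstm_total)
print("SepConvLSTM parameters:", sep_conv_lstm_total)
print("Reduction factor:       %.2f" % (conv_lstm_total / sep_conv_lstm_total))
```

Under these assumed sizes the separable variant needs roughly one eighth of the parameters. The exact ratio depends on the kernel size and channel counts, so it should not be read as a figure reported by the paper.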

updated: Sun Apr 18 2021 10:14:39 GMT+0000 (UTC)

published: Sun Feb 21 2021 12:01:48 GMT+0000 (UTC)

arXiv
