An Attention-guided Multistream Feature Fusion Network for Localization of Risky Objects in Driving Videos

Muhammad Monjurul Karim; Ruwen Qin; Zhaozheng Yin

Detecting dangerous traffic agents in videos captured by vehicle-mounted dashboard cameras (dashcams) is essential for safe navigation in complex environments. Accident-related videos make up only a small fraction of driving video data, and the transient pre-accident processes are highly dynamic and complex. Moreover, risky and non-risky traffic agents can look alike. These factors make risky object localization in driving videos particularly challenging. To address this, the paper proposes an attention-guided multistream feature fusion network (AM-Net) to localize dangerous traffic agents in dashcam videos. Two Gated Recurrent Unit (GRU) networks use object bounding box and optical flow features extracted from consecutive video frames to capture spatio-temporal cues for distinguishing dangerous traffic agents. An attention module coupled with the GRUs learns to attend to the traffic agents relevant to an accident. By fusing the two feature streams, AM-Net predicts riskiness scores for the traffic agents in the video. To support this study, the paper also introduces a benchmark dataset called Risky Object Localization (ROL). The dataset contains spatial, temporal, and categorical annotations with accident-, object-, and scene-level attributes. AM-Net achieves 85.73% AUC on the ROL dataset and outperforms the current state of the art for video anomaly detection by 6.3% AUC on the DoTA dataset. A thorough ablation study further demonstrates AM-Net's merits by evaluating the contributions of its individual components.
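The two-stream idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the names `GRUCell` and `risk_scores`, the random untrained weights, the bias-free GRU, the softmax attention over objects, and fusion by concatenation are all assumptions made for illustration. It also assumes the same tracked objects appear in every frame.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Minimal GRU cell without biases (hypothetical helper, untrained)."""

    def __init__(self, in_dim, hid_dim, rng):
        # Each weight matrix acts on the concatenation [input, hidden].
        self.Wz = rand_matrix(hid_dim, in_dim + hid_dim, rng)
        self.Wr = rand_matrix(hid_dim, in_dim + hid_dim, rng)
        self.Wh = rand_matrix(hid_dim, in_dim + hid_dim, rng)

    def step(self, x, h):
        xh = list(x) + list(h)
        z = [sigmoid(dot(row, xh)) for row in self.Wz]  # update gate
        r = [sigmoid(dot(row, xh)) for row in self.Wr]  # reset gate
        xrh = list(x) + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(dot(row, xrh)) for row in self.Wh]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def risk_scores(box_seq, flow_seq, hid_dim=4, seed=0):
    """box_seq[t][i] / flow_seq[t][i]: features of object i at frame t.

    Returns one riskiness score in (0, 1) per object per frame.
    """
    rng = random.Random(seed)
    n_obj = len(box_seq[0])
    box_gru = GRUCell(len(box_seq[0][0]), hid_dim, rng)    # bounding-box stream
    flow_gru = GRUCell(len(flow_seq[0][0]), hid_dim, rng)  # optical-flow stream
    attn_w = [rng.uniform(-0.5, 0.5) for _ in range(2 * hid_dim)]
    out_w = [rng.uniform(-0.5, 0.5) for _ in range(2 * hid_dim)]
    hb = [[0.0] * hid_dim for _ in range(n_obj)]
    hf = [[0.0] * hid_dim for _ in range(n_obj)]
    scores = []
    for boxes, flows in zip(box_seq, flow_seq):
        hb = [box_gru.step(x, h) for x, h in zip(boxes, hb)]
        hf = [flow_gru.step(x, h) for x, h in zip(flows, hf)]
        fused = [b + f for b, f in zip(hb, hf)]  # fuse streams by concatenation
        # Attention over the objects in this frame (assumed form).
        alpha = softmax([dot(attn_w, v) for v in fused])
        frame = []
        for a, v in zip(alpha, fused):
            weighted = [a * x for x in v]
            frame.append(sigmoid(dot(out_w, weighted)))
        scores.append(frame)
    return scores
```

Calling `risk_scores(boxes, flows)` with three frames of two objects each returns a 3 x 2 nested list of scores. Because the weights are random, the values only show the data flow; a trained model would be needed for meaningful riskiness estimates.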

updated: Fri Sep 16 2022 13:36:28 GMT+0000 (UTC)

published: Fri Sep 16 2022 13:36:28 GMT+0000 (UTC)

arXiv
