Motion-aware Memory Network for Fast Video Salient Object Detection

Xing Zhao; Haoran Liang; Peipei Li; Guodao Sun; Dongdong Zhao; Ronghua Liang; Xiaofei He

Previous methods based on 3DCNN, convLSTM, or optical flow have achieved great success in video salient object detection (VSOD). However, they still suffer from high computational cost or from poor-quality saliency maps. To solve these problems, we design a space-time memory (STM)-based network that serves as the temporal branch of VSOD: it extracts temporal information useful for the current frame from its adjacent frames. Furthermore, previous methods consider only single-frame prediction without temporal association, so the model may not focus sufficiently on temporal information. Thus, we first introduce inter-frame object motion prediction into VSOD. Our model follows the standard encoder-decoder architecture. In the encoding stage, we generate high-level temporal features from the high-level features of the current frame and its adjacent frames. This approach is more efficient than optical flow-based methods. In the decoding stage, we propose an effective fusion strategy for the spatial and temporal branches. The semantic information in the high-level features is used to fuse the object details in the low-level features, and the spatiotemporal features are then obtained step by step to reconstruct the saliency maps. Moreover, inspired by the boundary supervision commonly used in image salient object detection (ISOD), we design a motion-aware loss for predicting object boundary motion and perform multitask learning of VSOD and object motion prediction simultaneously, which further helps the model extract spatiotemporal features accurately and maintain object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method, which achieves state-of-the-art metrics on some of them. The proposed model requires no optical flow or other preprocessing and reaches nearly 100 FPS during inference.
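
To make the temporal branch more concrete, the following is a minimal sketch of a space-time memory read: each location in the current frame attends to every location in the adjacent frames and collects an affinity-weighted sum of their features. The abstract does not specify the read operation, so the dot-product affinity and the softmax over memory locations are assumptions taken from the general STM formulation, and the names `softmax`, `memory_read`, `query_keys`, `memory_keys`, and `memory_values` are hypothetical. It uses plain Python lists rather than tensors and is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query_keys, memory_keys, memory_values):
    """Return, for each query location, a weighted sum of memory values.

    query_keys:    N vectors of length d (current frame, one per location)
    memory_keys:   M vectors of length d (adjacent frames, locations flattened)
    memory_values: M vectors of length c (same order as memory_keys)
    """
    channels = len(memory_values[0])
    output = []
    for q in query_keys:
        # Assumed affinity: dot product between the query and each memory key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[c] for w, v in zip(weights, memory_values))
            for c in range(channels)
        ])
    return output


# Toy example: one query location, two memory locations.
# The query matches the first memory key, so the first value dominates.
temporal = memory_read(
    query_keys=[[10.0, 0.0]],
    memory_keys=[[1.0, 0.0], [0.0, 1.0]],
    memory_values=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the read works on high-level feature maps with few spatial locations, the affinity computation stays small, which is consistent with the efficiency argument made against optical flow-based methods.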

updated: Mon Aug 01 2022 15:56:19 GMT+0000 (UTC)

published: Mon Aug 01 2022 15:56:19 GMT+0000 (UTC)

arXiv
