Implicit Motion Handling for Video Camouflaged Object Detection

Xuelian Cheng; Huan Xiong; Deng-Ping Fan; Yiran Zhong; Mehrtash Harandi; Tom Drummond; Zongyuan Ge

We propose a new video camouflaged object detection (VCOD) framework that exploits both short-term dynamics and long-term temporal consistency to detect camouflaged objects in video frames. An essential property of camouflaged objects is that they usually exhibit patterns similar to the background, which makes them hard to identify in still images. Effectively handling temporal dynamics in videos is therefore key to the VCOD task, because camouflaged objects become noticeable when they move. However, current VCOD methods often rely on homography or optical flow to represent motion, so detection error can accumulate from both the motion estimation error and the segmentation error. In contrast, our method unifies motion estimation and object segmentation within a single optimization framework. Specifically, we build a dense correlation volume to implicitly capture motion between neighbouring frames and use the final segmentation supervision to jointly optimize the implicit motion estimation and the segmentation. Furthermore, to enforce temporal consistency within a video sequence, we use a spatio-temporal transformer to refine the short-term predictions. Extensive experiments on VCOD benchmarks demonstrate the architectural effectiveness of our approach. We also provide a large-scale VCOD dataset named MoCA-Mask with pixel-level handcrafted ground-truth masks and construct a comprehensive VCOD benchmark with previous methods to facilitate research in this direction. Dataset Link: https://xueliancheng.github.io/SLT-Net-project.
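
As a rough illustration of what a dense correlation volume between neighbouring frames can look like, the following minimal sketch computes all-pairs cosine similarity between two feature maps with NumPy. It is not the authors' implementation: the function name `correlation_volume`, the feature shapes, the L2 normalization, and the synthetic one-pixel shift are all assumptions chosen for illustration, and a real model would compute the features with a learned backbone and train them through the segmentation loss.

```python
import numpy as np

def correlation_volume(feat_a, feat_b, eps=1e-8):
    """All-pairs cosine similarity between two feature maps of shape (C, H, W).

    Hypothetical helper for illustration. Returns an array of shape
    (H, W, H, W) where entry [i, j, k, l] compares feat_a[:, i, j]
    with feat_b[:, k, l].
    """
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, h * w)
    b = feat_b.reshape(c, h * w)
    # L2-normalize each pixel's feature vector (an assumption of this sketch).
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    corr = a.T @ b  # shape (H*W, H*W)
    return corr.reshape(h, w, h, w)

# Synthetic "frames": random features, with the second frame shifted
# one pixel to the right (wrapping around at the border).
rng = np.random.default_rng(0)
feat_t = rng.standard_normal((8, 4, 5))
feat_t1 = np.roll(feat_t, shift=1, axis=2)

vol = correlation_volume(feat_t, feat_t1)

# For pixel (2, 1) in frame t, find its best match in frame t+1.
k, l = np.unravel_index(np.argmax(vol[2, 1]), vol.shape[2:])
match = (int(k), int(l))
print(vol.shape, match)
```

In this toy setting the strongest response for each pixel sits at its displaced location, so the motion is encoded in the volume without ever computing an explicit flow field. That is the sense in which the motion is captured implicitly.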

updated: Tue Mar 15 2022 13:44:01 GMT+0000 (UTC)

published: Mon Mar 14 2022 17:55:41 GMT+0000 (UTC)

arXiv
