CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos
Weakly supervised video object localization (WSVOL) methods often rely only on visual and motion cues, which makes them susceptible to inaccurate localization. Recently, discriminative models have been explored using a temporal class activation mapping (CAM) method. Although the results are promising, that approach assumes that objects move little from frame to frame, so performance degrades over relatively long-term dependencies. This paper proposes CoLo-CAM, a novel method for WSVOL that leverages spatiotemporal information in activation maps during training without making assumptions about object position. Given a sequence of frames, localization is learned jointly and explicitly from color cues across the corresponding maps, under the assumption that an object has a similar color in adjacent frames. CAM activations are constrained to respond similarly over pixels with similar colors, which achieves co-localization. This joint learning creates direct communication among pixels across all image locations and all frames, allowing learned localizations to be transferred, aggregated, and corrected, which improves localization performance. It is achieved by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of the method and its robustness to long-term dependencies, leading to new state-of-the-art performance for WSVOL.
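To make the color-only CRF idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a relaxed Potts form of the pairwise term with a Gaussian kernel on RGB differences alone (no position term), applied to all pixels of all frames at once. The function name `color_crf_loss`, the bandwidth `sigma`, the two-class (foreground/background) setup, and the normalization by pixel count are all illustrative assumptions.

```python
import numpy as np

def color_crf_loss(frames, cams, sigma=0.1):
    """Color-only pairwise loss over a whole sequence (illustrative sketch).

    frames: (T, H, W, 3) RGB values in [0, 1]
    cams:   (T, H, W) foreground probabilities in [0, 1]
    """
    # Flatten every pixel of every frame into one set, so pixels in
    # different frames interact exactly like pixels in the same frame.
    colors = frames.reshape(-1, 3).astype(np.float64)
    fg = cams.reshape(-1).astype(np.float64)
    probs = np.stack([fg, 1.0 - fg], axis=1)  # (N, 2): foreground, background

    # Gaussian affinity on color difference only; pixel position is ignored.
    sq_dist = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)  # no self-pairs

    # Relaxed Potts penalty: similar-colored pixels with different labels cost more.
    loss = sum(probs[:, k] @ affinity @ (1.0 - probs[:, k]) for k in range(2))
    return loss / colors.shape[0]  # normalization choice is an assumption

# Toy sequence: a red object on a blue background that switches columns.
red, blue = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
frames = np.zeros((2, 2, 2, 3))
frames[0, :, 0], frames[0, :, 1] = red, blue   # frame 0: object in left column
frames[1, :, 0], frames[1, :, 1] = blue, red   # frame 1: object in right column

cams_consistent = np.zeros((2, 2, 2))
cams_consistent[0, :, 0] = 1.0                 # follows the red object
cams_consistent[1, :, 1] = 1.0

cams_inconsistent = np.zeros((2, 2, 2))
cams_inconsistent[0, :, 0] = 1.0               # correct in frame 0
cams_inconsistent[1, :, 0] = 1.0               # stuck on the old position (blue)

loss_consistent = color_crf_loss(frames, cams_consistent)
loss_inconsistent = color_crf_loss(frames, cams_inconsistent)
```

In this toy case the activation that follows the object's color across frames gets a much lower loss than the one that stays at the previous position, even though the object has moved. The dense `N x N` affinity is only practical for tiny inputs; a real training loss would need a differentiable framework and an efficient approximation of the pairwise term.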
updated: Sat Sep 02 2023 00:01:48 GMT+0000 (UTC)
published: Thu Mar 16 2023 02:29:53 GMT+0000 (UTC)