Weakly-supervised Temporal Action Localization (WTAL) has recently been studied extensively, but a large gap remains between weakly-supervised and fully-supervised models. It is practical and intuitive to annotate the temporal boundaries of a few examples and use them to help WTAL models detect actions better. However, a train-test discrepancy in the action localization strategy prevents WTAL models from leveraging such semi-supervision for further improvement. At training time, attention or multiple instance learning is used to aggregate the predictions of each snippet for video-level classification; at test time, models first obtain action score sequences over time, then extract segments whose scores exceed a fixed threshold, and finally post-process the resulting action segments. This inconsistency makes it hard to explicitly supervise the action localization model with temporal boundary annotations at training time. In this paper, we propose a Train-Test Consistent framework, TTC-Loc. At both training and test time, TTC-Loc localizes actions by comparing the scores of action classes with a predicted threshold, which enables it to be trained with semi-supervision. By fixing the train-test discrepancy, TTC-Loc significantly outperforms the state of the art on THUMOS'14, ActivityNet 1.2 and 1.3 when only video-level labels are provided for training. With full annotations of only one video per class and video-level labels for the other videos, TTC-Loc further boosts performance and achieves 33.4% mAP (IoU threshold 0.5) on THUMOS'14.
updated: Mon Mar 23 2020 02:56:39 GMT+0000 (UTC)
published: Thu Oct 24 2019 17:00:14 GMT+0000 (UTC)
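
The abstract contrasts two ways of turning per-snippet action scores into segments: cutting at a fixed threshold, and comparing each score against a predicted threshold. The sketch below illustrates only that comparison step in plain Python. It is not the TTC-Loc implementation: the function name `segments_above_threshold`, the toy scores, and the toy per-snippet thresholds are all hypothetical, and the post-processing step is omitted.

```python
from typing import List, Sequence, Tuple, Union


def segments_above_threshold(
    scores: Sequence[float],
    threshold: Union[float, Sequence[float]],
) -> List[Tuple[int, int]]:
    """Return [start, end) snippet-index ranges where score > threshold.

    `threshold` is either one fixed value for the whole video or one
    value per snippet (standing in for a model-predicted threshold).
    Hypothetical helper for illustration only.
    """
    # Expand a scalar threshold to one value per snippet.
    if isinstance(threshold, (int, float)):
        thresholds = [float(threshold)] * len(scores)
    else:
        thresholds = list(threshold)
        if len(thresholds) != len(scores):
            raise ValueError("need one threshold per snippet")

    segments: List[Tuple[int, int]] = []
    start = None  # index where the currently open segment began
    for t, (score, thr) in enumerate(zip(scores, thresholds)):
        if score > thr and start is None:
            start = t  # open a new segment
        elif score <= thr and start is not None:
            segments.append((start, t))  # close the open segment
            start = None
    if start is not None:
        # A segment that runs to the end of the video.
        segments.append((start, len(scores)))
    return segments


# Toy scores for one action class over six snippets (made-up values).
scores = [0.1, 0.7, 0.8, 0.2, 0.6, 0.9]

# Fixed threshold, as in the test-time strategy the abstract describes.
fixed_segments = segments_above_threshold(scores, 0.5)

# Per-snippet thresholds, standing in for a predicted threshold.
predicted_segments = segments_above_threshold(
    scores, [0.3, 0.3, 0.9, 0.3, 0.7, 0.7]
)
```

Because the per-snippet comparison is the same operation regardless of where the threshold comes from, a rule of this shape can in principle be applied identically at training and test time, which is the consistency the abstract argues for.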