This paper classifies human action sequences from videos using a machine translation model. In contrast to classical human action classification, which outputs a set of actions, our method outputs a sequence of actions in the chronological order in which the human performs them. Our method is therefore evaluated using sequence-based performance measures such as Bilingual Evaluation Understudy (BLEU) scores. Action sequence classification has many applications, such as learning from demonstration, action segmentation, detection, localization, and video captioning. Furthermore, we use our model, trained to output action sequences, to solve downstream tasks such as video captioning and action localization. We obtain state-of-the-art results for video captioning on the challenging Charades dataset, with a BLEU-4 score of 34.8 and a METEOR score of 33.6, outperforming the previous state of the art of 18.8 and 19.5, respectively. Similarly, on ActivityNet captioning, we obtain excellent results in terms of ROUGE (20.24) and CIDEr (37.58) scores. For action localization, without using any explicit start/end action annotations, our method obtains a localization performance of 22.2 mAP, outperforming prior fully supervised methods.
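To illustrate why an order-sensitive measure such as BLEU suits action sequences, the following is a minimal sketch of single-reference BLEU over action labels. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names `ngrams` and `bleu`, the action labels, the uniform n-gram weights, and the absence of smoothing are all choices made here for the example.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Count the contiguous n-grams of a label sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing.

    Hypothetical helper for illustration; returns 0.0 as soon as any
    n-gram order has no match.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical action labels, chosen only for this example.
reference = ["open_door", "walk", "sit", "drink"]
same_order = ["open_door", "walk", "sit", "drink"]
reversed_order = ["drink", "sit", "walk", "open_door"]

print(bleu(same_order, reference, max_n=2))
print(bleu(reversed_order, reference, max_n=2))
```

Both candidates contain the same set of actions, so a set-based measure would not distinguish them. The reversed sequence shares no bigram with the reference and therefore receives a lower score, which is the property a chronological output needs from its metric.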
updated: Mon Oct 07 2019 04:27:01 GMT+0000 (UTC)
published: Mon Oct 07 2019 04:27:01 GMT+0000 (UTC)