End-to-end Temporal Action Detection with Transformer

Xiaolong Liu; Qimeng Wang; Yao Hu; Xu Tang; Song Bai; Xiang Bai

Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. It is a fundamental task in video understanding, and significant progress has been made on it. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. Here, we construct an end-to-end framework for TAD built on the Transformer, termed TadTR, which predicts all action instances in parallel as a set of labels and temporal locations. TadTR adaptively extracts the temporal context information needed for making action predictions by selectively attending to a number of snippets in a video. It greatly simplifies the TAD pipeline and runs much faster than previous detectors. Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3. Our code will be made available at https://github.com/xlliu7/TadTR.
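
As a rough illustration of the two ideas in the abstract above (predictions emitted as a set of labels and temporal locations, and a query selectively attending to video snippets), the following sketch may help. It is not the TadTR implementation: the names `ActionInstance` and `attend`, the toy feature values, and the example labels are all hypothetical, and plain softmax dot-product attention is swapped in for simplicity because the abstract does not specify the attention mechanism.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInstance:
    """One predicted action: a label plus temporal boundaries in seconds."""
    label: str
    start: float
    end: float


def attend(query, snippets):
    """Pool snippet features with softmax dot-product attention (illustrative only)."""
    # One relevance score per snippet: dot product between query and snippet feature.
    scores = [sum(q * s for q, s in zip(query, snip)) for snip in snippets]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(sc - peak) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the snippet features.
    dim = len(query)
    context = [sum(w * snip[d] for w, snip in zip(weights, snippets)) for d in range(dim)]
    return weights, context


# Toy snippet features (hypothetical values), one vector per video snippet.
snippets = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
weights, context = attend([2.0, 0.0], snippets)

# All instances are produced together as one set, not one stage at a time.
predictions = {
    ActionInstance("high_jump", 3.2, 7.5),
    ActionInstance("long_jump", 12.0, 15.4),
}
```

In this toy setup the query puts most of its weight on the first and third snippets, whose features align with it, and little on the second. That is the sense in which attention is "selective" here.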

updated: Fri Jun 18 2021 17:58:34 GMT+0000 (UTC)

published: Fri Jun 18 2021 17:58:34 GMT+0000 (UTC)

arXiv
