End-to-end Temporal Action Detection with Transformer

Xiaolong Liu; Qimeng Wang; Yao Hu; Xu Tang; Song Bai; Xiang Bai

Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding, and significant progress has been made. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. In this paper, we propose an end-to-end framework for TAD built upon Transformer, termed TadTR, which maps a set of learnable embeddings to action instances in parallel. TadTR adaptively extracts the temporal context information required for making action predictions by selectively attending to a sparse set of snippets in a video. As a result, it simplifies the TAD pipeline and requires lower computation cost than previous detectors, while preserving remarkable detection performance. TadTR achieves state-of-the-art performance on HACS Segments (+3.35% average mAP). As a single-network detector, TadTR runs 10× faster than its comparable competitor. It outperforms existing single-network detectors by a large margin on THUMOS14 (+5.0% average mAP) and ActivityNet (+7.53% average mAP). When combined with other detectors, it reports 54.1% mAP at IoU=0.5 on THUMOS14, and 34.55% average mAP on ActivityNet-1.3. Our code will be released at https://github.com/xlliu7/TadTR.
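The results above are reported as mAP at a temporal IoU threshold (for example IoU=0.5). As a rough illustration of what that threshold means, the sketch below computes the temporal IoU of two segments and checks it against a threshold. It is not taken from the TadTR code; the function names and the `(start, end)` tuple representation are assumptions made for this example, and it ignores the class-label matching and ranking steps that a full mAP evaluation also needs.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two segments, each a hypothetical (start, end) pair in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    # Overlap length is zero when the segments do not intersect.
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_threshold(pred, gt, threshold=0.5):
    """True if the predicted segment overlaps the ground truth by at least `threshold` IoU."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    prediction = (2.0, 10.0)    # illustrative values, not from the paper
    ground_truth = (4.0, 10.0)
    print(temporal_iou(prediction, ground_truth))
    print(matches_at_threshold(prediction, ground_truth))
```

Average mAP, as used in the abstract, is typically the mean of the mAP values obtained at several such thresholds; the exact threshold set depends on the benchmark.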

updated: Wed Jul 14 2021 14:54:58 GMT+0000 (UTC)

published: Fri Jun 18 2021 17:58:34 GMT+0000 (UTC)

arXiv
