End-to-end Temporal Action Detection with Transformer

Xiaolong Liu; Qimeng Wang; Yao Hu; Xu Tang; Shiwei Zhang; Song Bai; Xiang Bai

Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and rely on hand-designed operations, such as non-maximum suppression and anchor generation, which limit flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances from that context. To adapt the Transformer to TAD, we propose three improvements that enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and the confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
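The idea of attending to a sparse set of key snippets can be illustrated with a minimal sketch. This is not the TadTR implementation: it assumes a single query and a single attention head, and it takes the sampling offsets and attention logits as plain arguments, whereas a trained model would presumably predict them from the query embedding. The function names (`linear_sample`, `temporal_deformable_attention`) and the normalized reference point are hypothetical choices made for illustration.

```python
import math


def linear_sample(features, pos):
    """Linearly interpolate a sequence of feature vectors at a fractional index.

    features: list of T feature vectors (each a list of floats).
    pos: fractional snippet index; clamped to [0, T - 1] (an assumption here).
    """
    last = len(features) - 1
    pos = min(max(pos, 0.0), float(last))
    lo = int(math.floor(pos))
    hi = min(lo + 1, last)
    w = pos - lo
    return [(1.0 - w) * a + w * b for a, b in zip(features[lo], features[hi])]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_deformable_attention(features, reference, offsets, scores):
    """Aggregate a few sampled snippets around one reference point.

    features:  list of T feature vectors for the video snippets.
    reference: normalized reference position in [0, 1] for this query.
    offsets:   K sampling offsets, in snippets (hypothetical inputs;
               a real model would learn to predict these).
    scores:    K unnormalized attention logits (likewise hypothetical inputs).
    """
    center = reference * (len(features) - 1)
    weights = softmax(scores)
    out = [0.0] * len(features[0])
    for off, w in zip(offsets, weights):
        sampled = linear_sample(features, center + off)
        out = [o + w * s for o, s in zip(out, sampled)]
    return out


# Five snippets with one-dimensional features 0..4.
video = [[0.0], [1.0], [2.0], [3.0], [4.0]]
context = temporal_deformable_attention(
    video, reference=0.5, offsets=[-1.0, 0.5], scores=[0.0, 0.0]
)
print(context)  # [1.75]
```

In this sketch the cost per query grows with the number of sampled points K rather than with the video length T, which is one way to read the abstract's claim about locality awareness and lower computation cost.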

updated: Thu Aug 11 2022 14:04:47 GMT+0000 (UTC)

published: Fri Jun 18 2021 17:58:34 GMT+0000 (UTC)

arXiv
