ETAD: A Unified Framework for Efficient Temporal Action Detection

Shuming Liu; Mengmeng Xu; Chen Zhao; Xu Zhao; Bernard Ghanem

Untrimmed video understanding tasks such as temporal action detection (TAD) often demand substantial computing resources. Because videos are long and GPU memory is limited, most action detectors can only operate on pre-extracted features rather than the original videos, and they still require heavy computation to achieve high detection performance. To alleviate this computational burden, we first propose an efficient action detector with detector proposal sampling, based on the observation that performance saturates at a small number of proposals. The detector uses several key techniques, such as LSTM-boosted temporal aggregation and cascaded proposal refinement, to achieve high detection quality at low computational cost. To enable joint optimization of this action detector and the feature encoder, we also propose encoder gradient sampling, which selectively back-propagates through video snippets and substantially reduces GPU memory consumption. With the two sampling strategies and the effective detector, we build a unified framework for efficient end-to-end temporal action detection (ETAD), making real-world untrimmed video understanding tractable. ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3. On ActivityNet-1.3, it reaches 37.78% average mAP while requiring only 6 minutes of training time and 1.23 GB of memory when using pre-extracted features. With end-to-end training, it reduces the GPU memory footprint by more than 70% compared with traditional end-to-end methods while achieving even higher performance (38.21% average mAP). The code is available at https://github.com/sming256/ETAD.
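
The two sampling ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how proposals or snippets are chosen, so uniform random sampling is assumed here, and the function names, the 16-snippet video, and the ratios are all hypothetical. The sketch only selects indices. In a real training loop, the snippets in `without_grad` would be encoded without gradient tracking, which is where the memory saving would come from.

```python
import random


def enumerate_proposals(num_snippets):
    """Return every candidate (start, end) snippet-index pair with start < end."""
    return [
        (start, end)
        for start in range(num_snippets)
        for end in range(start + 1, num_snippets + 1)
    ]


def sample_proposals(proposals, ratio, rng):
    """Keep a random subset of proposals (uniform sampling is an assumption)."""
    keep = max(1, int(len(proposals) * ratio))
    return rng.sample(proposals, keep)


def split_snippets_for_gradients(num_snippets, ratio, rng):
    """Choose which snippets would be back-propagated through.

    Returns (with_grad, without_grad); only the first group would keep
    encoder activations for the backward pass.
    """
    keep = max(1, int(num_snippets * ratio))
    with_grad = sorted(rng.sample(range(num_snippets), keep))
    chosen = set(with_grad)
    without_grad = [i for i in range(num_snippets) if i not in chosen]
    return with_grad, without_grad


rng = random.Random(0)  # fixed seed so the sketch is repeatable
proposals = enumerate_proposals(16)
kept = sample_proposals(proposals, 0.1, rng)
with_grad, without_grad = split_snippets_for_gradients(16, 0.25, rng)
print(len(proposals), len(kept), len(with_grad), len(without_grad))
```

With 16 snippets there are 136 candidate proposals, of which 13 are kept at a ratio of 0.1, and 4 of the 16 snippets are marked for back-propagation at a ratio of 0.25.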

updated: Sat May 14 2022 21:16:21 GMT+0000 (UTC)

published: Sat May 14 2022 21:16:21 GMT+0000 (UTC)

arXiv
