DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

Sauradip Nag; Xiatian Zhu; Jiankang Deng; Yi-Zhe Song; Tao Xiang

We propose a new formulation of temporal action detection (TAD) based on denoising diffusion, DiffTAD for short. Given an untrimmed long video and a set of random temporal proposals as input, it produces accurate action proposals. This offers a generative modeling perspective, in contrast to previous discriminative learning approaches. The capability is achieved by first diffusing the ground-truth proposals into random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in a Transformer decoder (e.g., DETR) by introducing a temporal location query design that converges faster in training. We further propose a cross-step selective conditioning algorithm to accelerate inference. Extensive evaluations on ActivityNet and THUMOS show that DiffTAD achieves top performance compared to prior alternatives. The code will be made available at https://github.com/sauradip/DiffusionTAD.
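To make the forward/noising process concrete, the sketch below diffuses ground-truth proposals toward random ones using the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the noise schedule or the proposal parameterization, so the cosine schedule, the (center, width) encoding in normalized coordinates, and the names `cosine_alpha_bar` and `noise_proposals` are all hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t for a cosine schedule (assumed choice)."""
    def f(u):
        return math.cos(((u / num_steps) + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def noise_proposals(proposals, t, num_steps, rng):
    """Diffuse ground-truth (center, width) proposals to step t.

    Draws x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps sampled from a standard normal for each coordinate.
    """
    a_bar = cosine_alpha_bar(t, num_steps)
    signal = math.sqrt(a_bar)
    noise = math.sqrt(max(0.0, 1.0 - a_bar))
    noisy = []
    for center, width in proposals:
        noisy.append((
            signal * center + noise * rng.gauss(0.0, 1.0),
            signal * width + noise * rng.gauss(0.0, 1.0),
        ))
    return noisy


# Hypothetical ground-truth proposals as (center, width), normalized to [0, 1].
ground_truth = [(0.30, 0.10), (0.75, 0.20)]
rng = random.Random(0)
for step in (0, 50, 100):
    print(step, noise_proposals(ground_truth, step, 100, rng))
```

At step 0 the proposals are returned unchanged, and at the final step almost no signal remains, so the proposals are essentially random. A denoising model would be trained to invert this corruption, conditioned on the video features.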

updated: Mon Mar 27 2023 00:40:52 GMT+0000 (UTC)

published: Mon Mar 27 2023 00:40:52 GMT+0000 (UTC)

arXiv
