Diffusion Action Segmentation

Daochang Liu; Qiyue Li; AnhDung Dinh; Tingting Jiang; Mubarak Shah; Chang Xu

Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. We propose a novel framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random noise with input video features as conditions. To enhance the modeling of three striking characteristics of human actions, including the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action segmentation.
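The iterative generation described in the abstract can be illustrated with a minimal sketch. It starts from Gaussian noise over a frames-by-classes score matrix and repeatedly denoises it while conditioning on per-frame video features. Everything named here is an assumption made for illustration and is not taken from the paper: `toy_denoiser` is a hypothetical stand-in for a learned network (it only applies a softmax to the features), the cosine noise schedule and the deterministic DDIM-style update are generic choices, and `mask_features` is a simple per-frame zeroing that only hints at the idea of masking the conditioning inputs.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one frame's class scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mask_features(features, mask):
    """Hypothetical conditioning mask: zero the features of frames where mask is 0."""
    return [[v * m for v in row] for row, m in zip(features, mask)]


def toy_denoiser(x_t, features, step):
    """Stand-in for a learned denoiser (assumption, not the paper's model).

    A real model would use the noisy sequence x_t and the step index;
    this toy version ignores both and reads the prediction off the features.
    """
    return [softmax(row) for row in features]


def alpha_bar(t, num_steps):
    """Cosine schedule (an assumed choice): 1.0 at t=0, close to 0.0 at t=num_steps."""
    return math.cos((t / num_steps) * math.pi / 2) ** 2


def sample_actions(features, num_steps=8, seed=0):
    """Generate per-frame action labels from noise, conditioned on features."""
    rng = random.Random(seed)
    num_frames, num_classes = len(features), len(features[0])
    # Start from pure Gaussian noise over the (frames x classes) score matrix.
    x = [[rng.gauss(0.0, 1.0) for _ in range(num_classes)]
         for _ in range(num_frames)]
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
        x0 = toy_denoiser(x, features, t)  # predicted clean action scores
        new_x = []
        for x_row, x0_row in zip(x, x0):
            row = []
            for x_v, x0_v in zip(x_row, x0_row):
                # Noise implied by the current sample and the prediction.
                eps = (x_v - math.sqrt(ab_t) * x0_v) / math.sqrt(1.0 - ab_t)
                # Deterministic DDIM-style move to the previous noise level.
                row.append(math.sqrt(ab_prev) * x0_v
                           + math.sqrt(1.0 - ab_prev) * eps)
            new_x.append(row)
        x = new_x
    # After the last step the sample equals the final prediction; take argmax.
    return [max(range(num_classes), key=lambda c: row[c]) for row in x]


# Four frames, three action classes (made-up feature scores).
features = [[2.0, 0.1, 0.0], [1.5, 0.2, 0.1], [0.0, 0.3, 2.2], [0.1, 0.0, 1.9]]
labels = sample_actions(features)
```

Because the toy denoiser depends only on the features, the loop converges to the per-frame argmax of the features regardless of the starting noise. A trained network would instead refine its prediction step by step, which is where the resemblance to multi-stage iterative refinement comes from.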

updated: Sat Aug 12 2023 02:13:51 GMT+0000 (UTC)

published: Fri Mar 31 2023 10:53:24 GMT+0000 (UTC)

arXiv
