Diffusion Action Segmentation

Daochang Liu; Qiyue Li; AnhDung Dinh; Tingting Jiang; Mubarak Shah; Chang Xu

Temporal action segmentation is crucial for understanding long-form videos. Previous work on this task commonly adopts an iterative refinement paradigm built on multi-stage models. This paper proposes an essentially different framework based on denoising diffusion models, which nonetheless shares the spirit of iterative refinement. In this framework, action predictions are progressively generated from random noise, conditioned on input video features. To better model three characteristics of human actions, namely the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs. Extensive experiments on three benchmark datasets (GTEA, 50Salads, and Breakfast) show that the proposed method achieves results superior or comparable to state-of-the-art methods, demonstrating the effectiveness of a generative approach to action segmentation. Our code will be made available.
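The idea of generating frame-wise action predictions from random noise can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine schedule `alpha_bar`, the deterministic DDIM-style update, and the stand-in `toy_denoiser` (which simply takes the per-frame argmax of the conditioning scores instead of running a trained network) are all assumptions made for illustration, and the toy `features` are per-frame class scores rather than real video features.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Assumed cosine schedule: signal level is 1.0 at t=0 and near 0 at t=num_steps."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def toy_denoiser(x_t, features, t):
    """Hypothetical stand-in for a trained network.

    A real model would predict the clean action sequence from the noisy
    sequence x_t, the step t, and the video features. This toy version
    ignores x_t and t and returns a one-hot vector at the argmax of the
    conditioning scores for each frame.
    """
    x0_hat = []
    for frame in features:
        best = frame.index(max(frame))
        x0_hat.append([1.0 if c == best else 0.0 for c in range(len(frame))])
    return x0_hat


def sample(features, num_classes, num_steps=10, seed=0):
    """Iteratively refine a noisy (frames x classes) sequence into action labels."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one vector of class scores per frame.
    x = [[rng.gauss(0.0, 1.0) for _ in range(num_classes)] for _ in features]
    for t in range(num_steps, 0, -1):
        a_t = alpha_bar(t, num_steps)
        a_prev = alpha_bar(t - 1, num_steps)
        x0_hat = toy_denoiser(x, features, t)
        new_x = []
        for row, row0 in zip(x, x0_hat):
            # Noise implied by the current sample and the predicted clean sequence.
            eps = [(v - math.sqrt(a_t) * v0) / math.sqrt(1.0 - a_t)
                   for v, v0 in zip(row, row0)]
            # Deterministic step to the next, less noisy level.
            new_x.append([math.sqrt(a_prev) * v0 + math.sqrt(1.0 - a_prev) * e
                          for v0, e in zip(row0, eps)])
        x = new_x
    # Read out one action label per frame.
    return [row.index(max(row)) for row in x]


# Toy conditioning input: 4 frames, 3 action classes.
features = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
]
labels = sample(features, num_classes=3)
print(labels)  # [0, 0, 1, 2]
```

Each pass through the loop plays the role of one refinement stage: the current noisy sequence is replaced by a cleaner one that is pulled toward the denoiser's prediction, and the final step (where the assumed schedule reaches a signal level of 1.0) returns that prediction directly.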

updated: Fri Mar 31 2023 10:53:24 GMT+0000 (UTC)

published: Fri Mar 31 2023 10:53:24 GMT+0000 (UTC)

arXiv
