PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

Hanlin Wang; Yilu Wu; Sheng Guo; Limin Wang

In this paper, we study the problem of procedure planning in instructional videos, which aims to make goal-directed plans given the current visual observations in unstructured real-life videos. Previous works cast this problem as a sequence planning problem and leverage either heavy intermediate visual observations or natural language instructions as supervision, resulting in complex learning schemes and expensive annotation costs. In contrast, we treat this problem as a distribution fitting problem. Specifically, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), and thus transform the planning problem into a sampling process from this distribution. In addition, we remove the expensive intermediate supervision and simply use task labels from instructional videos as supervision instead. Our model is a U-Net based diffusion model, which directly samples action sequences from the learned distribution given the start and end observations. Furthermore, we apply an efficient projection method to provide accurate conditional guidance for our model during the learning and sampling process. Experiments on three datasets of different scales show that our PDPP model can achieve state-of-the-art performance on multiple metrics, even without the task supervision. Code and trained models are available at https://github.com/MCG-NJU/PDPP.
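To make the idea of sampling with projected conditions more concrete, here is a minimal sketch; it is not taken from the released code. It assumes the sample is a matrix with one row per planning step, whose columns hold action scores followed by observation features, and that the projection resets the observation entries of the first and last rows to the given start and goal observations after every denoising step. The names `project`, `toy_denoiser`, and `sample_plan` are hypothetical, and the denoiser is a placeholder rather than the U-Net described in the abstract.

```python
import random


def project(x, start_obs, goal_obs, action_dim):
    """Return a copy of x with the conditioned entries reset to known values.

    Assumed layout (illustrative only): each row is
    [action scores..., observation features...].
    """
    x = [row[:] for row in x]
    x[0][action_dim:] = start_obs    # first step carries the start observation
    x[-1][action_dim:] = goal_obs    # last step carries the goal observation
    return x


def toy_denoiser(x, t):
    """Hypothetical stand-in for the learned network: shrinks the sample."""
    return [[0.9 * v for v in row] for row in x]


def sample_plan(start_obs, goal_obs, horizon, action_dim, steps=10, seed=0):
    """Draw noise, then alternate denoising and projection."""
    rng = random.Random(seed)
    width = action_dim + len(start_obs)
    x = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(horizon)]
    x = project(x, start_obs, goal_obs, action_dim)
    for t in reversed(range(steps)):
        x = toy_denoiser(x, t)
        # Re-impose the conditions so denoising cannot drift away from them.
        x = project(x, start_obs, goal_obs, action_dim)
    # Read one action index per step from the action columns.
    actions = [max(range(action_dim), key=lambda j: row[j]) for row in x]
    return actions, x


actions, x = sample_plan([1.0, 1.0], [2.0, 2.0], horizon=3, action_dim=4)
```

Because the projection is applied after every step, the start and goal entries of the final sample match the given observations exactly, whatever the denoiser does to the remaining entries.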

updated: Sun Mar 26 2023 10:50:16 GMT+0000 (UTC)

published: Sun Mar 26 2023 10:50:16 GMT+0000 (UTC)

arXiv
