STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Anshul Shah; Benjamin Lundell; Harpreet Sawhney; Rama Chellappa

We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two stages: representation learning and key step extraction. For representation learning, we use a self-supervised training strategy that adapts off-the-shelf video features with a temporal module. Training uses self-supervised losses over multiple cues extracted from the videos, such as appearance, motion, and pose trajectories, to learn generalizable representations. Key steps are then extracted by a tunable algorithm that clusters these representations. We evaluate the approach quantitatively on key step localization and show that the learned representations are also effective on related downstream tasks such as phase classification. Qualitative results indicate that the extracted key steps are meaningful and represent the procedural tasks succinctly.
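
The abstract does not specify the clustering algorithm, so the following is only a minimal sketch of the general idea of extracting key steps by clustering per-frame representations. It assumes plain k-means as a stand-in for the paper's tunable algorithm, and the names `extract_key_steps`, `features`, and `num_steps` are hypothetical. Each cluster is summarized by the frame closest to its centroid.

```python
import numpy as np


def extract_key_steps(features, num_steps, num_iters=50):
    """Cluster per-frame features with plain k-means (an assumed stand-in,
    not the paper's algorithm) and return one representative frame index
    per cluster, in temporal order, plus the per-frame cluster labels."""
    features = np.asarray(features, dtype=float)
    num_frames = len(features)

    # Deterministic init: take the middle frame of each equal-length segment.
    init_idx = ((np.arange(num_steps) + 0.5) * num_frames / num_steps).astype(int)
    centroids = features[init_idx].copy()

    for _ in range(num_iters):
        # Distance from every frame to every centroid: shape (frames, steps).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        new_centroids = centroids.copy()
        for k in range(num_steps):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)

        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment and one representative frame per cluster.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    key_frames = sorted({int(dists[:, k].argmin()) for k in range(num_steps)})
    return key_frames, labels


# Toy usage: 30 frames of 2-D features forming three well-separated phases.
offsets = 0.1 * np.arange(10)
toy_features = np.concatenate([
    np.stack([offsets, np.zeros(10)], axis=1),
    np.stack([10 + offsets, np.full(10, 10.0)], axis=1),
    np.stack([20 + offsets, np.zeros(10)], axis=1),
])
toy_key_frames, toy_labels = extract_key_steps(toy_features, num_steps=3)
```

Here `num_steps` plays the role of the tunable knob: a larger value yields a finer-grained summary of the procedure. In practice the input would be the learned video representations rather than toy 2-D points.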

updated: Mon Jan 02 2023 18:32:45 GMT+0000 (UTC)

published: Mon Jan 02 2023 18:32:45 GMT+0000 (UTC)

arXiv
