Streaming Video Temporal Action Segmentation In Real Time

Wujun Wen; Yunheng Li; Zhuben Dong; Lin Feng; Wanxiao Yang; Shenlan Liu

リアルタイムでのストリーミングビデオの時間的アクションセグメンテーション

時間アクションセグメンテーション (TAS) は、ビデオの長期的な理解に向けた重要なステップです。最近の研究では、生のビデオ画像情報ではなく、特徴に基づいてモデルを構築するパターンに従っています。ただし、これらのモデルは複雑にトレーニングされており、アプリケーションシナリオが制限されていると主張しています。完全なビデオ機能が抽出された後に機能する必要があるため、ビデオの人間の行動をリアルタイムでセグメント化することは困難です。リアルタイムアクションセグメンテーションタスクは TAS タスクとは異なるため、ストリーミングビデオリアルタイム一時的アクションセグメンテーション (SVTAS) タスクとして定義します。この論文では、SVTASタスクのリアルタイムエンドツーエンドマルチモダリティモデルを提案します。より具体的には、将来の情報が得られない状況下で、ストリーミングビデオチャンクの現在の人間の行動をリアルタイムでセグメント化します。さらに、私たちが提案するモデルは、言語モデルによって抽出された最後のストリーミングビデオチャンクの特徴と、画像モデルによって抽出された現在の画像特徴を組み合わせて、リアルタイムの一時的なアクションセグメンテーションの量を改善します。私たちの知る限りでは、これは最初のマルチモダリティリアルタイム時間アクションセグメンテーションモデルです。完全なビデオの時間アクションセグメンテーションと同じ評価基準の下で、私たちのモデルは、最先端のモデル計算の 40% 未満で人間のアクションをリアルタイムでセグメント化し、完全なビデオの最新の精度の 90% を達成します。アートモデル。

Temporal action segmentation (TAS) is a critical step toward long-term video understanding. Recent studies follow a pattern that builds models based on features instead of raw video picture information. However, we claim those models are trained complicatedly and limit application scenarios. It is hard for them to segment human actions of video in real time because they must work after the full video features are extracted. As the real-time action segmentation task is different from TAS task, we define it as streaming video real-time temporal action segmentation (SVTAS) task. In this paper, we propose a real-time end-to-end multi-modality model for SVTAS task. More specifically, under the circumstances that we cannot get any future information, we segment the current human action of streaming video chunk in real time. Furthermore, the model we propose combines the last steaming video chunk feature extracted by language model with the current image feature extracted by image model to improve the quantity of real-time temporal action segmentation. To the best of our knowledge, it is the first multi-modality real-time temporal action segmentation model. Under the same evaluation criteria as full video temporal action segmentation, our model segments human action in real time with less than 40% of state-of-the-art model computation and achieves 90% of the accuracy of the full video state-of-the-art model.

updated: Tue Oct 10 2023 03:00:48 GMT+0000 (UTC)

published: Wed Sep 28 2022 03:27:37 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト