Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream

Yue Tang; Yawen Wu; Peipei Zhou; Jingtong Hu

Detecting actions in videos has been widely applied in on-device applications. Practical on-device videos are typically untrimmed and contain both actions and background. A model should therefore both recognize the class of an action and localize the temporal position where it occurs. This task is called temporal action localization (TAL), and TAL models are usually trained in the cloud, where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to learn continuously and locally from new data, which can directly improve action detection precision while protecting customers' privacy. However, training a TAL model is non-trivial, since it requires a large number of video samples with temporal annotations, and annotating videos frame by frame is prohibitively time-consuming and expensive. Although weakly-supervised TAL (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, this approach is also unsuitable for on-device learning scenarios. In practical on-device learning applications, data are collected as a stream. Dividing such a long video stream into multiple video segments requires substantial human effort, which hinders the application of TAL to realistic on-device learning. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with contrast score-based segment merging to convert the video stream into multiple segments. Then, we explore different sampling strategies on the TAL tasks to request as few labels as possible. To the best of our knowledge, this is the first attempt to learn directly from a long, on-device video stream.

updated: Thu Aug 25 2022 13:41:03 GMT+0000 (UTC)

published: Thu Aug 25 2022 13:41:03 GMT+0000 (UTC)

arXiv
