Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Shihao Zou; Yuxuan Mu; Xinxin Zuo; Sen Wang; Li Cheng

The event camera, an emerging biologically inspired vision sensor for capturing motion dynamics, presents new potential for 3D human pose tracking, or video-based 3D human pose estimation. However, existing works in pose tracking either require additional gray-scale images to establish a solid starting pose, or ignore temporal dependencies altogether by collapsing segments of event streams into static event frames. Meanwhile, although the effectiveness of Artificial Neural Networks (ANNs, a.k.a. dense deep learning) has been showcased in many event-based tasks, the use of ANNs tends to neglect the fact that, compared to dense frame-based image sequences, the events from an event camera are spatiotemporally much sparser. Motivated by the above-mentioned issues, we present in this paper a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge, this is the first time that 3D human pose tracking is obtained from events only, thus eliminating the need to access any frame-based images as part of the input; 2) our approach is based entirely upon the framework of Spiking Neural Networks (SNNs), and consists of a Spike-Element-Wise (SEW) ResNet and a novel Spiking Spatiotemporal Transformer; 3) a large-scale synthetic dataset, named SynEventHPD, is constructed that features a broad and diverse set of annotated 3D human motions, as well as more hours of event stream data. Empirical experiments demonstrate that, in addition to superior performance over state-of-the-art (SOTA) ANN counterparts, our approach also achieves a significant computation reduction of 80% in FLOPs. Furthermore, our proposed method also outperforms SOTA SNNs in the regression task of human pose tracking. Our implementation is available at https://github.com/JimmyZou/HumanPoseTracking_SNN and the dataset will be released upon paper acceptance.
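To make the contrast between collapsed event frames and temporally resolved event input concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical event layout of (timestamp in integer microseconds, x, y, polarity), and the helper names `bin_events`, `collapse`, and `density` are illustrative only; none of them come from the paper or its repository.

```python
from typing import List, Tuple

# Assumed event layout (hypothetical): (timestamp_us, x, y, polarity)
Event = Tuple[int, int, int, int]
Grid = List[List[int]]


def bin_events(events: List[Event], width: int, height: int,
               t_start: int, t_end: int, num_bins: int) -> List[Grid]:
    """Split [t_start, t_end) into num_bins equal windows and mark, per window,
    which pixels received at least one event (binary grids indexed [bin][y][x])."""
    grids = [[[0] * width for _ in range(height)] for _ in range(num_bins)]
    span = t_end - t_start
    for t, x, y, _polarity in events:
        if not (t_start <= t < t_end):
            continue  # event falls outside the segment
        b = (t - t_start) * num_bins // span  # integer bin index
        grids[b][y][x] = 1
    return grids


def collapse(grids: List[Grid]) -> Grid:
    """Merge all time bins into one static frame; temporal order is lost."""
    height, width = len(grids[0]), len(grids[0][0])
    return [[1 if any(g[y][x] for g in grids) else 0 for x in range(width)]
            for y in range(height)]


def density(grids: List[Grid]) -> float:
    """Fraction of active entries across all bins and pixels."""
    total = sum(len(row) for g in grids for row in g)
    active = sum(v for g in grids for row in g for v in row)
    return active / total


# A point moving left to right across a 4x1 sensor over 40 microseconds.
events = [(0, 0, 0, 1), (10, 1, 0, 1), (20, 2, 0, -1), (30, 3, 0, 1)]
grids = bin_events(events, width=4, height=1, t_start=0, t_end=40, num_bins=4)
frame = collapse(grids)
```

In this toy case the collapsed `frame` marks every pixel as active and no longer indicates the direction of motion, while the binned `grids` keep the order of events and remain mostly zeros, which is the kind of sparsity that spiking networks are intended to exploit.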

updated: Wed Sep 06 2023 21:34:59 GMT+0000 (UTC)

published: Thu Mar 16 2023 22:56:12 GMT+0000 (UTC)

arXiv
