P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation

Wenkang Shan; Zhenhua Liu; Xinfeng Zhang; Shanshe Wang; Siwei Ma; Wen Gao

This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, a self-supervised pre-training sub-task, termed masked pose modeling, is proposed. The human joints in the input sequence are randomly masked in both the spatial and temporal domains. A general form of denoising auto-encoder is used to recover the original 2D poses, which enables the encoder to capture spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator that predicts the 3D pose in the current frame. Notably, an MLP block is used as the spatial feature extractor in STMO, which yields better performance than other methods. In addition, a temporal downsampling strategy is proposed to reduce data redundancy. Extensive experiments on two benchmarks show that our method outperforms state-of-the-art methods with fewer parameters and less computational overhead. For example, our P-STMO model achieves an MPJPE of 42.1 mm on the Human3.6M dataset when using 2D poses from CPN as inputs, while running 1.5 to 7.1 times faster than state-of-the-art methods. Code is available at https://github.com/paTRICK-swk/P-STMO.
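The masking, downsampling, and evaluation steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the parameters (temporal_ratio, num_masked_joints, stride), and the use of a zero placeholder for hidden joints are all assumptions made for illustration, and the abstract does not specify how masked joints are represented or which ratios are used.

```python
import math
import random


def mask_pose_sequence(poses, temporal_ratio, num_masked_joints, seed=0):
    """Randomly hide joints of a 2D pose sequence in time and space.

    poses: list of T frames, each a list of J (x, y) tuples.
    Returns (masked_poses, mask), where mask[t][j] is True if joint j
    of frame t was hidden. Hidden joints are replaced by (0.0, 0.0),
    which is a placeholder assumption, not the paper's choice.
    """
    rng = random.Random(seed)
    num_frames = len(poses)
    num_joints = len(poses[0])

    # Temporal masking: hide every joint of a random subset of frames.
    num_masked_frames = int(temporal_ratio * num_frames)
    masked_frames = set(rng.sample(range(num_frames), num_masked_frames))

    masked_poses, mask = [], []
    for t, frame in enumerate(poses):
        if t in masked_frames:
            hidden = set(range(num_joints))
        else:
            # Spatial masking: hide a few random joints in each kept frame.
            hidden = set(rng.sample(range(num_joints), num_masked_joints))
        masked_poses.append(
            [(0.0, 0.0) if j in hidden else xy for j, xy in enumerate(frame)]
        )
        mask.append([j in hidden for j in range(num_joints)])
    return masked_poses, mask


def temporal_downsample(poses, stride):
    """Keep every stride-th frame to reduce redundancy between neighbours."""
    return poses[::stride]


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(pred)
```

In a pre-training loop under these assumptions, the auto-encoder would receive masked_poses and be trained to reconstruct the original poses, with mask indicating which joints were hidden. The mpjpe helper reports its result in the units of the input coordinates, so it yields millimetres only when the 3D joint positions are given in millimetres.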

updated: Tue Mar 15 2022 04:00:59 GMT+0000 (UTC)

published: Tue Mar 15 2022 04:00:59 GMT+0000 (UTC)

arXiv
