Learning Variational Motion Prior for Video-based Motion Capture

Xin Chen; Zhuo Su; Lingbo Yang; Pei Cheng; Lan Xu; Bin Fu; Gang Yu

Motion capture from monocular video is crucial for humans to naturally experience and interact with each other in Virtual Reality (VR) and Augmented Reality (AR). However, existing methods still struggle with challenging cases involving self-occlusion and complex poses because they lack effective motion prior modeling. In this paper, we present a novel variational motion prior (VMP) learning approach for video-based motion capture to resolve this issue. Instead of directly building the correspondence between the video and motion domains, we propose to learn a generic latent space that captures the prior distribution of all natural motions and serves as the basis for subsequent video-based motion capture tasks. To improve the generalization capacity of the prior space, we propose a transformer-based variational autoencoder pretrained on marker-based 3D mocap data, with a novel style-mapping block to boost the generation quality. Afterward, a separate video encoder is attached to the pretrained motion generator for end-to-end fine-tuning on task-specific video datasets. Compared to existing motion prior models, our VMP model serves as a motion rectifier that can effectively reduce temporal jittering and failure modes in frame-wise pose estimation, leading to temporally stable and visually realistic motion capture results. Furthermore, our VMP-based framework models motion at the sequence level and can directly generate motion clips in a single forward pass, achieving real-time motion capture during inference. Extensive experiments on both public datasets and in-the-wild videos demonstrate the efficacy and generalization capability of our framework.

updated: Fri Oct 28 2022 02:32:05 GMT+0000 (UTC)

published: Thu Oct 27 2022 02:45:48 GMT+0000 (UTC)

arXiv
