Scene-Aware 3D Multi-Human Motion Capture from a Single Camera

Diogo Luvizon; Marc Habermann; Vladislav Golyanik; Adam Kortylewski; Christian Theobalt

In this work, we consider the problem of estimating the 3D position of multiple humans in a scene, as well as their body shape and articulation, from a single RGB video recorded with a static camera. In contrast to expensive marker-based or multi-view systems, our lightweight setup is ideal for private users, as it enables affordable 3D motion capture that is easy to install and does not require expert knowledge. To deal with this challenging setting, we leverage recent advances in computer vision, using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks. Building on these predictions, we introduce the first non-linear optimization-based approach that jointly solves for the absolute 3D position of each human, their articulated pose, their individual shape, and the scale of the scene. In particular, we estimate the scene depth and a person-specific scale from normalized disparity predictions using the 2D body joints and joint angles. Given the per-frame scene depth, we reconstruct a point cloud of the static scene in 3D space. Finally, given the per-frame 3D estimates of the humans and the scene point cloud, we perform a space-time coherent optimization over the video to ensure temporal, spatial, and physical plausibility. We evaluate our method on established multi-person 3D human pose benchmarks, where we consistently outperform previous methods, and we qualitatively demonstrate that our method is robust to in-the-wild conditions, including challenging scenes with people of different sizes.
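The abstract does not spell out how normalized disparity is turned into scene depth, so the following is only a minimal sketch under stated assumptions, not the authors' formulation. It assumes an affine relation between normalized disparity and inverse metric depth (1/depth ≈ scale * disparity + shift) and fits the two unknowns by least squares from a few pixels whose depth is roughly known, for example body joints placed by a fitted body model. The function names and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def fit_disparity_scale_shift(
    disparity: Sequence[float], depth: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of 1/depth ~= scale * disparity + shift.

    `disparity` holds normalized disparity values sampled at some pixels;
    `depth` holds assumed metric depths (in meters) at the same pixels.
    The affine model is an assumption of this sketch.
    """
    if len(disparity) != len(depth) or len(disparity) < 2:
        raise ValueError("need at least two matched samples")
    inv_depth = [1.0 / z for z in depth]
    n = len(disparity)
    mean_d = sum(disparity) / n
    mean_i = sum(inv_depth) / n
    # Closed-form simple linear regression of inverse depth on disparity.
    var_d = sum((d - mean_d) ** 2 for d in disparity)
    if var_d == 0.0:
        raise ValueError("disparity samples must not all be equal")
    cov = sum((d - mean_d) * (i - mean_i) for d, i in zip(disparity, inv_depth))
    scale = cov / var_d
    shift = mean_i - scale * mean_d
    return scale, shift


def disparity_to_depth(disparity: float, scale: float, shift: float) -> float:
    """Convert one normalized disparity value to depth with the fitted model."""
    return 1.0 / (scale * disparity + shift)


# Hypothetical samples: disparity at three joint pixels and their assumed depths.
joint_disparity = [0.8, 0.3, 0.2]
joint_depth_m = [2.0, 4.0, 5.0]
scale, shift = fit_disparity_scale_shift(joint_disparity, joint_depth_m)
```

Once `scale` and `shift` are fitted, `disparity_to_depth` can be applied to every pixel of the disparity map to obtain a per-frame depth map, which could then be back-projected into a point cloud. A joint optimization over pose, shape, and scale, as the abstract describes, would go beyond this per-frame fit.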

updated: Mon Mar 27 2023 06:59:55 GMT+0000 (UTC)

published: Thu Jan 12 2023 18:01:28 GMT+0000 (UTC)

arXiv
