FLEX: Extrinsic Parameters-free Multi-view 3D Human Motion Reconstruction

Brian Gordon; Sigal Raab; Guy Azov; Raja Giryes; Daniel Cohen-Or

The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative transformations between the cameras. Such a dependency becomes a hurdle once shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end extrinsic parameter-free multi-view model. FLEX is extrinsic parameter-free (dubbed ep-free) in the sense that it does not require extrinsic camera parameters. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on three public datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not ep-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, and other materials are available on our project page.
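
The invariance the abstract relies on can be checked numerically. The sketch below is a toy illustration, not the FLEX network or any code from the authors: the three-joint "arm", the two camera poses, and every function name are assumptions made up for this example. It expresses the same skeleton in two camera coordinate frames (x_cam = R x_world + t) and compares bone lengths and the angle between adjacent bones across the two frames.

```python
import math

def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))

def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def to_camera(joints, R, t):
    """Map world-space joints into a camera frame: x_cam = R @ x_world + t."""
    return {name: tuple(c + ti for c, ti in zip(mat_vec(R, p), t))
            for name, p in joints.items()}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def bone_length(joints, parent, child):
    return norm(sub(joints[child], joints[parent]))

def joint_angle(joints, a, b, c):
    """Angle (radians) at joint b between the bones b->a and b->c."""
    u, v = sub(joints[a], joints[b]), sub(joints[c], joints[b])
    cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos)))

def descriptors(joints):
    """Bone lengths and the elbow angle of the toy arm."""
    return (bone_length(joints, "shoulder", "elbow"),
            bone_length(joints, "elbow", "wrist"),
            joint_angle(joints, "shoulder", "elbow", "wrist"))

# Hypothetical three-joint arm in world coordinates (metres).
world = {"shoulder": (0.0, 1.4, 0.0),
         "elbow": (0.3, 1.2, 0.1),
         "wrist": (0.5, 1.3, 0.3)}

# Two made-up camera extrinsics (rotation, translation).
cam_a = to_camera(world, rot_y(0.4), (0.0, 0.0, 3.0))
cam_b = to_camera(world, mat_mul(rot_x(-0.7), rot_y(2.1)), (1.5, -0.2, 4.0))

desc_a, desc_b = descriptors(cam_a), descriptors(cam_b)

# Joint positions depend on the camera frame...
positions_differ = any(abs(x - y) > 1e-6
                       for x, y in zip(cam_a["wrist"], cam_b["wrist"]))
# ...while bone lengths and inter-bone angles do not.
same_descriptors = all(math.isclose(x, y, abs_tol=1e-9)
                       for x, y in zip(desc_a, desc_b))

print("positions differ across views:", positions_differ)
print("lengths and angle agree across views:", same_descriptors)
```

Because a rigid transform preserves distances and angles, the descriptors agree in both frames without the extrinsics ever being used to compare them, which is the property that lets a model predict one shared set of values for all views.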

updated: Fri Oct 21 2022 14:56:49 GMT+0000 (UTC)

published: Wed May 05 2021 09:08:12 GMT+0000 (UTC)

arXiv
