PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound

Zhijian Yang; Xiaoran Fan; Volkan Isler; Hyun Soo Park

Reconstructing the 3D pose of a person in metric scale from a single-view image is a geometrically ill-posed problem. For example, we cannot measure the exact distance of a person to the camera from a single-view image without additional scene assumptions (e.g., known height). Existing learning-based approaches circumvent this issue by reconstructing the 3D pose up to scale. However, many applications, such as virtual telepresence, robotics, and augmented reality, require metric-scale reconstruction. In this paper, we show that audio signals recorded along with an image provide complementary information for reconstructing the metric 3D pose of the person. The key insight is that as the audio signals traverse the 3D space, their interactions with the body provide metric information about the body's pose. Based on this insight, we introduce a time-invariant transfer function called the pose kernel: the impulse response of audio signals induced by the body pose. The main properties of the pose kernel are that (1) its envelope correlates highly with 3D pose, (2) its time response corresponds to arrival time, indicating the metric distance to the microphone, and (3) it is invariant to changes in the scene geometry configuration. Therefore, it generalizes readily to unseen scenes. We design a multi-stage 3D CNN that fuses audio and visual signals and learns to reconstruct 3D pose in metric scale. We show that our multi-modal method produces accurate metric reconstruction in real-world scenes, which is not possible with state-of-the-art lifting approaches including parametric mesh regression and depth regression.
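
Property (2) above can be illustrated with a small sketch. This is not the authors' pipeline: it assumes a known emitted signal, a single microphone, a speed of sound of 343 m/s, and a simple regularized frequency-domain deconvolution, all of which are choices made here for illustration. The function names `estimate_impulse_response` and `peak_path_length` are hypothetical. The sketch recovers an impulse response from an emitted/received pair and converts the delay of its strongest peak into the length of the path the sound travelled.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for dry air at about 20 °C


def estimate_impulse_response(emitted, received, eps=1e-3):
    """Estimate h such that received ≈ emitted * h (linear convolution).

    Uses regularized division in the frequency domain; eps avoids
    dividing by near-zero spectral bins.
    """
    n = len(emitted) + len(received) - 1  # zero-pad to avoid wrap-around
    E = np.fft.rfft(emitted, n)
    R = np.fft.rfft(received, n)
    H = R * np.conj(E) / (np.abs(E) ** 2 + eps)
    return np.fft.irfft(H, n)


def peak_path_length(h, sample_rate):
    """Convert the delay of the strongest peak of h into metres."""
    delay_seconds = np.argmax(np.abs(h)) / sample_rate
    return delay_seconds * SPEED_OF_SOUND


# Synthetic check: a strong reflection arriving 480 samples after emission.
sample_rate = 48_000
rng = np.random.default_rng(0)
emitted = rng.standard_normal(4096)   # broadband probe signal
h_true = np.zeros(1024)
h_true[480] = 1.0                     # main reflection
h_true[700] = 0.3                     # weaker, later reflection
received = np.convolve(emitted, h_true)

h_est = estimate_impulse_response(emitted, received)
path_m = peak_path_length(h_est, sample_rate)
print(f"estimated path length: {path_m:.2f} m")
```

The recovered length is that of the whole path the sound travelled from emission to the microphone, which is the kind of metric cue a single image lacks. Real recordings would also contain room reflections and noise, which this sketch ignores.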

updated: Fri Dec 03 2021 00:26:50 GMT+0000 (UTC)

published: Wed Dec 01 2021 01:34:56 GMT+0000 (UTC)

arXiv
