Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Yu Cheng; Bo Wang; Robby T. Tan

Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate poses in person-centric coordinates, i.e., coordinates relative to the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection, and thus suffer from detection errors and cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address all these challenges, we propose integrating the top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons in an image patch instead of only one, making it robust to possibly erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gaps between training and testing data, we perform optimization at test time, refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone length regularization. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available: https://github.com/3dpose/3D-Multi-Person-Pose.
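
To make the test-time refinement terms more concrete, the following is a minimal, dependency-free sketch of what such an objective could look like. It is not the authors' implementation: the abstract does not give the exact formulation, so the pinhole camera model, the second-order finite difference standing in for the "high-order" temporal constraint, the variance-based bone length term, and all function names and weights are assumptions made for illustration only.

```python
import math

# Assumed data layout (hypothetical): a pose sequence is a list of frames,
# each frame a list of (x, y, z) joints in absolute camera coordinates.

def project(joint, focal, center):
    """Pinhole projection of one camera-space joint to pixel coordinates."""
    x, y, z = joint
    return (focal * x / z + center[0], focal * y / z + center[1])

def reprojection_loss(poses_3d, poses_2d, focal, center):
    """Mean squared pixel distance between projected 3D joints and 2D estimates."""
    total, count = 0.0, 0
    for frame_3d, frame_2d in zip(poses_3d, poses_2d):
        for joint, (u, v) in zip(frame_3d, frame_2d):
            pu, pv = project(joint, focal, center)
            total += (pu - u) ** 2 + (pv - v) ** 2
            count += 1
    return total / count

def bone_length_loss(poses_3d, bones):
    """Variance of each bone's length over time, averaged over bones.

    `bones` is a list of (parent_index, child_index) pairs (assumed skeleton).
    """
    total = 0.0
    for parent, child in bones:
        lengths = [math.dist(frame[parent], frame[child]) for frame in poses_3d]
        mean = sum(lengths) / len(lengths)
        total += sum((length - mean) ** 2 for length in lengths) / len(lengths)
    return total / len(bones)

def temporal_loss(poses_3d):
    """Mean squared second-order finite difference (acceleration) per joint.

    This is an assumed stand-in for the paper's high-order temporal constraint.
    """
    total, count = 0.0, 0
    for prev, cur, nxt in zip(poses_3d, poses_3d[1:], poses_3d[2:]):
        for a, b, c in zip(prev, cur, nxt):
            total += sum((pa - 2 * pb + pc) ** 2 for pa, pb, pc in zip(a, b, c))
            count += 1
    # Fewer than three frames: no acceleration can be measured.
    return total / count if count else 0.0

def refinement_objective(poses_3d, poses_2d, bones, focal, center,
                         w_reproj=1.0, w_bone=1.0, w_temp=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (w_reproj * reprojection_loss(poses_3d, poses_2d, focal, center)
            + w_bone * bone_length_loss(poses_3d, bones)
            + w_temp * temporal_loss(poses_3d))
```

In a setting like the one described, the estimated 3D poses would be adjusted to lower such an objective on each test sequence. A sequence that moves at constant velocity with rigid bones and matches its 2D observations scores zero, while depth jitter in a single frame raises all three terms.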

updated: Wed May 04 2022 07:08:12 GMT+0000 (UTC)

published: Mon May 02 2022 08:53:38 GMT+0000 (UTC)

arXiv
