Putting People in their Place: Monocular Regression of 3D People in Depth

Yu Sun; Wu Liu; Qian Bao; Yili Fu; Tao Mei; Michael J. Black

Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g., from infants to adults. Solving this requires three things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work that estimates multiple people does so by reasoning in the image plane, our method, called BEV, adds an imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combining these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset will be released for research purposes.
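The height-depth ambiguity mentioned in the abstract can be illustrated with the standard pinhole camera relation, in which a person's height in pixels is proportional to their real height divided by their depth. The sketch below is not the BEV method and is not taken from the paper; the function name, focal length, and heights are hypothetical values chosen only to show that the same pixel height is consistent with very different depths.

```python
def depth_from_height(focal_px: float, real_height_m: float, image_height_px: float) -> float:
    """Depth under a pinhole camera model.

    Rearranges image_height_px = focal_px * real_height_m / depth.
    All argument values used below are illustrative assumptions.
    """
    return focal_px * real_height_m / image_height_px


# Hypothetical camera and observation: both people appear 400 px tall.
focal_px = 1000.0
image_height_px = 400.0

# Assumed real heights: an adult (1.7 m) and a small child (0.85 m).
adult_depth = depth_from_height(focal_px, 1.7, image_height_px)
child_depth = depth_from_height(focal_px, 0.85, image_height_px)

# The same pixel height maps to depths that differ by the ratio of the
# assumed real heights, so depth cannot be fixed without knowing height.
print(f"adult hypothesis: {adult_depth:.3f} m")
print(f"child hypothesis: {child_depth:.3f} m")
```

Under these assumed numbers the adult hypothesis places the person twice as far from the camera as the child hypothesis, which is why some estimate of age or body size is needed before depth can be resolved.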

updated: Wed Dec 15 2021 17:08:17 GMT+0000 (UTC)

published: Wed Dec 15 2021 17:08:17 GMT+0000 (UTC)

arXiv
