Capturing and Inferring Dense Full-Body Human-Scene Contact

Chun-Hao P. Huang; Hongwei Yi; Markus Höschle; Matvey Safroshkin; Tsvetelina Alexiadis; Senya Polikovsky; Daniel Scharstein; Michael J. Black

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for "Real scenes, Interaction, Contact and Humans." RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high-resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded, so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.
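
To make "vertex-level contact labels" concrete, the following is a minimal sketch of how dense contact could be represented and scored: one 0/1 label per body-mesh vertex, compared against ground truth with precision, recall, and F1. It is not taken from the BSTRO code. The function names, the 0.5 threshold, the choice of metrics, and the toy 8-vertex body are all assumptions made for illustration.

```python
from typing import List, Tuple


def binarize(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Turn per-vertex contact probabilities into 0/1 contact labels.

    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def contact_scores(predicted: List[int], ground_truth: List[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) computed over body vertices."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label vectors must cover the same vertices")
    pairs = list(zip(predicted, ground_truth))
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)  # correctly predicted contact
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)  # predicted contact, none present
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)  # missed contact
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy body with 8 vertices; real body meshes have thousands.
ground_truth = [1, 1, 0, 0, 0, 1, 0, 0]
probabilities = [0.9, 0.4, 0.1, 0.7, 0.2, 0.8, 0.3, 0.05]

predicted = binarize(probabilities)
precision, recall, f1 = contact_scores(predicted, ground_truth)
print(predicted, precision, recall, f1)
```

Because each vertex carries its own label, this representation is not restricted to a few predefined contact types or body parts: any region of the body surface can be marked as touching the scene.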

updated: Mon Jun 20 2022 03:31:00 GMT+0000 (UTC)

published: Mon Jun 20 2022 03:31:00 GMT+0000 (UTC)

arXiv
