Visibility Aware Human-Object Interaction Tracking from Single RGB Camera

Xianghui Xie; Bharat Lal Bhatnagar; Gerard Pons-Moll

Capturing the interactions between humans and their environment in 3D is important for many applications in robotics, graphics, and vision. Recent methods that reconstruct the 3D human and object from a single RGB image assume a fixed depth, and therefore do not produce consistent relative translation across frames. Moreover, their performance drops significantly when the object is occluded. In this work, we propose a novel method that tracks the 3D human, the object, the contacts between them, and their relative translation across frames from a single RGB camera, while being robust to heavy occlusions. Our method is built on two key insights. First, we condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence. This improves neural reconstruction accuracy and produces coherent relative translation across frames. Second, human and object motion from visible frames provides valuable information for inferring the occluded object. We propose a novel transformer-based neural network that explicitly uses object visibility and human motion, leveraging neighbouring frames to make predictions for the occluded frames. Building on these insights, our method is able to track both human and object robustly even under occlusions. Experiments on two datasets show that our method significantly improves over the state-of-the-art methods. Our code and pretrained models are available at: https://virtualhumans.mpi-inf.mpg.de/VisTracker
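The second insight, using visible neighbouring frames to predict the object in occluded frames, can be illustrated with a much simpler stand-in. The sketch below is not the transformer-based network described in the abstract; it swaps in a visibility-weighted temporal Gaussian average, purely to show how a per-frame visibility score can gate which frames contribute to an occluded frame's estimate. The function name `fill_occluded`, the parameters `sigma` and `vis_thresh`, and the array layout are all assumptions made for this illustration.

```python
import numpy as np


def fill_occluded(rel_trans, visibility, sigma=2.0, vis_thresh=0.5):
    """Replace object translations in occluded frames with a
    visibility-weighted average of temporally nearby frames.

    Hypothetical helper for illustration only (not the paper's model).

    rel_trans:  (T, 3) per-frame object translation relative to the human.
    visibility: (T,) per-frame object visibility score in [0, 1].
    """
    rel_trans = np.asarray(rel_trans, dtype=float)
    visibility = np.asarray(visibility, dtype=float)
    frames = np.arange(rel_trans.shape[0])

    # Temporal Gaussian weight between every pair of frames, shape (T, T).
    dist = frames[:, None] - frames[None, :]
    weights = np.exp(-0.5 * (dist / sigma) ** 2)

    # Down-weight source frames by their visibility, so frames where the
    # object is hidden contribute little or nothing.
    weights = weights * visibility[None, :]

    # Normalised weighted average for every frame.
    norm = np.maximum(weights.sum(axis=1, keepdims=True), 1e-8)
    smoothed = (weights @ rel_trans) / norm

    # Only overwrite frames judged occluded; visible frames stay untouched.
    out = rel_trans.copy()
    occluded = visibility < vis_thresh
    out[occluded] = smoothed[occluded]
    return out
```

For example, given a five-frame sequence in which the object is fully hidden in the middle frame, `fill_occluded(rel_trans, visibility)` leaves the four visible frames unchanged and replaces the middle frame's translation with an average of its visible neighbours. A learned model such as the one proposed in the paper can additionally exploit human motion, which this fixed-kernel average ignores.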

updated: Wed Mar 29 2023 06:23:44 GMT+0000 (UTC)

published: Wed Mar 29 2023 06:23:44 GMT+0000 (UTC)

arXiv
