EgoHumans: An Egocentric 3D Multi-Human Benchmark

Rawal Khirodkar; Aayush Bansal; Lingni Ma; Richard Newcombe; Minh Vo; Kris Kitani

We present EgoHumans, a new multi-view multi-human video benchmark to advance the state of the art in egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks capture either a single subject or indoor-only scenarios, which limits the generalization of computer vision algorithms to real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild, with annotations that support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We use consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities such as tennis, fencing, and volleyball. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images spanning diverse scenes, with a particular focus on challenging, unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address these limitations, we propose EgoFormer, a novel approach that uses a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the EgoHumans dataset.
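For readers unfamiliar with the tracking metric quoted above, the following minimal sketch shows how IDF1 is commonly defined: the harmonic mean of identity precision and identity recall, computed from identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN). The function name `idf1` and the counts below are hypothetical illustrations, not values or code from the paper, and the abstract does not state whether the 13.6% gain is absolute or relative.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Return IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN).

    The counts are assumed to come from an identity-preserving matching
    between predicted and ground-truth tracks (not computed here).
    """
    denom = 2 * idtp + idfp + idfn
    # Define the score as 0.0 when there are no detections or ground truth.
    return 2 * idtp / denom if denom else 0.0


# Hypothetical counts for two trackers, for illustration only.
baseline = idf1(idtp=600, idfp=250, idfn=250)
improved = idf1(idtp=750, idfp=150, idfn=150)
print(f"baseline IDF1: {baseline:.3f}, improved IDF1: {improved:.3f}")
```

Unlike detection-centric scores, IDF1 penalizes identity switches, which is why it is a natural headline metric for multi-human tracking.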

updated: Fri Aug 18 2023 23:28:45 GMT+0000 (UTC)

published: Thu May 25 2023 21:37:36 GMT+0000 (UTC)

arXiv
