AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation

Takehiko Ohkawa; Kun He; Fadime Sener; Tomas Hodan; Luan Tran; Cem Keskin

We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pose annotations for the egocentric images, we develop an efficient pipeline, where we use an initial set of manual annotations to train a model to automatically annotate a much larger dataset. Our annotation model uses multi-view feature fusion and an iterative refinement scheme, and achieves an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101. AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. Using this data, we develop a strong single-view baseline of 3D hand pose estimation from egocentric images. Furthermore, we design a novel action classification task to evaluate predicted 3D hand poses. Our study shows that having higher-quality hand poses directly improves the ability to recognize actions.
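The abstract reports annotation quality as an average keypoint error in millimetres and compares it to Assembly101 as a relative reduction. Below is a minimal sketch of how such numbers are commonly computed. It assumes the metric is the mean Euclidean distance between predicted and reference 3D keypoints, pooled over all keypoints of all hands. The paper's exact evaluation protocol (for example, root alignment or per-view averaging) is not stated here. The function names `mean_keypoint_error` and `relative_reduction` are hypothetical.

```python
import math
from typing import Sequence, Tuple

# A 3D keypoint as (x, y, z); units are whatever the inputs use (e.g. mm).
Point3D = Tuple[float, float, float]


def mean_keypoint_error(
    pred: Sequence[Sequence[Point3D]],
    gt: Sequence[Sequence[Point3D]],
) -> float:
    """Hypothetical helper: mean Euclidean distance over all keypoints of all hands.

    Assumes no root alignment or scaling; each predicted keypoint is compared
    directly with the reference keypoint at the same index.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of hands")
    total, count = 0.0, 0
    for hand_pred, hand_gt in zip(pred, gt):
        if len(hand_pred) != len(hand_gt):
            raise ValueError("each hand must have the same number of keypoints")
        for p, g in zip(hand_pred, hand_gt):
            total += math.dist(p, g)  # Euclidean distance (Python 3.8+)
            count += 1
    if count == 0:
        raise ValueError("no keypoints to evaluate")
    return total / count


def relative_reduction(new_error: float, old_error: float) -> float:
    """Hypothetical helper: fraction by which new_error is lower than old_error."""
    return 1.0 - new_error / old_error


if __name__ == "__main__":
    # Toy data: one hand with two keypoints; errors are 0 and 5 units.
    pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
    gt = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
    print(mean_keypoint_error(pred, gt))  # → 2.5
    print(relative_reduction(1.0, 4.0))   # → 0.75
```

Under this definition, "85% lower" means `relative_reduction(new, old)` is about 0.85, so the new error is roughly 15% of the original.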

updated: Mon Apr 24 2023 17:52:57 GMT+0000 (UTC)

published: Mon Apr 24 2023 17:52:57 GMT+0000 (UTC)

arXiv