H2O: Two Hands Manipulating Objects for First Person Interaction Recognition

Taein Kwon; Bugra Tekin; Jan Stuhmer; Federica Bogo; Marc Pollefeys

We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects. To this end, we propose a method to create a unified dataset for egocentric 3D interaction recognition. Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame. Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left and right hands, 6D object poses, ground-truth camera poses, object meshes, and scene point clouds. To the best of our knowledge, this is the first benchmark that enables the study of first-person actions using the poses of both left and right hands manipulating objects, and it presents an unprecedented level of detail for egocentric 3D interaction recognition. We further propose a method to predict interaction classes by jointly estimating the 3D pose of two hands and the 6D pose of the manipulated objects from RGB images. Our method models both inter- and intra-dependencies between hands and objects by learning the topology of a graph convolutional network that predicts interactions. We show that our method, facilitated by this dataset, establishes a strong baseline for joint hand-object pose estimation and achieves state-of-the-art accuracy for first-person interaction recognition.
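The idea of learning the topology of a graph convolutional network can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one graph node per hand joint and per object keypoint, a dense adjacency matrix stored as free parameters (so edges within one hand and edges between hands and object can both emerge), and a single layer followed by mean pooling. The node counts, class count, layer sizes, and all function names are hypothetical, and the parameters are randomly initialized rather than trained.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
NUM_HAND_JOINTS = 21      # assumed keypoints per hand
NUM_OBJECT_POINTS = 21    # assumed keypoints describing the object pose
NUM_NODES = 2 * NUM_HAND_JOINTS + NUM_OBJECT_POINTS
NUM_CLASSES = 5           # hypothetical number of interaction classes
HIDDEN = 16               # hypothetical hidden feature size


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def row_normalize(a):
    """Scale each row to sum to 1 so a node averages over its neighbours."""
    return a / a.sum(axis=1, keepdims=True)


def graph_conv(x, adj_logits, weight):
    """One graph convolution with a learnable, dense adjacency.

    adj_logits is a free (NUM_NODES, NUM_NODES) parameter. In training it
    would be updated by gradient descent together with `weight`, which is
    what lets the network discover its own topology.
    """
    adjacency = row_normalize(sigmoid(adj_logits))
    return np.maximum(adjacency @ x @ weight, 0.0)  # ReLU activation


def predict_interaction(nodes_xyz, params):
    """Map per-node 3D coordinates of one frame to class probabilities."""
    h = graph_conv(nodes_xyz, params["adj_logits"], params["w1"])
    pooled = h.mean(axis=0)                 # aggregate over all nodes
    logits = pooled @ params["w_out"]
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
params = {
    "adj_logits": rng.normal(size=(NUM_NODES, NUM_NODES)),
    "w1": rng.normal(size=(3, HIDDEN)),
    "w_out": rng.normal(size=(HIDDEN, NUM_CLASSES)),
}

# Stand-in for estimated poses: left hand, right hand, then object keypoints.
nodes_xyz = rng.normal(size=(NUM_NODES, 3))
probs = predict_interaction(nodes_xyz, params)
print(probs.shape, float(probs.sum()))
```

Because the adjacency is dense and learned, no hand skeleton or hand-object contact pattern is fixed in advance. A fixed skeleton prior plus a learned offset would be an alternative design under the same assumptions.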

updated: Tue Aug 24 2021 15:21:38 GMT+0000 (UTC)

published: Thu Apr 22 2021 17:10:42 GMT+0000 (UTC)

arXiv
