Detecting Human-Object Contact in Images

Yixin Chen; Sai Kumar Dwivedi; Michael J. Black; Dimitrios Tzionas

Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate the involved body part of the human body. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability.
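
The automatic annotation step for PROX ("3D mesh proximity and projection") can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the 5 cm distance threshold, the brute-force nearest-point search, and the pinhole camera model with intrinsics K are all assumptions made for illustration. It marks body vertices that lie close to the scene as "in contact" and projects them into a 2D mask.

```python
import numpy as np

def contact_mask_from_proximity(body_vertices, scene_points, K, image_hw, threshold=0.05):
    """Hypothetical helper: label body vertices near the scene and project them to 2D.

    body_vertices: (N, 3) body mesh vertices in camera coordinates (metres).
    scene_points:  (M, 3) scene surface points in the same coordinates.
    K:             (3, 3) pinhole intrinsics (assumed camera model).
    image_hw:      (height, width) of the output mask.
    threshold:     contact distance in metres (assumed value, not from the paper).
    """
    # Brute-force pairwise distances, shape (N, M); fine for a small sketch.
    diffs = body_vertices[:, None, :] - scene_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # A vertex counts as "in contact" if its nearest scene point is close enough.
    in_contact = dists.min(axis=1) < threshold

    h, w = image_hw
    mask = np.zeros((h, w), dtype=bool)

    # Keep only contact vertices in front of the camera before projecting.
    pts = body_vertices[in_contact]
    pts = pts[pts[:, 2] > 0]
    if pts.shape[0] == 0:
        return mask, in_contact

    # Pinhole projection: (u, v) = (fx * x / z + cx, fy * y / z + cy).
    uvw = pts @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)

    # Discard projections that fall outside the image.
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    mask[rows[inside], cols[inside]] = True
    return mask, in_contact
```

The sketch sets only the pixels that individual vertices project to. Producing contiguous contact areas would require rasterizing the mesh faces, which is omitted here.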

updated: Mon Mar 06 2023 18:56:26 GMT+0000 (UTC)

published: Mon Mar 06 2023 18:56:26 GMT+0000 (UTC)

arXiv
