Discovering a Variety of Objects in Spatio-Temporal Human-Object Interactions

Yong-Lu Li; Hongwei Fan; Zuoyu Qiu; Yiming Dou; Liang Xu; Hao-Shu Fang; Peiyang Guo; Haisheng Su; Dongliang Wang; Wei Wu; Cewu Lu

Spatio-temporal Human-Object Interaction (ST-HOI) detection aims to detect HOIs in videos, which is crucial for activity understanding. In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items while cleaning. However, existing whole body-object interaction video benchmarks usually provide a limited set of object classes. Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO), which includes 51 interactions and 1,000+ objects. Accordingly, we propose an ST-HOI learning task that expects vision systems to track human actors, detect interactions, and simultaneously discover the interacted objects. Even though today's detectors/trackers excel at object detection/tracking tasks, they perform unsatisfactorily when localizing the diverse/unseen objects in DIO. This reveals a limitation of current vision systems and poses a great challenge. We therefore explore how to leverage spatio-temporal cues for object discovery, and devise a Hierarchical Probe Network (HPN) that discovers interacted objects using hierarchical spatio-temporal human/context cues. In extensive experiments, HPN demonstrates impressive performance. Data and code are available at https://github.com/DirtyHarryLYL/HAKE-AVA.

updated: Mon Nov 14 2022 16:33:54 GMT+0000 (UTC)

published: Mon Nov 14 2022 16:33:54 GMT+0000 (UTC)

arXiv
