Human-to-Human Interaction Detection

Zhenhua Wang; Kaining Ying; Jiajun Meng; Jifeng Ning

A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of great importance to public-security surveillance in areas such as campuses, squares and parks. Conventional human interaction recognition uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages. In contrast, we introduce a new task named human-to-human interaction detection (HID). HID is devoted to detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, all in one model. First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding frame-by-frame annotations of interactive relations. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present SaMFormer, a novel baseline approach for HID, which contains a visual feature extractor, a split stage that leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage that reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods. The dataset and code will be made public to encourage follow-up studies.
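To make the shape of an HID prediction concrete, the following is a minimal sketch of what a split-then-merge output might look like. It is not the SaMFormer implementation: the class names, the embedding fields, the cosine-similarity matching rule, and the threshold are all assumptions introduced here for illustration, and the action labels are placeholders. The abstract only states that instances and groups are decoded separately and that their relationship is then reconstructed.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class PersonInstance:
    """One detected person (hypothetical structure, not from the paper)."""
    box: tuple        # (x1, y1, x2, y2) in pixels
    actions: list     # per-person action labels
    embedding: list   # assumed feature vector from an instance decoder


@dataclass
class InteractiveGroup:
    """One candidate interactive group (hypothetical structure)."""
    embedding: list                              # assumed feature vector from a group decoder
    members: list = field(default_factory=list)  # indices into the instance list


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge(instances, groups, threshold=0.5):
    """Assumed merge rule: attach each person to the most similar group.

    A person is attached only if the best similarity reaches `threshold`.
    Groups that end up with fewer than two members are dropped, so
    unmatched people are treated as non-interactive.
    """
    for g in groups:
        g.members = []
    if not groups:
        return []
    for idx, person in enumerate(instances):
        scores = [cosine(person.embedding, g.embedding) for g in groups]
        best = max(range(len(groups)), key=lambda k: scores[k])
        if scores[best] >= threshold:
            groups[best].members.append(idx)
    return [g for g in groups if len(g.members) >= 2]


# Toy example: two people interacting, one bystander, one spurious group.
people = [
    PersonInstance((10, 20, 60, 180), ["stand", "talk to"], [1.0, 0.0]),
    PersonInstance((70, 25, 120, 185), ["stand", "listen to"], [0.9, 0.1]),
    PersonInstance((300, 30, 350, 190), ["walk"], [0.0, 1.0]),
]
candidates = [InteractiveGroup([1.0, 0.0]), InteractiveGroup([-1.0, -1.0])]
result = merge(people, candidates)
```

In this toy input the first two people attach to the first candidate group, while the third person and the second candidate group are left unmatched. A learned model would replace the fixed similarity rule with trained matching, but the output shape (boxes, per-person actions, and group membership) follows the three sub-tasks named in the abstract.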

updated: Fri Aug 11 2023 10:08:46 GMT+0000 (UTC)

published: Sun Jul 02 2023 03:24:58 GMT+0000 (UTC)

arXiv
