Relation-Based Associative Joint Location for Human Pose Estimation in Videos

Yonghao Dang; Jianqin Yin; Shaojie Zhang

Video-based human pose estimation (VHPE) is a vital yet challenging task. Deep learning methods have made significant progress on VHPE, but most approaches model the long-range interaction between joints only implicitly, by enlarging the receptive field of the convolution. In contrast, we design a lightweight, plug-and-play joint relation extractor (JRE) that models the associative relationship between joints explicitly and automatically. The JRE takes the pseudo heatmaps of joints as input and calculates the similarity between them. In this way, the JRE flexibly learns the relationship between any two joints, and thereby the rich spatial configuration of human poses. Moreover, the JRE can infer invisible joints from the relationships between joints, which helps the model locate occluded joints. Combining the JRE with temporal semantic continuity modeling, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation. Specifically, to capture the temporal dynamics of poses, the pose semantic information of the current frame is transferred to the next frame with a joint relation guided pose semantics propagator (JRPSP). The proposed model can transfer pose semantic features from non-occluded frames to occluded frames, making our method robust to occlusion. Furthermore, the proposed JRE module is also suitable for image-based human pose estimation. The proposed RPSTN achieves state-of-the-art results on the video-based Penn Action, Sub-JHMDB, and PoseTrack2018 datasets. Moreover, the proposed JRE improves the performance of backbones on the image-based COCO2017 dataset. Code is available at https://github.com/YHDang/pose-estimation.
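
To make the idea of "calculating the similarity between pseudo heatmaps" concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state which similarity measure the JRE uses, so cosine similarity between flattened heatmaps is an assumption, and the function names `joint_relation_matrix` and `refine_with_relations`, as well as the softmax-weighted mixing step, are hypothetical illustrations of how a relation matrix could let well-localized joints inform poorly localized (for example, occluded) ones.

```python
import numpy as np


def joint_relation_matrix(heatmaps: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise similarity between per-joint pseudo heatmaps.

    heatmaps: array of shape (K, H, W), one heatmap per joint.
    Returns a (K, K) matrix of cosine similarities (assumed measure,
    not confirmed by the paper's abstract).
    """
    k = heatmaps.shape[0]
    flat = heatmaps.reshape(k, -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, eps)  # guard against all-zero heatmaps
    return unit @ unit.T


def refine_with_relations(heatmaps: np.ndarray, relations: np.ndarray) -> np.ndarray:
    """Hypothetical refinement: each joint's heatmap becomes a
    softmax-weighted mixture of all joints' heatmaps, with weights
    taken from the corresponding row of the relation matrix.
    """
    k, h, w = heatmaps.shape
    shifted = relations - relations.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ heatmaps.reshape(k, -1)).reshape(k, h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pseudo_heatmaps = rng.random((13, 16, 16))  # 13 joints, 16x16 maps (made-up sizes)
    relations = joint_relation_matrix(pseudo_heatmaps)
    refined = refine_with_relations(pseudo_heatmaps, relations)
    print(relations.shape, refined.shape)
```

Because the relation matrix here is computed for every pair of joints, it is not limited to joints that are adjacent in the skeleton, which matches the abstract's claim that the JRE can relate any two joints. In the actual model the relations are learned; this sketch only shows the data flow.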

updated: Fri Jun 30 2023 09:52:30 GMT+0000 (UTC)

published: Thu Jul 08 2021 04:05:23 GMT+0000 (UTC)

arXiv
