3D Human Pose Estimation with Spatial and Temporal Transformers

Ce Zheng; Sijie Zhu; Matias Mendieta; Taojiannan Yang; Chen Chen; Zhengming Ding

Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures remain dominant. In this work, we present PoseFormer, a purely transformer-based approach to 3D human pose estimation in videos that involves no convolutional architectures. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure that comprehensively models the human joint relations within each frame as well as the temporal correlations across frames, and then outputs an accurate 3D human pose for the center frame. We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at https://github.com/zczcwh/PoseFormer
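The two-stage data flow described in the abstract (attention over joints within each frame, then attention over frames, then a pose for the center frame) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation; the official code is at the repository linked above. Everything here is an assumption made for illustration: the input is taken to be a clip of 2D joint coordinates, attention is single-head with random weights standing in for learned parameters, positional embeddings and feed-forward layers are omitted, and the center frame's token is read out directly. The names `lift_center_frame`, `self_attention`, `rand_matrix`, and `matmul` are hypothetical.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)


def lift_center_frame(clip, dim=4, seed=0):
    """Map a clip (frames x joints x 2) to a joints x 3 pose for the center frame."""
    rng = random.Random(seed)
    num_frames, num_joints = len(clip), len(clip[0])
    frame_dim = num_joints * dim

    embed = rand_matrix(2, dim, rng)                           # joint (x, y) -> token
    spatial = [rand_matrix(dim, dim, rng) for _ in range(3)]   # Wq, Wk, Wv over joints
    temporal = [rand_matrix(frame_dim, frame_dim, rng) for _ in range(3)]  # over frames
    head = rand_matrix(frame_dim, num_joints * 3, rng)         # frame token -> 3D pose

    # Spatial stage: within each frame, every joint token attends to every other joint.
    frame_tokens = []
    for frame in clip:
        joint_tokens = self_attention(matmul(frame, embed), *spatial)
        # Flatten the joint tokens into a single token that represents the frame.
        frame_tokens.append([v for token in joint_tokens for v in token])

    # Temporal stage: every frame token attends to every other frame in the clip.
    mixed = self_attention(frame_tokens, *temporal)

    # Read out the center frame only (a simplification chosen for this sketch).
    center = mixed[num_frames // 2]
    flat = matmul([center], head)[0]
    return [flat[3 * j:3 * j + 3] for j in range(num_joints)]


if __name__ == "__main__":
    # A synthetic clip: 9 frames, 17 joints, 2 coordinates per joint.
    clip = [[[0.1 * f + 0.01 * j, 0.2 * j] for j in range(17)] for f in range(9)]
    pose = lift_center_frame(clip)
    print(len(pose), len(pose[0]))
```

With untrained random weights the output values are meaningless; the sketch only shows how a clip of per-frame joint coordinates is reduced to one 3D coordinate triple per joint for the center frame.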

updated: Sun Aug 22 2021 00:15:03 GMT+0000 (UTC)

published: Thu Mar 18 2021 18:14:37 GMT+0000 (UTC)

arXiv
