Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

Sen Yang; Wen Heng; Gang Liu; Guozhong Luo; Wankou Yang; Gang Yu

In this paper, we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error feedback manner. In addition, video-based approaches model the overall change in image-level features to temporally enhance the single-frame feature, but fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model designed around independent tokens. First, we introduce three types of tokens that are independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model that focuses on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains a 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model.
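The token idea in the abstract can be made concrete with a toy sketch. The code below is not the authors' implementation: it is a minimal, pure-Python illustration in which image-independent tokens act as queries that attend over image features and are updated residually. The joint count of 24 (the standard SMPL joint count including the root), the embedding width, the 7x7 feature grid, the random initialization standing in for learned embeddings, and all function names are assumptions made for illustration. Real Transformer layers also include learned projections, multiple heads, normalization, and feed-forward blocks, all omitted here.

```python
import math
import random

NUM_JOINTS = 24  # assumed: standard SMPL joint count (incl. root), not stated in the abstract
DIM = 8          # toy embedding width, chosen only for illustration


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(tokens, features):
    """Each token (query) attends over the image features (keys and values)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    updated = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        context = [sum(w * f[d] for w, f in zip(weights, features))
                   for d in range(len(q))]
        # Residual update: the token keeps its prior and adds image evidence.
        updated.append([a + c for a, c in zip(q, context)])
    return updated


def make_tokens(rng):
    """Hypothetical stand-in for learned, image-independent token embeddings."""
    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    return {
        "joint_rotation": [vec() for _ in range(NUM_JOINTS)],
        "shape": [vec()],
        "camera": [vec()],
    }


def run_frame(tokens, features, num_layers=2):
    """Let all tokens interact with one frame's features over several layers."""
    flat = tokens["joint_rotation"] + tokens["shape"] + tokens["camera"]
    for _ in range(num_layers):
        flat = cross_attention(flat, features)
    return {
        "joint_rotation": flat[:NUM_JOINTS],
        "shape": flat[NUM_JOINTS:NUM_JOINTS + 1],
        "camera": flat[NUM_JOINTS + 1:],
    }


rng = random.Random(0)
tokens = make_tokens(rng)
# Toy "image features": 49 vectors, as if flattened from a 7x7 feature map (assumed).
features = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(49)]
out = run_frame(tokens, features)
print(len(out["joint_rotation"]), len(out["shape"]), len(out["camera"]))
```

Because each joint keeps its own token, the updated joint rotation tokens from consecutive frames could in principle be stacked into one sequence per joint and passed to a temporal model, which is the kind of per-joint temporal modeling the abstract describes. How the paper actually does this is not specified in the abstract.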

updated: Wed Mar 01 2023 07:48:01 GMT+0000 (UTC)

published: Wed Mar 01 2023 07:48:01 GMT+0000 (UTC)

arXiv
