Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

Xiaolong Shen; Zongxin Yang; Xiaohan Wang; Jianxin Ma; Chang Zhou; Yi Yang

Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics depend on different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and design their networks around a single, uniform modeling structure (e.g., RNN or attention-based blocks). However, a single kind of modeling structure makes it difficult to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy encourages the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer exploits local details on the human mesh and interacts with the global transformer through cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/sxl142/GLoT.
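The random frame masking behind the Masked Pose and Shape Estimation strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_frame_features`, the `mask_ratio` parameter, and the all-zero mask token are assumptions made for illustration only, and the abstract does not specify the masking ratio or the form of the mask token.

```python
import random


def mask_frame_features(features, mask_ratio=0.5, mask_token=None, seed=None):
    """Replace a random subset of per-frame feature vectors with a mask token.

    features: non-empty list of T frames, each a list of D floats.
    mask_ratio: fraction of frames to mask (hypothetical parameter).
    mask_token: vector of length D substituted for masked frames;
                defaults to all zeros (an assumption, not from the paper).
    Returns (masked_features, mask), where mask[t] is True if frame t was masked.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    dim = len(features[0])
    if mask_token is None:
        mask_token = [0.0] * dim

    # Choose which frames to hide, without replacement.
    num_masked = int(round(mask_ratio * num_frames))
    masked_idx = set(rng.sample(range(num_frames), num_masked))

    mask = [t in masked_idx for t in range(num_frames)]
    # Copy every frame so the caller's input is left untouched.
    masked = [
        list(mask_token) if mask[t] else list(frame)
        for t, frame in enumerate(features)
    ]
    return masked, mask


if __name__ == "__main__":
    # 16 frames with 4-dimensional features; frame t holds the value t + 1.
    frames = [[float(t + 1)] * 4 for t in range(16)]
    masked_frames, frame_mask = mask_frame_features(frames, mask_ratio=0.25, seed=0)
    print(sum(frame_mask), "of", len(frames), "frames masked")
```

In a training setup of this kind, a temporal model would receive the masked sequence and be asked to predict pose and shape for all frames, so the masked positions can only be recovered from the context provided by the other frames.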

updated: Sun Mar 26 2023 14:57:49 GMT+0000 (UTC)

published: Sun Mar 26 2023 14:57:49 GMT+0000 (UTC)

arXiv
