Weakly-supervised 3D Human Pose Estimation with Cross-view U-shaped Graph Convolutional Network

Guoliang Hua; Hong Liu; Wenhao Li; Qian Zhang; Runwei Ding; Xin Xu

Although monocular 3D human pose estimation methods have made significant progress, the problem is far from solved because of the inherent depth ambiguity. Exploiting multi-view information is instead a practical way to achieve absolute 3D human pose estimation. In this paper, we propose a simple yet effective pipeline for weakly-supervised cross-view 3D human pose estimation. Using only two camera views, our method achieves state-of-the-art performance in a weakly-supervised manner, requiring no 3D ground truth, only 2D annotations. Specifically, our method contains two steps: triangulation and refinement. First, given 2D keypoints that can be obtained with any classic 2D detection method, triangulation is performed across the two views to lift the 2D keypoints into coarse 3D poses. Then, a novel cross-view U-shaped graph convolutional network (CV-UGCN), which can explore spatial configurations and cross-view correlations, is designed to refine the coarse 3D poses. In particular, the refinement process is trained through weakly-supervised learning, in which geometric and structure-aware consistency checks are performed. We evaluate our method on the standard benchmark dataset Human3.6M. The Mean Per Joint Position Error (MPJPE) on this benchmark is 27.4 mm, which outperforms existing state-of-the-art methods (27.4 mm vs 30.2 mm).
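To make the first step and the reported metric concrete, here is a minimal sketch of two-view triangulation and of the Mean Per Joint Position Error. It rests on several assumptions. The abstract does not say which triangulation algorithm is used, so the sketch substitutes the simple midpoint method (the point halfway between the two closest points on the back-projected rays). The camera model is a distortion-free pinhole, and all function names and the `(f, c, R, t)` camera tuple are hypothetical. The CV-UGCN refinement step is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(X, f, c, R, t):
    """Assumed pinhole model: x_cam = R @ X + t, pixel = f * x_cam[:2] / x_cam[2] + c."""
    x_cam = [dot(R[i], X) + t[i] for i in range(3)]
    return [f * x_cam[0] / x_cam[2] + c[0], f * x_cam[1] / x_cam[2] + c[1]]

def backproject(uv, f, c, R, t):
    """Return (camera centre, ray direction) in world coordinates for one pixel."""
    d_cam = [(uv[0] - c[0]) / f, (uv[1] - c[1]) / f, 1.0]
    # Multiply by R^T to go from camera to world coordinates.
    direction = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    centre = [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    return centre, direction

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + u*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    s = (b * e - c * d) / denom
    u = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + u * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]

def lift_pose(kps1, kps2, cam1, cam2):
    """Lift matching 2D keypoints from two views into a coarse 3D pose."""
    pose = []
    for uv1, uv2 in zip(kps1, kps2):
        c1, d1 = backproject(uv1, *cam1)
        c2, d2 = backproject(uv2, *cam2)
        pose.append(triangulate_midpoint(c1, d1, c2, d2))
    return pose

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and reference joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
```

With noise-free 2D keypoints the two rays intersect and the midpoint recovers the 3D joint exactly; with detector noise the result is only a coarse estimate, which is the input the refinement network is meant to improve.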

updated: Tue May 17 2022 10:04:03 GMT+0000 (UTC)

published: Sun May 23 2021 08:16:25 GMT+0000 (UTC)

arXiv
