Conditional Directed Graph Convolution for 3D Human Pose Estimation

Wenbo Hu; Changgong Zhang; Fangneng Zhan; Lei Zhang; Tien-Tsin Wong

Graph convolutional networks have significantly improved 3D human pose estimation by representing the human skeleton as an undirected graph. However, this representation fails to reflect the articulated nature of the human skeleton, because the hierarchical order among the joints is not explicitly represented. In this paper, we propose to represent the human skeleton as a directed graph, with joints as nodes and bones as edges directed from parent joints to child joints. In this way, the edge directions explicitly reflect the hierarchical relationships among the nodes. Based on this representation, we adopt spatial-temporal directed graph convolution (ST-DGConv) to extract features from 2D poses represented as a temporal sequence of directed graphs. We further propose spatial-temporal conditional directed graph convolution (ST-CondDGConv), which conditions the graph topology on the input poses to leverage the varying non-local dependencies of different poses. Altogether, we form a U-shaped network with ST-DGConv and ST-CondDGConv layers, named U-shaped Conditional Directed Graph Convolutional Network (U-CondDGCN), for 3D human pose estimation from monocular videos. To evaluate the effectiveness of U-CondDGCN, we conducted extensive experiments on two challenging large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Both quantitative and qualitative results show that our method achieves top performance. In addition, ablation studies show that directed graphs exploit the hierarchy of articulated human skeletons better than undirected graphs, and that the conditional connections yield adaptive graph topologies for different kinds of poses.
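To make the directed-graph idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical 5-joint toy skeleton (`JOINT_NAMES`, `PARENTS`), builds a parent-to-child adjacency matrix, and applies one illustrative aggregation step in which each joint receives features from its parent. The function names, the weight matrices `w_self` and `w_parent`, and the similarity-based `conditional_adjacency` are all assumptions made for illustration; the abstract does not specify how ST-DGConv or ST-CondDGConv are actually formulated.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (an assumption, not the Human3.6M layout).
# PARENTS[j] is the index of joint j's parent; -1 marks the root.
JOINT_NAMES = ["pelvis", "spine", "head", "l_hip", "l_knee"]
PARENTS = [-1, 0, 1, 0, 3]


def directed_adjacency(parents):
    """Return A with A[p, c] = 1 when a bone points from parent p to child c."""
    n = len(parents)
    adj = np.zeros((n, n))
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[parent, child] = 1.0
    return adj


def directed_graph_conv(x, adj, w_self, w_parent):
    """One illustrative layer (assumed form): each joint mixes its own
    features with the features passed down from its parent."""
    from_parent = adj.T @ x  # row c holds the features of c's parent
    return x @ w_self + from_parent @ w_parent


def conditional_adjacency(x):
    """Assumed pose-dependent topology: row-wise softmax over feature
    similarity, so the edge weights change with the input pose."""
    scores = x @ x.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    adj = directed_adjacency(PARENTS)
    pose_2d = np.random.default_rng(0).normal(size=(len(PARENTS), 2))
    out = directed_graph_conv(pose_2d, adj, np.eye(2), np.eye(2))
    print(adj)
    print(out.shape)
```

Unlike the symmetric adjacency of an undirected graph, `adj` here is asymmetric: `adj[0, 1]` is 1 (pelvis to spine) while `adj[1, 0]` is 0, so information flows only down the hierarchy in this step. A full model would presumably also handle the child-to-parent direction and the temporal dimension, both of which are outside this sketch.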

updated: Fri Jul 16 2021 09:50:40 GMT+0000 (UTC)

published: Fri Jul 16 2021 09:50:40 GMT+0000 (UTC)

arXiv

