Unsupervised Temporal Learning on Monocular Videos for 3D Human Pose Estimation

Sina Honari; Victor Constantin; Helge Rhodin; Mathieu Salzmann; Pascal Fua

In this paper we propose an unsupervised learning method to extract temporal information from monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features and encouraging a gradual transition on them between nearby and distant frames, while also reconstructing the input, extracts rich temporal features into the time-variant component that are well suited for human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques.
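The central idea of the abstract, applying the contrastive loss only to the time-variant part of each latent vector, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the fixed split index `n_variant`, the cosine similarity, the temperature value, and the InfoNCE-style loss are all assumptions made for illustration, and the reconstruction term and the gradual-transition weighting described above are omitted.

```python
import math

def split_latent(z, n_variant):
    """Split a latent vector into (time-variant, time-invariant) parts.
    The split index n_variant is a hypothetical hyperparameter."""
    return z[:n_variant], z[n_variant:]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the CSS objective):
    small when the anchor is more similar to the positive than to the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def time_variant_css_loss(latents, t, near, far, n_variant):
    """Contrast frame t against a nearby frame (positive) and distant
    frames (negatives), using only the time-variant components."""
    anchor, _ = split_latent(latents[t], n_variant)
    positive, _ = split_latent(latents[near], n_variant)
    negatives = [split_latent(latents[i], n_variant)[0] for i in far]
    return contrastive_loss(anchor, positive, negatives)

# Toy per-frame latents: the first 2 entries change over time, the last 2 do not.
latents = [
    [1.0, 0.0, 5.0, 5.0],
    [0.9, 0.1, 5.0, 5.0],
    [0.0, 1.0, 5.0, 5.0],
    [-1.0, 0.0, 5.0, 5.0],
]
loss = time_variant_css_loss(latents, t=0, near=1, far=[2, 3], n_variant=2)
print(loss)
```

Because the loss only reads the first `n_variant` entries, changing the time-invariant entries leaves it unchanged. In this toy setup, that is how the contrastive term is kept from acting on information that should stay constant across frames, such as appearance.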

updated: Thu Apr 14 2022 13:42:57 GMT+0000 (UTC)

published: Wed Dec 02 2020 20:27:35 GMT+0000 (UTC)

arXiv
