DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Runyang Feng; Yixing Gao; Tze Ho Elden Tse; Xueqing Ma; Hyung Jin Chang

Denoising diffusion probabilistic models, initially proposed for realistic image generation, have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are gaining increasing attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial because videos add a temporal dimension. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints, yet it remains unclear how diffusion-based methods can be adapted to achieve this objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose a SpatioTemporal Representation Learner, which aggregates visual evidence across frames and uses the resulting features as a condition in each denoising step. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction, which determines the correlations between local joints and global contexts across multiple scales. This mechanism generates delicate representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics of DiffPose on the pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
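
The two characteristics named in the abstract, an adjustable number of iterative steps and the combination of multiple pose estimates, can be illustrated with a toy sketch. This is not the DiffPose network: `toy_denoise_step` is a hypothetical stand-in for a learned denoiser, the condition is assumed to be a heatmap-shaped grid rather than spatiotemporal features, and all function names and parameters are illustrative assumptions.

```python
import random


def toy_denoise_step(noisy, condition, strength=0.5):
    """Hypothetical stand-in for a learned denoiser (assumption, not DiffPose):
    moves every cell of the noisy heatmap a fraction of the way to the condition."""
    return [[n + strength * (c - n) for n, c in zip(n_row, c_row)]
            for n_row, c_row in zip(noisy, condition)]


def sample_heatmap(condition, num_steps, rng):
    """Start from Gaussian noise and refine it for num_steps iterations.
    num_steps is chosen at inference time; nothing is retrained."""
    height, width = len(condition), len(condition[0])
    heatmap = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    for _ in range(num_steps):
        heatmap = toy_denoise_step(heatmap, condition)
    return heatmap


def average_heatmaps(heatmaps):
    """Combine several heatmap estimates by cell-wise averaging."""
    count = len(heatmaps)
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(hm[y][x] for hm in heatmaps) / count for x in range(width)]
            for y in range(height)]


def decode_keypoint(heatmap):
    """Return the (x, y) location of the largest heatmap value."""
    _, x, y = max((v, x, y) for y, row in enumerate(heatmap)
                  for x, v in enumerate(row))
    return x, y


# Toy condition: a 4x5 grid with one strong response at x=2, y=1.
condition = [[0.0] * 5 for _ in range(4)]
condition[1][2] = 10.0

rng = random.Random(0)
# Draw three estimates from different noise seeds, then combine them.
estimates = [sample_heatmap(condition, num_steps=10, rng=rng) for _ in range(3)]
keypoint = decode_keypoint(average_heatmaps(estimates))
```

In this sketch, raising `num_steps` shrinks the residual noise in each estimate, and averaging several estimates reduces it further; how DiffPose actually combines its estimates is not specified in the abstract.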

updated: Mon Jul 31 2023 14:00:23 GMT+0000 (UTC)

published: Mon Jul 31 2023 14:00:23 GMT+0000 (UTC)

arXiv
