Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation

Yihan Wang; Muyang Li; Han Cai; Wei-Ming Chen; Song Han

Pose estimation plays a critical role in human-centered vision applications. However, state-of-the-art HRNet-based pose estimation models are difficult to deploy on resource-constrained edge devices because of their high computational cost (more than 150 GMACs per frame). In this paper, we study efficient architecture design for real-time multi-person pose estimation on the edge. Through gradual shrinking experiments, we show that HRNet's high-resolution branches are redundant for models in the low-computation region; removing them improves both efficiency and performance. Inspired by this finding, we design LitePose, an efficient single-branch architecture for pose estimation, and introduce two simple approaches to enhance its capacity: Fusion Deconv Head and Large Kernel Convs. Fusion Deconv Head removes the redundancy in high-resolution branches, allowing scale-aware feature fusion with low overhead. Large Kernel Convs significantly improve the model's capacity and receptive field while maintaining a low computational cost. With only a 25% increase in computation, 7x7 kernels achieve +14.0 mAP over 3x3 kernels on the CrowdPose dataset. On mobile platforms, LitePose reduces latency by up to 5.0x without sacrificing performance compared with prior state-of-the-art efficient pose estimation models, pushing the frontier of real-time multi-person pose estimation on the edge. Our code and pre-trained models are released at https://github.com/mit-han-lab/litepose.
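The claim that larger kernels add little computation can be illustrated with a back-of-the-envelope MAC count. The sketch below is not the authors' accounting: it assumes a depthwise-separable layer (a k x k depthwise convolution followed by a 1x1 pointwise convolution), which the abstract does not specify, and the feature-map size and channel count are hypothetical. It is not meant to reproduce the 25% figure quoted above, only to show why the kernel-size term can be a small share of the total cost in such a layer.

```python
def standard_conv_macs(height, width, in_channels, out_channels, kernel_size):
    """MACs of a dense k x k convolution (stride 1, 'same' padding, no bias)."""
    return height * width * in_channels * out_channels * kernel_size ** 2


def depthwise_separable_macs(height, width, in_channels, out_channels, kernel_size):
    """MACs of a k x k depthwise conv followed by a 1x1 pointwise conv."""
    positions = height * width
    depthwise = positions * in_channels * kernel_size ** 2  # one filter per channel
    pointwise = positions * in_channels * out_channels      # 1x1 channel mixing
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    h, w, channels = 64, 64, 128
    for name, count in [("standard", standard_conv_macs),
                        ("depthwise-separable", depthwise_separable_macs)]:
        small = count(h, w, channels, channels, 3)
        large = count(h, w, channels, channels, 7)
        print(f"{name}: 7x7 / 3x3 MAC ratio = {large / small:.2f}")
```

In the depthwise-separable case the pointwise term does not depend on the kernel size, so enlarging the kernel raises only the smaller depthwise term. In a dense convolution the cost instead scales with the square of the kernel size.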

updated: Mon Jul 11 2022 16:17:22 GMT+0000 (UTC)

published: Tue May 03 2022 02:08:04 GMT+0000 (UTC)

arXiv
