EfficientPose: Efficient Human Pose Estimation with Neural Architecture Search

Wenqiang Zhang; Jiemin Fang; Xinggang Wang; Wenyu Liu

Human pose estimation from images and videos is a vital task in many multimedia applications. Previous methods achieve strong performance but rarely take efficiency into consideration, which makes it difficult to deploy the networks on resource-constrained devices. Nowadays, real-time multimedia applications call for more efficient models for better interaction. Moreover, most deep neural networks for pose estimation directly reuse networks designed for image classification as the backbone, and these are not optimized for the pose estimation task. In this paper, we propose an efficient framework for human pose estimation that consists of two parts: an efficient backbone and an efficient head. By applying a differentiable neural architecture search method, we customize the backbone network design for pose estimation and reduce the computation cost with negligible accuracy degradation. For the efficient head, we slim the transposed convolutions and propose a spatial information correction module to improve the final prediction. In experiments, we evaluate our networks on the MPII and COCO datasets. Our smallest model requires only 0.65 GFLOPs and reaches 88.1% PCKh@0.5 on MPII, and our large model requires only 2 GFLOPs while remaining competitive in accuracy with the state-of-the-art large model, i.e., HRNet at 9.5 GFLOPs.
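For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how PCKh@0.5 is commonly computed: a predicted joint counts as correct when its distance to the ground-truth joint is at most 0.5 times the head segment length of that person. The function name `pckh`, its arguments, and the toy coordinates are assumptions made for illustration; this is not the authors' evaluation code, and it ignores joint visibility flags that a full benchmark evaluation would take into account.

```python
import math


def pckh(pred, gt, head_sizes, threshold=0.5):
    """Fraction of joints predicted within threshold * head size of the ground truth.

    pred, gt: lists of poses, each pose a list of (x, y) joint coordinates.
    head_sizes: one head segment length per pose, in the same units as the coordinates.
    Simplification (assumption): every joint is treated as annotated and visible.
    """
    correct = 0
    total = 0
    for pose_pred, pose_gt, head in zip(pred, gt, head_sizes):
        for (px, py), (gx, gy) in zip(pose_pred, pose_gt):
            dist = math.hypot(px - gx, py - gy)
            if dist <= threshold * head:
                correct += 1
            total += 1
    return correct / total if total else 0.0


# Toy example: one person, two joints, head size 10, so the tolerance is 5 pixels.
gt_poses = [[(10.0, 10.0), (20.0, 20.0)]]
pred_poses = [[(12.0, 10.0), (30.0, 20.0)]]  # errors of 2 and 10 pixels
score = pckh(pred_poses, gt_poses, head_sizes=[10.0])
```

In the toy example the first joint falls inside the tolerance and the second does not, so half of the joints count as correct.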

updated: Sun Dec 13 2020 15:38:38 GMT+0000 (UTC)

published: Sun Dec 13 2020 15:38:38 GMT+0000 (UTC)

arXiv
