Efficient Human Pose Estimation by Learning Deeply Aggregated Representations

Zhengxiong Luo; Zhicheng Wang; Yuanhao Cai; Guanan Wang; Yan Huang; Liang Wang; Erjin Zhou; Jian Sun

In this paper, we propose an efficient human pose estimation network (DANet) that learns deeply aggregated representations. Most existing models explore multi-scale information mainly from features with different spatial sizes. Powerful multi-scale representations usually rely on the cascaded pyramid framework. This framework substantially boosts performance but at the same time makes networks very deep and complex. Instead, we focus on exploiting multi-scale information from layers with different receptive-field sizes and then making full use of this information by improving the fusion method. Specifically, we propose an orthogonal attention block (OAB) and a second-order fusion unit (SFU). The OAB learns multi-scale information from different layers and enhances it by encouraging the information from each layer to be diverse. The SFU adaptively selects and fuses the diverse multi-scale information and suppresses the redundant parts. This could maximize the effective information in the final fused representations. With the help of the OAB and SFU, our single pyramid network may be able to generate deeply aggregated representations that contain even richer multi-scale information and have a larger representational capacity than those of cascaded networks. Thus, our networks could achieve comparable or even better accuracy with much lower model complexity. Specifically, our DANet-72 achieves an AP score of 70.5 on the COCO test-dev set with only 1.0G FLOPs. Its speed on a CPU platform reaches 58 Persons-Per-Second (PPS).
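The abstract does not give the OAB's formulation, so the following is only a minimal sketch of one common way to "encourage diversity" among feature or attention vectors: penalizing the squared cosine similarity between every pair. The function name `orthogonality_penalty` and the toy vectors are hypothetical and are not taken from the paper; the actual OAB may work quite differently.

```python
import math


def orthogonality_penalty(vectors):
    """Sum of squared cosine similarities over all distinct pairs of vectors.

    Illustrative assumption only, not the paper's OAB. The value is zero when
    the vectors are mutually orthogonal and grows as they become aligned.
    """
    # L2-normalize each vector so the penalty depends only on direction.
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("zero vector has no direction")
        unit.append([x / norm for x in v])

    # Accumulate squared cosine similarity for every unordered pair.
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty


# Hypothetical flattened attention maps from three layers.
diverse = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
redundant = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [1.0, 1.0, 0.1]]

print(orthogonality_penalty(diverse))    # 0.0: mutually orthogonal
print(orthogonality_penalty(redundant))  # much larger: nearly parallel vectors
```

Added to a training loss, a term of this kind would push representations drawn from different layers toward different directions, which is one plausible reading of "encouraging them to be diverse".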

updated: Sun Dec 13 2020 10:58:07 GMT+0000 (UTC)

published: Sun Dec 13 2020 10:58:07 GMT+0000 (UTC)

arXiv
