Learning Heatmap-Style Jigsaw Puzzles Provides Good Pretraining for 2D Human Pose Estimation

Kun Zhang; Rui Wu; Ping Yao; Kai Deng; Ding Li; Renbiao Liu; Chuanguang Yang; Ge Chen; Min Du; Tianyao Zheng

The goal of 2D human pose estimation is to locate the keypoints of body parts in input 2D images. State-of-the-art pose estimation methods usually construct pixel-wise heatmaps from keypoints as labels for training convolutional neural networks, which are usually either initialized randomly or built on ImageNet classification models as backbones. We note that the 2D pose estimation task depends heavily on the contextual relationships between image patches, and we therefore introduce a self-supervised method for pretraining 2D pose estimation networks. Specifically, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as our pretext task, whose goal is to learn the location of each patch in an image composed of shuffled patches. During pretraining, we use only images of person instances in MS-COCO, rather than introducing the extra and much larger ImageNet dataset. We design a heatmap-style label for patch location, and our learning process is non-contrastive. The weights learned on the HSJP pretext task are used as the backbone of a 2D human pose estimator, which is then finetuned on the MS-COCO human keypoints dataset. With two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate mAP on both the MS-COCO validation and test-dev sets. Our experiments show that downstream pose estimators with our self-supervised pretraining perform much better than those trained from scratch, and are comparable to those using ImageNet classification models as their initial backbones.
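The abstract does not spell out how the heatmap-style labels are built, so the following is only a minimal sketch of one plausible reading of the pretext task, not the authors' implementation. It assumes a square grid of patches and one target channel per original grid cell, with a Gaussian peak at the centre of the cell where that patch ends up after shuffling. The function name `make_hsjp_sample` and the parameters `grid` and `sigma` are hypothetical.

```python
import numpy as np


def make_hsjp_sample(image, grid=3, sigma=2.0, rng=None):
    """Shuffle the patches of an image and build one heatmap per patch.

    Assumption (not taken from the paper): channel k of the target has a
    Gaussian peak at the centre of the cell that now holds the patch which
    originally came from cell k.
    """
    rng = np.random.default_rng() if rng is None else rng
    ph, pw = image.shape[0] // grid, image.shape[1] // grid
    # Crop so the image divides evenly into grid x grid patches.
    image = image[: ph * grid, : pw * grid]
    n = grid * grid
    # perm[dst] = src: destination cell dst receives the patch from cell src.
    perm = rng.permutation(n)

    shuffled = np.empty_like(image)
    heatmaps = np.zeros((n, ph * grid, pw * grid), dtype=np.float32)
    ys, xs = np.mgrid[0 : ph * grid, 0 : pw * grid]

    for dst, src in enumerate(perm):
        sr, sc = divmod(int(src), grid)  # source cell (row, col)
        dr, dc = divmod(dst, grid)       # destination cell (row, col)
        shuffled[dr * ph : (dr + 1) * ph, dc * pw : (dc + 1) * pw] = image[
            sr * ph : (sr + 1) * ph, sc * pw : (sc + 1) * pw
        ]
        # Gaussian centred on the destination cell, stored in channel src.
        cy, cx = dr * ph + ph // 2, dc * pw + pw // 2
        heatmaps[src] = np.exp(
            -((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2)
        )
    return shuffled, heatmaps, perm
```

Under these assumptions, a network with a heatmap output head could be trained to regress `heatmaps` from `shuffled` with a pixel-wise loss, which mirrors the form of keypoint heatmap regression used in the downstream pose estimators.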

updated: Sun Dec 13 2020 17:04:29 GMT+0000 (UTC)

published: Sun Dec 13 2020 17:04:29 GMT+0000 (UTC)

arXiv
