RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild

Rafael Berral-Soler; Francisco J. Madrid-Cuevas; Rafael Muñoz-Salinas; Manuel J. Marín-Jiménez

Human head pose estimation in images has applications in many fields, such as human-computer interaction and video surveillance. In this work, we address this problem, defined here as the estimation of both the vertical (tilt/pitch) and horizontal (pan/yaw) angles, using a single Convolutional Neural Network (ConvNet) model, aiming to balance precision and inference speed in order to maximize its usability in real-world applications. Our model is trained on the combination of two datasets: 'Pointing'04' (to cover a wide range of poses) and 'Annotated Facial Landmarks in the Wild' (to improve the robustness of the model on real-world images). Three different partitions of the combined dataset are defined and used for training, validation and testing. The result of this work is a trained ConvNet model, coined RealHePoNet, that takes a low-resolution grayscale input image and, without using facial landmarks, estimates both tilt and pan angles with low error (~4.4° average error on the test partition). Given its low inference time (~6 ms per head), we consider our model usable even when paired with medium-spec hardware (e.g., a GTX 1060 GPU).

Code available at: https://github.com/rafabs97/headpose_final

Demo video at: https://www.youtube.com/watch?v=2UeuXh5DjAE
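
To make the two headline figures concrete, the following is a minimal sketch of how an average angular error over tilt and pan might be computed, and what ~6 ms per head implies for throughput. It assumes the average error is a mean absolute error per angle, averaged over both angles; the abstract does not state the exact metric definition. The function names and sample values are hypothetical and are not taken from the authors' repository.

```python
from typing import Sequence, Tuple

# A pose is a (tilt, pan) pair in degrees. This representation is an
# assumption made for illustration only.
Pose = Tuple[float, float]


def mean_absolute_error(
    predicted: Sequence[Pose], actual: Sequence[Pose]
) -> Tuple[float, float, float]:
    """Return (tilt MAE, pan MAE, mean of both) in degrees."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    tilt_mae = sum(abs(p[0] - a[0]) for p, a in zip(predicted, actual)) / n
    pan_mae = sum(abs(p[1] - a[1]) for p, a in zip(predicted, actual)) / n
    return tilt_mae, pan_mae, (tilt_mae + pan_mae) / 2


def heads_per_second(ms_per_head: float) -> float:
    """Convert a per-head inference time in milliseconds to heads per second."""
    if ms_per_head <= 0:
        raise ValueError("ms_per_head must be positive")
    return 1000.0 / ms_per_head


# Made-up predictions and ground truth, for illustration only.
predicted = [(10.0, -30.0), (0.0, 15.0)]
actual = [(15.0, -30.0), (-3.0, 20.0)]
tilt_mae, pan_mae, overall = mean_absolute_error(predicted, actual)
print(f"tilt MAE: {tilt_mae:.2f}, pan MAE: {pan_mae:.2f}, mean: {overall:.2f}")
print(f"throughput at 6 ms per head: {heads_per_second(6.0):.1f} heads/s")
```

Under this reading, ~6 ms per head corresponds to roughly 166 heads per second for the pose estimation step alone, not counting any head detection stage that may precede it.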

updated: Tue Nov 03 2020 18:09:05 GMT+0000 (UTC)

published: Tue Nov 03 2020 18:09:05 GMT+0000 (UTC)

arXiv
