An Efficient Multitask Neural Network for Face Alignment, Head Pose Estimation and Face Tracking

Jiahao Xia; Haimin Zhang; Shiping Wen; Shuo Yang; Min Xu

While convolutional neural networks (CNNs) have significantly boosted the performance of face-related algorithms, maintaining accuracy and efficiency simultaneously in practical use remains challenging. Recent studies show that a cascade of hourglass modules, each consisting of several bottom-up and top-down convolutional layers, can extract facial structural information and thereby improve face alignment accuracy. However, previous studies have also shown that features produced by shallow convolutional layers correspond closely to edges. These features could be used directly to provide structural information at no additional cost. Motivated by this intuition, we propose an efficient multitask network for face alignment, face tracking and head pose estimation (ATPN). Specifically, we introduce a shortcut connection between shallow-layer and deep-layer features to provide structural information for face alignment, and we apply CoordConv to the last few layers to provide coordinate information. The predicted facial landmarks enable us to generate a cheap heatmap that contains both geometric and appearance information for head pose estimation and also provides attention clues for face tracking. Moreover, the face tracking task removes the need to run face detection on every frame, which is important for improving the performance of video-based tasks. The proposed framework is evaluated on four benchmark datasets: WFLW, 300VW, WIDER Face and 300W-LP. The experimental results show that ATPN achieves improved performance compared with previous state-of-the-art methods while using fewer parameters and FLOPs.
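
The abstract mentions applying CoordConv to the last few layers to provide coordinate information. As a rough illustration of the general CoordConv idea (appending normalized x and y coordinate channels to a feature map before a convolution), here is a minimal pure-Python sketch. The function name `add_coord_channels`, the nested-list layout `[channel][row][col]`, and the [-1, 1] normalization are assumptions made for this example; the abstract does not describe the authors' actual implementation.

```python
def add_coord_channels(feature_map):
    """Append normalized x and y coordinate channels to a feature map.

    feature_map is a nested list indexed as [channel][row][col].
    This layout and the [-1, 1] range are assumptions for illustration.
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    def normalized(index, size):
        # Map indices 0..size-1 onto [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    # x varies along columns and is constant down each column.
    x_channel = [[normalized(col, width) for col in range(width)]
                 for _ in range(height)]
    # y varies along rows and is constant across each row.
    y_channel = [[normalized(row, height) for _ in range(width)]
                 for row in range(height)]

    # The original channels are kept; two coordinate channels are appended.
    return feature_map + [x_channel, y_channel]


features = [[[0.5] * 3 for _ in range(2)]]  # one channel of size 2 x 3
augmented = add_coord_channels(features)
print(len(augmented))   # 3 channels: 1 original + x + y
print(augmented[1][0])  # [-1.0, 0.0, 1.0]
print(augmented[2])     # [[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]
```

A convolution applied to the augmented map can then condition on pixel position, which is the kind of coordinate information that helps when regressing landmark locations.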

updated: Sat Mar 13 2021 04:41:15 GMT+0000 (UTC)

published: Sat Mar 13 2021 04:41:15 GMT+0000 (UTC)

arXiv
