Knowing When to Quit: Selective Cascaded Regression with Patch Attention for Real-Time Face Alignment

Gil Shapira; Noga Levy; Ishay Goldin; Roy J. Jevnisek

Facial landmark (FLM) estimation is a critical component in many face-related applications. In this work, we aim to optimize for both accuracy and speed and explore the trade-off between them. Our key observation is that not all faces are created equal: frontal faces with neutral expressions converge faster than faces with extreme poses or expressions. To differentiate among samples, we train our model to predict the regression error after each iteration. If the current iteration is accurate enough, we stop iterating, skipping redundant iterations while keeping the accuracy in check. We also observe that, because neighboring patches overlap, we can infer all facial landmarks with only a small number of patches without a major sacrifice in accuracy. Architecturally, we offer a multi-scale, patch-based, lightweight feature extractor with a fine-grained local patch attention module, which computes a patch weighting according to the information in the patch itself and enhances the expressive power of the patch features. We analyze the patch attention data to infer where the model is attending when regressing facial landmarks and compare it to face attention in humans. Our model runs in real time on a mobile device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming all state-of-the-art methods under 1000 MMA, with a normalized mean error of 8.16 on the 300W challenging dataset.
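
The early-exit idea in the abstract (predict the regression error after each iteration and stop once it is low enough) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `cascaded_regression`, `make_toy_step`, `max_iters`, and `error_threshold` are hypothetical, and the toy stage below stands in for the learned regressor and its error predictor.

```python
import numpy as np


def cascaded_regression(init_landmarks, step_fn, max_iters=4, error_threshold=0.05):
    """Refine landmarks iteratively and quit once the predicted error is small.

    step_fn(landmarks, iteration) -> (delta, predicted_error) is a hypothetical
    stand-in for one cascade stage that returns a landmark update together
    with its own estimate of the remaining regression error.
    """
    landmarks = np.asarray(init_landmarks, dtype=float)
    iters_used = 0
    for i in range(max_iters):
        delta, predicted_error = step_fn(landmarks, i)
        landmarks = landmarks + delta
        iters_used = i + 1
        # Selective exit: skip the remaining stages for samples that
        # the model already considers accurate enough.
        if predicted_error <= error_threshold:
            break
    return landmarks, iters_used


def make_toy_step(target):
    """Build a toy cascade stage (assumption, for illustration only).

    It moves halfway toward a known target and reports the true remaining
    error. A real model has no ground truth at inference time and would
    predict this error from image features instead.
    """
    target = np.asarray(target, dtype=float)

    def step(landmarks, iteration):
        delta = 0.5 * (target - landmarks)
        residual = target - (landmarks + delta)
        remaining = float(np.mean(np.linalg.norm(residual, axis=-1)))
        return delta, remaining

    return step


if __name__ == "__main__":
    target = np.zeros((68, 2))  # 68 two-dimensional landmarks
    step = make_toy_step(target)
    _, n_easy = cascaded_regression(target + 0.05, step)  # starts near the target
    _, n_hard = cascaded_regression(target + 1.0, step)   # starts far from the target
    print("easy sample iterations:", n_easy)
    print("hard sample iterations:", n_hard)
```

In this toy setup, a sample that starts close to the target exits after a single stage, while a sample that starts far away uses the full iteration budget, which mirrors the claim that frontal, neutral faces converge faster than faces with extreme poses or expressions.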

updated: Sun Aug 01 2021 06:51:47 GMT+0000 (UTC)

published: Sun Aug 01 2021 06:51:47 GMT+0000 (UTC)

arXiv
