Facial Video-based Remote Physiological Measurement via Self-supervised Learning

Zijie Yue; Miaojing Shi; Shuai Ding

Facial video-based remote physiological measurement aims to estimate remote photoplethysmography (rPPG) signals from human face videos and then measure multiple vital signs (e.g., heart rate, respiration frequency) from the rPPG signals. Recent approaches achieve this by training deep neural networks, which normally require abundant facial videos and synchronously recorded photoplethysmography (PPG) signals for supervision. However, collecting these annotated corpora is not easy in practice. In this paper, we introduce a novel frequency-inspired self-supervised framework that learns to estimate rPPG signals from facial videos without the need for ground truth PPG signals. Given a video sample, we first augment it into multiple positive/negative samples, which contain signal frequencies similar/dissimilar to those of the original. Specifically, positive samples are generated using spatial augmentation. Negative samples are generated via a learnable frequency augmentation module, which performs a non-linear signal frequency transformation on the input without excessively changing its visual appearance. Next, we introduce a local rPPG expert aggregation module to estimate rPPG signals from the augmented samples. It encodes complementary pulsation information from different face regions and aggregates it into one rPPG prediction. Finally, we propose a series of frequency-inspired losses, i.e., a frequency contrastive loss, a frequency ratio consistency loss, and a cross-video frequency agreement loss, to optimize the estimated rPPG signals from multiple augmented video samples and across temporally neighboring video samples. We conduct rPPG-based heart rate, heart rate variability, and respiration frequency estimation on four standard benchmarks. The experimental results demonstrate that our method improves the state of the art by a large margin.
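To make the step from an rPPG signal to a vital sign concrete, the following is a minimal sketch of one common way to read a heart rate off a pulse-like signal: take the strongest spectral peak inside a plausible heart-rate band. It is not the method of the paper, whose contribution is the self-supervised estimation of the rPPG signal itself. The function name `estimate_heart_rate_bpm`, the 0.7 to 3.0 Hz band, and the synthetic test signal are all assumptions made for illustration.

```python
import math


def estimate_heart_rate_bpm(signal, fs, lo_hz=0.7, hi_hz=3.0):
    """Return the dominant in-band frequency of `signal` in beats per minute.

    `fs` is the sampling rate in Hz (for video, the frame rate).
    The band limits are illustrative assumptions, not values from the paper.
    """
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset

    best_freq, best_power = 0.0, -1.0
    # DFT bin k corresponds to the frequency k * fs / n.
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if freq < lo_hz or freq > hi_hz:
            continue
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0


# Synthetic example: a 1.2 Hz pulse (72 beats per minute) sampled at 30 fps
# for 10 s, plus a slower 0.25 Hz respiration-like component.
fs = 30.0
n = 300
pulse = [
    math.sin(2 * math.pi * 1.2 * i / fs) + 0.5 * math.sin(2 * math.pi * 0.25 * i / fs)
    for i in range(n)
]
bpm = estimate_heart_rate_bpm(pulse, fs)
print(round(bpm, 1))
```

The direct DFT loop keeps the sketch dependency-free; a real pipeline would more likely use an FFT routine and band-pass filtering before peak picking.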

updated: Sat Jul 22 2023 07:21:11 GMT+0000 (UTC)

published: Thu Oct 27 2022 13:03:23 GMT+0000 (UTC)

arXiv
