Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices

Cho-Ying Wu; Ke Xu; Chin-Cheng Hsu; Ulrich Neumann

This work analyzes whether 3D face models can be learned from speakers' speech inputs alone. Previous work on cross-modal face synthesis studies image generation from voices. However, image synthesis involves variations such as hairstyles, backgrounds, and facial textures that are arguably irrelevant to voice, or whose correlation with voice has not been directly studied. We instead investigate the ability to reconstruct 3D faces, concentrating only on geometry, which is more physiologically grounded. We propose both supervised and unsupervised learning frameworks. In particular, we demonstrate how unsupervised learning is possible without a direct voice-to-3D-face dataset and with limited availability of 3D face scans, provided the model is equipped with knowledge distillation. To evaluate performance, we also propose several metrics that measure the geometric fitness of two 3D faces based on points, lines, and regions. Experimental results suggest that 3D face shapes can be reconstructed from voices and that our method improves performance over the baseline. The best performance gains (15% to 20%), on the ear-to-ear distance ratio metric (ER), coincide with the intuition that one can roughly envision whether a speaker's face is overall wider or thinner from the voice alone. See our project page for code and data.
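The abstract names an ear-to-ear distance ratio metric (ER) but does not define it. As a rough illustration only, the sketch below assumes ER is the ear-to-ear distance divided by a reference distance (here, the distance between the outer eye corners), and that the error is the absolute difference between the predicted and reference ratios. The landmark names, the choice of reference distance, and all coordinates are hypothetical and are not taken from the paper.

```python
import math

def distance_ratio(p_left, p_right, ref_left, ref_right):
    """Distance between one landmark pair, normalized by a reference pair.

    Assumption: a ratio metric like ER compares a face width to a
    reference length so that the result is scale-invariant.
    """
    ref = math.dist(ref_left, ref_right)
    if ref == 0:
        raise ValueError("reference landmarks coincide")
    return math.dist(p_left, p_right) / ref

def absolute_ratio_error(pred_ratio, ref_ratio):
    """Absolute difference between predicted and reference ratios."""
    return abs(pred_ratio - ref_ratio)

# Hypothetical 3D landmarks (x, y, z) for a predicted and a reference mesh.
predicted = {
    "ear_l": (-7.0, 0.0, 0.0), "ear_r": (7.0, 0.0, 0.0),
    "eye_l": (-4.0, 1.0, 2.0), "eye_r": (4.0, 1.0, 2.0),
}
reference = {
    "ear_l": (-8.0, 0.0, 0.0), "ear_r": (8.0, 0.0, 0.0),
    "eye_l": (-4.0, 1.0, 2.0), "eye_r": (4.0, 1.0, 2.0),
}

er_pred = distance_ratio(predicted["ear_l"], predicted["ear_r"],
                         predicted["eye_l"], predicted["eye_r"])
er_ref = distance_ratio(reference["ear_l"], reference["ear_r"],
                        reference["eye_l"], reference["eye_r"])
error = absolute_ratio_error(er_pred, er_ref)
```

Under these assumptions, a predicted face that is narrower than the reference yields a smaller ratio, so the error reflects the "wider or thinner" difference the abstract describes.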

updated: Wed Apr 21 2021 01:14:50 GMT+0000 (UTC)

published: Wed Apr 21 2021 01:14:50 GMT+0000 (UTC)

arXiv

