Data standardization for robust lip sync

Chun Wang

Lip sync is a fundamental audio-visual task. However, existing lip sync methods are not robust to the wide diversity of videos taken in the wild. Most of this diversity is caused by compound distracting factors that can degrade their performance. To address these issues, this paper proposes a data standardization pipeline that produces standardized expressive images, preserving lip motion information from the input while reducing the effects of compound distracting factors. Building on recent advances in 3D face reconstruction, we first create a model that consistently disentangles expressions, in which lip motion information is embedded. Then, to reduce the effects of compound distracting factors on the synthesized images, we synthesize images using only the expressions from the input, intentionally setting all other attributes to predefined values independent of the input. With the synthesized images, existing lip sync methods improve in data efficiency and robustness and achieve competitive performance on the active speaker detection task.
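
The standardization step described in the abstract (keep only the expression, fix everything else) can be illustrated with a minimal sketch. This is not the paper's implementation: the `FaceParams` container, the `standardize` function, the attribute names, and the coefficient dimensions are all hypothetical, chosen only to resemble a generic parametric 3D face model. The reconstruction network and the renderer are omitted entirely.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True, eq=False)
class FaceParams:
    """Hypothetical disentangled face attributes (names and sizes are assumptions)."""
    identity: np.ndarray    # face shape of the speaker
    expression: np.ndarray  # expression coefficients, carrying lip motion
    pose: np.ndarray        # head rotation and translation
    texture: np.ndarray     # skin appearance
    lighting: np.ndarray    # illumination coefficients


# Predefined values, independent of any input frame (illustrative zeros).
CANONICAL = FaceParams(
    identity=np.zeros(80),
    expression=np.zeros(64),
    pose=np.zeros(6),
    texture=np.zeros(80),
    lighting=np.zeros(27),
)


def standardize(params: FaceParams, canonical: FaceParams = CANONICAL) -> FaceParams:
    """Keep only the input's expression; take every other attribute from `canonical`."""
    return replace(canonical, expression=params.expression.copy())


# Two frames with the same expression but different distracting factors.
rng = np.random.default_rng(0)
shared_expression = rng.normal(size=64)
frame_a = FaceParams(rng.normal(size=80), shared_expression, rng.normal(size=6),
                     rng.normal(size=80), rng.normal(size=27))
frame_b = FaceParams(rng.normal(size=80), shared_expression, rng.normal(size=6),
                     rng.normal(size=80), rng.normal(size=27))

std_a = standardize(frame_a)
std_b = standardize(frame_b)
```

Under these assumptions, `std_a` and `std_b` hold identical parameters even though the input frames differ in identity, pose, texture, and lighting, so a renderer fed with them would produce the same standardized image for the same lip motion.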

updated: Mon Jan 23 2023 14:50:26 GMT+0000 (UTC)

published: Sun Feb 13 2022 04:09:21 GMT+0000 (UTC)

arXiv
