DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video

Zhimeng Zhang; Zhipeng Hu; Wenjin Deng; Changjie Fan; Tangjie Lv; Yu Ding

In few-shot learning, realizing photo-realistic face visually dubbing on high-resolution videos remains a critical challenge, and previous works fail to generate high-fidelity dubbing results. To address this problem, this paper proposes a Deformation Inpainting Network (DINet) for high-resolution face visually dubbing. Unlike previous works, which rely on multiple up-sample layers to generate pixels directly from latent embeddings, DINet performs spatial deformation on the feature maps of reference images to better preserve high-frequency textural details. Specifically, DINet consists of one deformation part and one inpainting part. In the first part, the feature maps of five reference facial images are adaptively spatially deformed to create deformed feature maps that encode the mouth shape at each frame, so that they align with the input driving audio and with the head poses of the input source images. In the second part, a feature decoder adaptively combines the mouth movements from the deformed feature maps with the other attributes (i.e., head pose and upper facial expression) from the source feature maps to produce the face visually dubbing result. As a result, DINet achieves face visually dubbing with rich textural details. We conduct qualitative and quantitative comparisons to validate DINet on high-resolution videos. The experimental results show that our method outperforms state-of-the-art works.
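
The two-part idea in the abstract (deform reference feature maps, then merge them with source feature maps) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the deformation operator or the decoder, so the sketch assumes a hand-set affine warp with bilinear sampling for the deformation part and a fixed mask blend as a stand-in for the learned feature decoder. All names (`bilinear_sample`, `affine_warp`, `merge_features`, `mouth_mask`) are hypothetical, and in the described method the deformation would be driven by the audio and the source head pose rather than fixed by hand.

```python
import math
from typing import List, Sequence

FeatureMap = List[List[float]]  # a single channel, indexed [row][col]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """Sample fmap at a continuous (x, y); positions outside the map read as 0."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    total = 0.0
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < w and 0 <= yi < h:
                total += wx * wy * fmap[yi][xi]
    return total


def affine_warp(fmap: FeatureMap, theta: Sequence[float]) -> FeatureMap:
    """Assumed deformation operator: output pixel (x, y) reads the input at
    (a*x + b*y + tx, c*x + d*y + ty), with theta = (a, b, tx, c, d, ty)."""
    a, b, tx, c, d, ty = theta
    h, w = len(fmap), len(fmap[0])
    return [
        [bilinear_sample(fmap, a * x + b * y + tx, c * x + d * y + ty)
         for x in range(w)]
        for y in range(h)
    ]


def merge_features(source: FeatureMap, deformed: FeatureMap,
                   mouth_mask: FeatureMap) -> FeatureMap:
    """Stand-in for the learned decoder: take deformed features where the
    mask is 1 (mouth region) and source features where it is 0."""
    return [
        [m * dv + (1.0 - m) * sv for sv, dv, m in zip(srow, drow, mrow)]
        for srow, drow, mrow in zip(source, deformed, mouth_mask)
    ]


# Toy 3x3 "reference" feature map, shifted one pixel to the left.
reference = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
shifted = affine_warp(reference, (1, 0, 1, 0, 1, 0))

# Toy "source" features; the bottom row plays the role of the mouth region.
source = [[0.0] * 3 for _ in range(3)]
mouth_mask = [[0.0] * 3, [0.0] * 3, [1.0] * 3]
merged = merge_features(source, shifted, mouth_mask)
```

Because the warp only resamples existing feature values instead of synthesizing them from a latent code, the values in `shifted` are copies or interpolations of those in `reference`, which is the intuition behind the abstract's claim about preserving high-frequency detail.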

updated: Tue Mar 07 2023 15:39:54 GMT+0000 (UTC)

published: Tue Mar 07 2023 15:39:54 GMT+0000 (UTC)

arXiv
