Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors

Sindhu B Hegde; Rudrabha Mukhopadhyay; Vinay P Namboodiri; C. V. Jawahar

In this paper, we explore an interesting question: what can be obtained from an 8×8 pixel video sequence? Surprisingly, it turns out to be quite a lot. We show that when we process this 8×8 video with the right set of audio and image priors, we can obtain a full-length, 256×256 video. We achieve this 32× scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video, which can then be used to animate a single target identity image and generate realistic, accurate, and high-quality outputs. Our approach is simple and performs exceedingly well (an 8× improvement in FID score) compared to previous super-resolution methods. We also extend our model to talking-face video compression and show that we obtain a 3.5× improvement in terms of bits/pixel over the previous state-of-the-art. The results from our network are thoroughly analyzed through extensive ablation experiments (in the paper and supplementary material). We also provide a demo video, along with code and models, on our website: http://cvit.iiit.ac.in/research/projects/cvit-projects/talking-face-video-upsampling.
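To make the 32× figure concrete, the sketch below works through the arithmetic for square frames and applies a naive nearest-neighbor upsampler as a point of comparison. It is illustrative only and is not the audio-visual network described in the abstract: the function names `scale_factors` and `nearest_neighbor_upsample` and the toy 8×8 frame are assumptions introduced here.

```python
def scale_factors(low_res, high_res):
    """Return (per-side scale, pixel-count ratio) for square frames."""
    if high_res % low_res != 0:
        raise ValueError("high_res must be an integer multiple of low_res")
    side = high_res // low_res
    return side, side * side


def nearest_neighbor_upsample(frame, factor):
    """Naive baseline (hypothetical helper, not the paper's method):
    repeat every pixel `factor` times horizontally and vertically."""
    out = []
    for row in frame:
        # Stretch the row horizontally.
        wide = [pixel for pixel in row for _ in range(factor)]
        # Repeat the stretched row vertically (copy so rows are independent).
        out.extend(list(wide) for _ in range(factor))
    return out


# 8x8 input and 256x256 output, the sizes stated in the abstract.
side, pixels = scale_factors(8, 256)

# Toy single-channel frame with distinct values 0..63 (made-up data).
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
up = nearest_neighbor_upsample(frame, side)

print(side, pixels, len(up), len(up[0]))  # 32 1024 256 256
```

Each input pixel therefore has to account for 1024 output pixels. A purely spatial interpolator like the one above can only copy or blend existing values, which is consistent with the abstract's reliance on audio and identity-image priors to supply the missing detail.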

updated: Wed Aug 17 2022 07:19:40 GMT+0000 (UTC)

published: Wed Aug 17 2022 07:19:40 GMT+0000 (UTC)

arXiv
