Cross-modal Face- and Voice-style Transfer

Naoya Takahashi; Mayank K. Singh; Yuki Mitsufuji

Image-to-image translation and voice conversion enable the generation of a new facial image and a new voice while preserving some of the semantics, such as the pose in an image and the linguistic content in audio, respectively. They can aid the content-creation process in many applications. However, because each is limited to conversion within its own modality, matching the impression of the generated face and voice remains an open problem. We propose XFaVoT, a cross-modal style transfer framework that jointly learns four tasks within a single framework: image translation and voice conversion, each with either audio or image guidance. This covers the two intra-modality translation tasks and also enables cross-modal generation of "a face that matches a given voice" and "a voice that matches a given face". Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.

updated: Wed Mar 01 2023 14:50:41 GMT+0000 (UTC)

published: Mon Feb 27 2023 14:39:50 GMT+0000 (UTC)

arXiv
