V2C: Visual Voice Cloning

Qi Chen; Yuanqing Li; Yuankai Qi; Jiaqiu Zhou; Mingkui Tan; Qi Wu

Existing Voice Cloning (VC) tasks aim to convert a paragraph of text into speech with a desired voice specified by reference audio. This has significantly boosted the development of artificial speech applications. However, many scenarios are not well captured by these VC tasks, such as movie dubbing, which requires the speech to carry emotions consistent with the movie plot. To fill this gap, we propose a new task named Visual Voice Cloning (V2C), which seeks to convert a paragraph of text into speech with both a desired voice specified by reference audio and a desired emotion specified by a reference video. To facilitate research in this field, we construct a dataset, V2C-Animation, and propose a strong baseline based on existing state-of-the-art (SoTA) VC techniques. Our dataset contains 10,217 animated movie clips covering a large variety of genres (e.g., Comedy, Fantasy) and emotions (e.g., happy, sad). We further design a set of evaluation metrics, named MCD-DTW-SL, which help evaluate the similarity between ground-truth speech and synthesised speech. Extensive experimental results show that even SoTA VC methods cannot generate satisfying speech for our V2C task. We hope the proposed new task, together with the constructed dataset and evaluation metrics, will facilitate research in the field of voice cloning and the broader vision-and-language community.
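
The abstract names the MCD-DTW-SL metric without defining it. As a rough illustration only, the sketch below assumes the name expands to Mel-cepstral distortion (MCD) computed along a dynamic time warping (DTW) alignment and then scaled by a speech-length (SL) ratio. The function names, the averaging over the alignment path, and the `max/min` length coefficient are all assumptions made for this example, not the authors' stated definition. Inputs are lists of mel-cepstral frames, conventionally with the energy coefficient excluded.

```python
import math

# Conventional constant that expresses a cepstral distance in decibels.
MCD_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_distance(a, b):
    """Mel-cepstral distortion (dB) between two cepstral frames."""
    return MCD_CONST * math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mcd_dtw(ref, syn):
    """Mean per-step MCD along the cheapest DTW alignment path.

    Averaging over the path length is an assumption of this sketch.
    """
    n, m = len(ref), len(syn)
    if n == 0 or m == 0:
        raise ValueError("both sequences need at least one frame")
    inf = float("inf")
    # cost[i][j] = (accumulated distortion, path length) for ref[:i], syn[:j]
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], syn[j - 1])
            # Cheapest predecessor: insertion, deletion, or match.
            prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = (prev[0] + d, prev[1] + 1)
    total, steps = cost[n][m]
    return total / steps


def mcd_dtw_sl(ref, syn):
    """MCD-DTW scaled by a length-ratio penalty (assumed form: max/min)."""
    n, m = len(ref), len(syn)
    coef = max(n, m) / min(n, m)  # equals 1.0 when durations match
    return coef * mcd_dtw(ref, syn)
```

Under these assumptions, plain DTW absorbs a duration mismatch by stretching the alignment, while the length coefficient penalises it. That distinction would matter for dubbing, where synthesised speech has to fit the timing of the clip.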

updated: Thu Nov 25 2021 03:35:18 GMT+0000 (UTC)

published: Thu Nov 25 2021 03:35:18 GMT+0000 (UTC)

arXiv
