DurIAN: Duration Informed Attention Network For Multimodal Synthesis

Chengzhu Yu; Heng Lu; Na Hu; Meng Yu; Chao Weng; Kun Xu; Peng Liu; Deyi Tuo; Shiyin Kang; Guangzhi Lei; Dan Su; Dong Yu

In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This differs from the end-to-end attention mechanism used in existing end-to-end speech synthesis systems such as Tacotron, which accounts for various unavoidable artifacts in those systems. Furthermore, DurIAN can be used to generate high-quality facial expression that can be synchronized with the generated speech, with or without parallel speech and face data. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN effectively reduces the total computational complexity from 9.8 to 5.5 GFLOPS, and is able to generate audio 6 times faster than real time on a single CPU core. We show that DurIAN can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while at the same time avoiding the word skipping/repeating errors found in those systems. Finally, a simple yet effective approach for fine-grained control of the expressiveness of speech and facial expression is introduced.
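
The abstract says only that the text-to-acoustic alignment is inferred from a duration model, not how the durations are applied. As a rough illustration, the sketch below assumes one simple reading: each input unit's encoder state is repeated for its predicted number of output frames, so the alignment is fixed and monotonic. The function names `alignment_from_durations` and `expand_by_duration` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def alignment_from_durations(durations: Sequence[int]) -> List[int]:
    """Return, for each output frame, the index of the input unit it is aligned to.

    Assumption: durations are non-negative integer frame counts, one per input unit.
    """
    alignment: List[int] = []
    for unit_index, n_frames in enumerate(durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Every frame produced for this unit points back at the same unit.
        alignment.extend([unit_index] * n_frames)
    return alignment


def expand_by_duration(
    states: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each input-unit state vector for its number of frames (illustrative only)."""
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per state")
    # Copy each state so that the output frames do not share list objects.
    return [list(states[i]) for i in alignment_from_durations(durations)]


if __name__ == "__main__":
    # Three hypothetical input units with 2-dimensional states.
    states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    durations = [2, 3, 1]  # frames per unit, as a duration model might predict
    print(alignment_from_durations(durations))
    print(len(expand_by_duration(states, durations)))
```

Because every input unit is visited exactly once and in order, an alignment built this way cannot skip or repeat a unit, which is consistent with the robustness claim in the abstract.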

updated: Thu Sep 05 2019 22:35:53 GMT+0000 (UTC)

published: Wed Sep 04 2019 11:35:48 GMT+0000 (UTC)

arXiv
