Large-scale unsupervised audio pre-training for video-to-speech synthesis

Triantafyllos Kefalas; Yannis Panagakis; Maja Pantic

Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches involve a two-step process: an intermediate representation, such as a spectrogram, is first extracted from the video and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, in which the raw audio and any intermediate representations are generated jointly. All such approaches train almost exclusively on audio-visual datasets, i.e. every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets that may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets), as well as audio-only architectures developed by the audio machine learning community over the years. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech, and that it is an unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
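The two-stage recipe in the abstract (audio-only pre-training of an encoder-decoder, then reuse of the pre-trained decoder inside the video-to-speech model) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's architecture: the layers are single linear maps written in plain Python, the "audio" is synthetic sine frames, and the names `AudioAutoencoder`, `VideoToSpeech`, `pretrain` and `init_from_pretrained` are hypothetical.

```python
import copy
import math
import random


class Linear:
    """Tiny dense layer stored as Python lists (toy stand-in for a real network)."""

    def __init__(self, n_in, n_out, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        self.bias = [0.0] * n_out

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


class AudioAutoencoder:
    """Audio-only encoder-decoder for the pre-training stage (hypothetical toy model)."""

    def __init__(self, frame_len, latent_dim, rng):
        self.encoder = Linear(frame_len, latent_dim, rng)
        self.decoder = Linear(latent_dim, frame_len, rng)

    def train_step(self, x, lr):
        """One SGD step on squared reconstruction error; returns the pre-update loss."""
        z = self.encoder(x)
        y = self.decoder(z)
        g = [2.0 * (yi - xi) for yi, xi in zip(y, x)]  # dLoss/dy
        # Backpropagate to the latent before the decoder weights are changed.
        dz = [sum(g[i] * self.decoder.weight[i][j] for i in range(len(g)))
              for j in range(len(z))]
        for i, gi in enumerate(g):
            self.decoder.bias[i] -= lr * gi
            for j, zj in enumerate(z):
                self.decoder.weight[i][j] -= lr * gi * zj
        for j, dzj in enumerate(dz):
            self.encoder.bias[j] -= lr * dzj
            for k, xk in enumerate(x):
                self.encoder.weight[j][k] -= lr * dzj * xk
        return sum((yi - xi) ** 2 for yi, xi in zip(y, x))


class VideoToSpeech:
    """Video encoder followed by an audio decoder (hypothetical toy model)."""

    def __init__(self, video_dim, latent_dim, frame_len, rng):
        self.video_encoder = Linear(video_dim, latent_dim, rng)
        self.audio_decoder = Linear(latent_dim, frame_len, rng)

    def __call__(self, video_features):
        return self.audio_decoder(self.video_encoder(video_features))


def sine_frame(frame_len, rng):
    """Synthetic 'audio' frame: one sine period with a random phase."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [math.sin(2.0 * math.pi * k / frame_len + phase) for k in range(frame_len)]


def pretrain(autoencoder, frame_len, steps, lr, rng):
    """Stage 1: train on audio frames only; no video, text, or labels are used."""
    return [autoencoder.train_step(sine_frame(frame_len, rng), lr) for _ in range(steps)]


def init_from_pretrained(v2s, autoencoder):
    """Stage 2: initialize the audio decoder with a copy of the pre-trained decoder."""
    v2s.audio_decoder = copy.deepcopy(autoencoder.decoder)


rng = random.Random(0)
FRAME_LEN, LATENT_DIM, VIDEO_DIM = 16, 4, 8  # arbitrary toy sizes

autoencoder = AudioAutoencoder(FRAME_LEN, LATENT_DIM, rng)
losses = pretrain(autoencoder, FRAME_LEN, steps=800, lr=0.005, rng=rng)

v2s = VideoToSpeech(VIDEO_DIM, LATENT_DIM, FRAME_LEN, rng)
init_from_pretrained(v2s, autoencoder)
predicted_frame = v2s([0.5] * VIDEO_DIM)  # forward pass with dummy video features

print(f"mean loss, first 50 steps: {sum(losses[:50]) / 50:.3f}")
print(f"mean loss, last 50 steps:  {sum(losses[-50:]) / 50:.3f}")
```

In this sketch the decoder is copied rather than shared, so subsequent training of the video-to-speech model on audio-visual data would start from the pre-trained weights without altering the audio-only model. The only interface the two stages need to agree on is the latent dimension feeding the decoder.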

updated: Tue Jun 27 2023 13:31:33 GMT+0000 (UTC)

published: Tue Jun 27 2023 13:31:33 GMT+0000 (UTC)

arXiv
