Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Wei-Ning Hsu; David Harwath; Christopher Song; James Glass

In this paper, we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations and find empirically that a representation must satisfy several important properties to serve as a drop-in replacement for text.

updated: Thu Dec 31 2020 05:28:38 GMT+0000 (UTC)

published: Thu Dec 31 2020 05:28:38 GMT+0000 (UTC)

arXiv
