Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

Xiaofeng Liu; Fangxu Xing; Jerry L. Prince; Jiachen Zhuo; Maureen Stone; Georges El Fakhri; Jonghye Woo

Understanding the underlying relationship between tongue and oropharyngeal muscle deformation seen in tagged-MRI and intelligible speech plays an important role in advancing speech motor control theories and the treatment of speech-related disorders. Because of their heterogeneous representations, however, direct mapping between the two modalities (i.e., a two-dimensional (mid-sagittal slice) plus time tagged-MRI sequence and its corresponding one-dimensional waveform) is not straightforward. Instead, we use two-dimensional spectrograms, which contain both pitch and resonance, as an intermediate representation, and on that basis develop an end-to-end deep learning framework that translates a sequence of tagged-MRI into its corresponding audio waveform with a limited dataset size. Our framework is based on a novel fully convolutional asymmetry translator guided by a self residual attention strategy that specifically exploits the moving muscular structures during speech. In addition, we leverage the pairwise correlation of samples with the same utterances through a latent space representation disentanglement strategy. Furthermore, we incorporate an adversarial training approach with generative adversarial networks to improve the realism of our generated spectrograms. Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enabled the generation of clear audio waveforms from a sequence of tagged-MRI, surpassing competing methods. Thus, our framework offers great potential to help better understand the relationship between the two modalities.
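To make the intermediate representation concrete, the following is a minimal sketch of how a one-dimensional waveform can be turned into a two-dimensional magnitude spectrogram with a short-time Fourier transform. It is an illustration only, not the authors' preprocessing: the function name `magnitude_spectrogram`, the frame length, the hop size, the Hann window, the sample rate, and the synthetic 440 Hz test tone are all assumptions, since the abstract does not specify how the spectrograms were computed.

```python
import numpy as np


def magnitude_spectrogram(waveform, frame_len=256, hop=128):
    """Return a (frequency, time) magnitude spectrogram of a 1-D signal.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    # Number of full frames that fit in the signal (no padding).
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Synthetic stand-in for a speech recording: one second of a 440 Hz tone.
sample_rate = 8000  # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(tone)

# Strongest frequency bin on average, converted back to Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, peak_bin, peak_hz)
```

Each column of the result is the spectrum of one short time frame, so the array is image-like and a convolutional translator can target it more naturally than the raw waveform. A magnitude spectrogram discards phase, so recovering audio from a generated spectrogram needs an additional inversion step that this sketch does not cover.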

updated: Thu Jun 09 2022 16:27:16 GMT+0000 (UTC)

published: Sun Jun 05 2022 23:08:34 GMT+0000 (UTC)

arXiv
