On the Role of Visual Cues in Audiovisual Speech Enhancement

Zakaria Aldeneh; Anushree Prasanna Kumar; Barry-John Theobald; Erik Marchi; Sachin Kajarekar; Devang Naik; Ahmed Hussen Abdelaziz

We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech versus silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogue of phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used as self-supervision tasks for visual speech applications.

updated: Thu Feb 25 2021 15:56:42 GMT+0000 (UTC)

published: Sat Apr 25 2020 01:00:03 GMT+0000 (UTC)

arXiv
