A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

Wei-Ning Hsu; Bowen Shi

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for speech recognition and speaker verification. In particular, our single model yields speech recognition word error rates of 1.2%/1.4%/27.2% on LRS3 with audio-visual/audio/visual input, respectively.
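
The abstract names modality dropout but does not describe its mechanics, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that a dropped stream is replaced by zero-valued frames and that the choice of which stream to drop is controlled by two probabilities; the function name `modality_dropout` and the parameters `p_drop` and `p_keep_audio` are hypothetical and do not come from the paper.

```python
import random
from typing import List, Optional, Tuple

Frames = List[List[float]]  # one feature vector per time step


def zeros_like(frames: Frames) -> Frames:
    """Return a zero-filled copy with the same shape as `frames`."""
    return [[0.0 for _ in frame] for frame in frames]


def modality_dropout(
    audio: Frames,
    video: Frames,
    p_drop: float = 0.5,        # hypothetical: chance of dropping one stream
    p_keep_audio: float = 0.5,  # hypothetical: if dropping, chance audio survives
    rng: Optional[random.Random] = None,
) -> Tuple[Frames, Frames, str]:
    """Randomly blank one modality so the model sees all three input types.

    Assumption: a dropped stream is replaced by zeros. The real system may
    use a different filler; the abstract does not say.
    """
    rng = rng or random.Random()
    if rng.random() >= p_drop:
        # Keep both streams: audio-visual input.
        return audio, video, "audio-visual"
    if rng.random() < p_keep_audio:
        # Drop the visual stream: audio-only input.
        return audio, zeros_like(video), "audio"
    # Drop the audio stream: visual-only input.
    return zeros_like(audio), video, "visual"


if __name__ == "__main__":
    audio = [[0.1, 0.2], [0.3, 0.4]]
    video = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    rng = random.Random(0)
    for _ in range(5):
        a, v, kind = modality_dropout(audio, video, rng=rng)
        print(kind)
```

Under these assumptions, a single encoder sees audio-visual, audio-only, and visual-only inputs during training, which is one plausible reading of why a model fine-tuned on one modality could still accept the others at test time.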

updated: Thu Jul 14 2022 16:21:33 GMT+0000 (UTC)

published: Thu Jul 14 2022 16:21:33 GMT+0000 (UTC)

arXiv
