Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Chen Chen; Yuchen Hu; Qiang Zhang; Heqing Zou; Beier Zhu; Eng Siong Chng

Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations tend to over-rely on the audio modality, because audio is much easier to recognize than video in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream when the audio is corrupted by noise. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during auto-regressive decoding. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noise.
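The abstract states that the reward is directly related to word error rate (WER) but does not give its exact form. As a minimal sketch, assuming the reward is simply the negative sentence-level WER, it could be computed as below. The function names `word_error_rate` and `wer_reward` are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Assumed reward: a lower WER yields a higher (less negative) reward."""
    return -word_error_rate(reference, hypothesis)
```

Under this assumption, a hypothesis that matches the reference exactly receives the maximum reward of zero, and each word-level error lowers the reward in proportion to the reference length.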

updated: Thu Feb 02 2023 09:30:00 GMT+0000 (UTC)

published: Sat Dec 10 2022 14:01:54 GMT+0000 (UTC)

arXiv
