Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

Joanna Hong; Minsu Kim; Jeongsoo Choi; Yong Man Ro

This paper addresses Audio-Visual Speech Recognition (AVSR) under multimodal input corruption, where both the audio and the visual inputs are corrupted, a setting that previous research has not addressed well. Previous studies have focused on complementing corrupted audio inputs with clean visual inputs, assuming that clean visual inputs are available. In real life, however, clean visual inputs are not always accessible, and the visual inputs can themselves be corrupted by occluded lip regions or noise. We therefore first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams (the audio and the visual inputs) compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, the Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to corrupted multimodal inputs. AV-RelScore can determine which input modal stream is reliable for the prediction and can exploit the more reliable streams in prediction. The effectiveness of the proposed method is evaluated with comprehensive experiments on popular benchmark databases, LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore reflect the degree of corruption well and make the proposed model focus on the reliable multimodal representations.
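
The general idea of scoring each modality's reliability and weighting its features accordingly can be illustrated with a minimal sketch. This is not the authors' AV-RelScore implementation, which the abstract does not specify. The linear-plus-sigmoid scorer, the concatenation-based fusion, and all names (`reliability_score`, `fuse_frames`, the weight vectors) are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def reliability_score(feature: Sequence[float],
                      weights: Sequence[float],
                      bias: float = 0.0) -> float:
    """Hypothetical scorer: map one frame's features to a score in (0, 1).

    A real system would learn `weights` and `bias`; here they are given.
    """
    logit = sum(w * f for w, f in zip(weights, feature)) + bias
    return sigmoid(logit)


def fuse_frames(audio: Sequence[Vector],
                visual: Sequence[Vector],
                audio_w: Sequence[float],
                visual_w: Sequence[float]
                ) -> Tuple[List[Vector], List[Tuple[float, float]]]:
    """Scale each modality's frame by its own score, then concatenate.

    Returns the fused frames and the (audio, visual) score pair per frame.
    """
    fused: List[Vector] = []
    scores: List[Tuple[float, float]] = []
    for a, v in zip(audio, visual):
        s_a = reliability_score(a, audio_w)
        s_v = reliability_score(v, visual_w)
        # A low score shrinks that modality's contribution to the fused frame.
        fused.append([s_a * x for x in a] + [s_v * x for x in v])
        scores.append((s_a, s_v))
    return fused, scores


# Toy input: two time steps, two feature dimensions per modality.
audio_feats = [[1.0, 2.0], [0.0, 0.0]]
visual_feats = [[0.5, -0.5], [3.0, 1.0]]
fused_feats, frame_scores = fuse_frames(
    audio_feats, visual_feats, audio_w=[1.0, 1.0], visual_w=[1.0, 1.0]
)
```

In this toy setup the scores are computed per frame and per modality, so a stream that is corrupted only for part of an utterance would be down-weighted only for those frames. Whether the paper's module operates at that granularity is not stated in the abstract.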

updated: Wed Mar 15 2023 11:29:36 GMT+0000 (UTC)

published: Wed Mar 15 2023 11:29:36 GMT+0000 (UTC)

arXiv
