Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction

Zhongweiyang Xu; Xulin Fan; Mark Hasegawa-Johnson

Audio-visual target speech extraction, which aims to extract a particular speaker's speech from a noisy mixture by looking at lip movements, has made significant progress by combining time-domain speech separation models with visual feature extractors (CNNs). One problem in fusing audio and video information is that the two have different time resolutions. Most current research upsamples the visual features along the time dimension so that the audio and video features align in time. However, we believe that lip movement should mostly contain long-term, or phone-level, information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN, the time resolution of the inter-chunk dimension can be very close to that of the video frames. As in SepFormer, the LSTM in DPRNN is replaced by intra-chunk and inter-chunk self-attention, but in the proposed algorithm, inter-chunk attention incorporates the visual features as an additional feature stream. This avoids upsampling the visual cues, resulting in more efficient audio-visual fusion. Results show that the proposed model achieves superior results compared with other time-domain audio-visual fusion models.
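
The observation about time resolution can be checked with simple arithmetic. The sketch below is illustrative only: the sample rate, encoder stride, chunk size, chunk hop, and video frame rate are assumed values chosen for the example, not settings taken from the paper, and the helper names are hypothetical. The encoder frame count is approximated as samples divided by stride, ignoring edge effects.

```python
import math

def inter_chunk_rate(sample_rate, encoder_stride, chunk_hop):
    """Chunks per second along the inter-chunk axis of a dual-path model."""
    frames_per_second = sample_rate / encoder_stride  # encoder output frame rate
    return frames_per_second / chunk_hop              # one new chunk every chunk_hop frames

def num_chunks(num_frames, chunk_size, chunk_hop):
    """Number of overlapping chunks covering num_frames (tail zero-padded)."""
    if num_frames <= chunk_size:
        return 1
    return 1 + math.ceil((num_frames - chunk_size) / chunk_hop)

# Assumed settings for illustration; none of these come from the paper.
SAMPLE_RATE = 16000          # audio samples per second
ENCODER_STRIDE = 8           # samples between consecutive encoder frames
CHUNK_SIZE = 160             # encoder frames per chunk
CHUNK_HOP = CHUNK_SIZE // 2  # 50% overlap between neighbouring chunks
VIDEO_FPS = 25               # video frames per second
SECONDS = 2                  # clip length

rate = inter_chunk_rate(SAMPLE_RATE, ENCODER_STRIDE, CHUNK_HOP)
audio_frames = SAMPLE_RATE * SECONDS // ENCODER_STRIDE  # approximate frame count
chunks = num_chunks(audio_frames, CHUNK_SIZE, CHUNK_HOP)
video_frames = VIDEO_FPS * SECONDS

print(f"inter-chunk rate: {rate} chunks/s, video rate: {VIDEO_FPS} frames/s")
print(f"inter-chunk length: {chunks}, video length: {video_frames}")
```

Under these assumed settings the inter-chunk axis advances at 25 chunks per second, the same rate as the video, and a 2-second clip yields 49 chunks against 50 video frames. The two sequences are therefore nearly the same length, so visual features could be attended to along the inter-chunk axis without upsampling them to the much higher encoder frame rate.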

updated: Fri Mar 03 2023 21:01:32 GMT+0000 (UTC)

published: Sat Jul 09 2022 07:27:46 GMT+0000 (UTC)

arXiv
