Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Meng Liu; Kong Aik Lee; Longbiao Wang; Hanyi Zhang; Chang Zeng; Jianwu Dang

Visual speech (i.e., lip motion) is highly related to auditory speech because the two co-occur and are synchronized in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality with the aid of knowledge exploited from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves average relative performance improvements of 60% and 20% over independently trained audio-only/visual-only systems and baseline fusion systems, respectively.
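The abstract names a max-feature-map (MFM) operation but does not describe how it is embedded in the Transformer variant. As a minimal sketch only, the following shows the generic MFM operation as it is commonly defined: split a feature vector into two halves and keep the element-wise maximum, which halves the dimension. The function names `max_feature_map` and `max_feature_map_sequence` are hypothetical, and this is not the authors' implementation.

```python
from typing import List


def max_feature_map(features: List[float]) -> List[float]:
    """Generic max-feature-map: split the vector into two equal halves
    and keep the element-wise maximum, halving the feature dimension.

    Assumption: this is the common MFM definition, not necessarily the
    exact form used inside the paper's cross-modal boosters.
    """
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    first, second = features[:half], features[half:]
    return [max(a, b) for a, b in zip(first, second)]


def max_feature_map_sequence(frames: List[List[float]]) -> List[List[float]]:
    """Apply MFM independently to every frame of a feature sequence."""
    return [max_feature_map(frame) for frame in frames]


if __name__ == "__main__":
    # Halves are [0.2, -1.0, 3.0] and [0.5, 0.1, 2.0].
    print(max_feature_map([0.2, -1.0, 3.0, 0.5, 0.1, 2.0]))
```

Because the maximum is taken per position, the operation acts as a competitive selection between paired feature channels rather than a fixed threshold such as ReLU.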

updated: Wed Feb 22 2023 10:06:37 GMT+0000 (UTC)

published: Wed Feb 22 2023 10:06:37 GMT+0000 (UTC)

arXiv
