Rethinking Audio-visual Synchronization for Active Speaker Detection

Abudukelimu Wuerkaixi; You Zhang; Zhiyao Duan; Changshui Zhang

Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of an active speaker. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in the attention modules of supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing this limitation of current models.
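To make the idea of a cross-modal contrastive objective concrete, here is a minimal sketch of a generic InfoNCE-style loss over paired audio and visual embeddings. It is an illustration only, not the authors' implementation: the abstract does not specify the loss, so the function name `info_nce`, the `temperature` value, and the embedding shapes are all assumptions. Row i of each array is assumed to come from the same synchronized clip, and every other row in the batch serves as a negative.

```python
import numpy as np

def info_nce(audio_emb, visual_emb, temperature=0.1):
    """Generic InfoNCE-style loss (hypothetical helper, not from the paper).

    audio_emb, visual_emb: arrays of shape (batch, dim); row i of each is
    assumed to be a synchronized audio-visual pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares audio i with visual j.
    logits = (a @ v.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching (synchronized) pair sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

# Toy check with random embeddings: aligned pairs vs. pairs shifted by one row.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = info_nce(emb, emb)
shifted_loss = info_nce(emb, np.roll(emb, 1, axis=0))
```

In this toy check, `np.roll` stands in for unsynchronized pairs: the shifted pairing yields a higher loss than the aligned one, which is the kind of signal a synchronization-aware model could learn from.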

updated: Sun Jul 10 2022 05:52:31 GMT+0000 (UTC)

published: Tue Jun 21 2022 14:19:06 GMT+0000 (UTC)

arXiv
