LoCoNet: Long-Short Context Network for Active Speaker Detection

Xizi Wang; Feng Cheng; Gedas Bertasius; David Crandall

Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons over audio and visual information drawn from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions among speakers in the same scene. These two contexts are complementary and together help infer the active speaker. Motivated by these observations, we propose LoCoNet, a simple yet effective Long-Short Context Network that models both contexts. We use self-attention to model long-term intra-speaker context because of its effectiveness in modeling long-range dependencies, and convolutional blocks, which capture local patterns, to model short-term inter-speaker context. Extensive experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, with an mAP of 95.2% (+1.1%) on AVA-ActiveSpeaker, 68.1% (+22%) on the Columbia dataset, 97.2% (+2.8%) on the Talkies dataset, and 59.7% (+8.0%) on the Ego4D dataset. Moreover, in challenging cases where multiple speakers are present, or where the face of the active speaker is much smaller than other faces in the same scene, LoCoNet outperforms previous state-of-the-art methods by 3.4% on the AVA-ActiveSpeaker dataset. The code will be released at https://github.com/SJTUwxz/LoCoNet_ASD.
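To make the two contexts concrete, here is a toy sketch in plain Python. It is not the released LoCoNet code: the function names, the `radius` parameter, the identity attention projections, the uniform averaging kernel standing in for learned convolution weights, and the residual sum are all assumptions made for illustration. Features are nested lists indexed as `feats[speaker][frame][channel]`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def long_term_intra_speaker(track):
    """Self-attention across ALL frames of ONE speaker.

    Assumption: queries, keys and values are the raw features
    (no learned projections), to keep the sketch short.
    """
    dim = len(track[0])
    out = []
    for query in track:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in track
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * frame[d] for w, frame in zip(weights, track)) for d in range(dim)]
        )
    return out


def short_term_inter_speaker(feats, radius=1):
    """Local mixing over ALL speakers within +/- radius frames.

    Assumption: a uniform averaging kernel replaces learned
    convolution weights, so every speaker receives the same
    scene summary at a given frame.
    """
    num_speakers, num_frames, dim = len(feats), len(feats[0]), len(feats[0][0])
    summary = []
    for t in range(num_frames):
        lo, hi = max(0, t - radius), min(num_frames, t + radius + 1)
        window = [feats[s][u] for s in range(num_speakers) for u in range(lo, hi)]
        summary.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return [[list(summary[t]) for t in range(num_frames)] for _ in range(num_speakers)]


def long_short_block(feats, radius=1):
    """Long-term step per speaker, then short-term step across speakers,
    combined with a residual sum (an assumption of this sketch)."""
    intra = [long_term_intra_speaker(track) for track in feats]
    inter = short_term_inter_speaker(intra, radius)
    return [
        [[a + b for a, b in zip(fa, fb)] for fa, fb in zip(ta, tb)]
        for ta, tb in zip(intra, inter)
    ]


# Two speakers, three frames, two channels per frame.
demo_feats = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
]
demo_out = long_short_block(demo_feats, radius=1)
```

The attention step looks at every frame of a single speaker's track, while the mixing step only sees a small temporal window but spans all speakers, which mirrors the long-term versus short-term split described in the abstract.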

updated: Fri Mar 29 2024 22:29:03 GMT+0000 (UTC)

published: Thu Jan 19 2023 18:54:43 GMT+0000 (UTC)

arXiv
