A Light Weight Model for Active Speaker Detection

Junhua Liao; Haihan Duan; Kanghui Feng; Wanbing Zhao; Yanbing Yang; Liangyin Chen

Active speaker detection is a challenging task in audio-visual scenario understanding that aims to detect who is speaking in scenarios with one or more speakers. The task has received extensive attention because it is crucial in applications such as speaker diarization, speaker tracking, and automatic video editing. Existing studies try to improve performance by taking information from multiple candidates as input and by designing complex models. Although these methods achieve outstanding performance, their high consumption of memory and computational power makes them difficult to apply in resource-limited scenarios. We therefore construct a lightweight active speaker detection architecture by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while its resource costs are significantly lower than those of the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x fewer) and FLOPs (0.6G vs. 2.6G, about 4x fewer). In addition, our framework also performs well on the Columbia dataset, showing good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
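
To give a rough feel for why splitting convolutions reduces model size, the sketch below counts weights for a full 3D convolution and for one possible factorization into a spatial convolution followed by a temporal one. This is only an illustration: the abstract does not specify how the split is done, so the factorization shown, the function names, and the channel and kernel sizes are all assumptions, not the authors' layer configuration. The last lines simply recompute the parameter and FLOP ratios quoted above.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a full 3D convolution with a kt x kh x kw kernel (no bias)."""
    return c_in * c_out * kt * kh * kw


def split_conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count when the kernel is split into a spatial 1 x kh x kw
    convolution followed by a temporal kt x 1 x 1 convolution (no bias).
    This factorization is an assumption made for illustration."""
    spatial = c_in * c_out * kh * kw   # acts within each frame
    temporal = c_out * c_out * kt      # acts across frames
    return spatial + temporal


# Hypothetical layer: 32 input channels, 64 output channels, 3x3x3 kernel.
full = conv3d_params(32, 64, 3, 3, 3)
split = split_conv3d_params(32, 64, 3, 3, 3)
print(f"full 3D conv: {full} weights, split conv: {split} weights")

# Ratios behind the figures quoted in the abstract.
param_ratio = 22.5 / 1.0   # parameters, in millions
flop_ratio = 2.6 / 0.6     # FLOPs, in billions
print(f"parameters: {param_ratio:.1f}x, FLOPs: {flop_ratio:.1f}x")
```

The saving from a split like this grows with kernel size, since the full kernel scales with the product kt * kh * kw while the split version scales with the sum of a spatial and a temporal term.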

updated: Wed Mar 08 2023 08:40:56 GMT+0000 (UTC)

published: Wed Mar 08 2023 08:40:56 GMT+0000 (UTC)

arXiv
