Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization

Kyle Min

This report describes our approach for the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022. Specifically, we present multiple technical improvements over the official baselines. First, we improve the detection performance of the camera wearer's voice activity by modifying the training scheme of its model. Second, we discover that an off-the-shelf voice activity detection model can effectively remove false positives when it is applied solely to the camera wearer's voice activities. Lastly, we show that better active speaker detection leads to a better AVD outcome. Our final method obtains 65.9% DER on the test set of Ego4D, which significantly outperforms all the baselines. Our submission achieved 1st place in the Ego4D Challenge 2022.
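
The second improvement above, keeping a camera-wearer voice activity prediction only where an off-the-shelf voice activity detector also reports speech, can be illustrated with a minimal sketch. The report does not say how the two outputs are combined, so the interval-intersection rule, the function name `intersect_segments`, and the example timestamps below are all assumptions made for illustration, not the authors' implementation.

```python
def intersect_segments(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) segments.

    Hypothetical helper: `a` stands for predicted camera-wearer speech
    segments and `b` for speech segments from a generic VAD model.
    Times are in seconds.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Overlap of the two current segments, if any.
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Made-up example: the wearer segment at 7.0-8.0 s has no VAD support,
# so it is treated as a false positive and dropped.
wearer_pred = [(0.0, 2.0), (3.0, 5.0), (7.0, 8.0)]
vad_speech = [(0.5, 1.5), (4.0, 6.0)]
filtered = intersect_segments(wearer_pred, vad_speech)
```

Intersecting can only shrink the predicted speech time, so under this assumed rule it reduces false alarms at the risk of more missed speech wherever the VAD itself misses the wearer's voice.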

updated: Fri Oct 14 2022 12:54:03 GMT+0000 (UTC)

published: Fri Oct 14 2022 12:54:03 GMT+0000 (UTC)

arXiv
