STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Kazuki Shimada; Archontis Politis; Parthasaarathy Sudarsanam; Daniel Krause; Kengo Uchida; Sharath Adavanne; Aapo Hakala; Yuichiro Koyama; Naoya Takahashi; Shusuke Takahashi; Tuomas Virtanen; Yuki Mitsufuji

The direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, yet sound events usually originate from visually perceptible source objects; for example, the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using both the signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. The sound scenes in STARSS23 are recorded with instructions that guide the recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on the tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.
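
To make the notion of localization error concrete, the following is a minimal sketch of one common way to compare an estimated DOA against a reference DOA: the angle between the two directions on the unit sphere. The abstract does not specify how the benchmark computes its error, so the azimuth/elevation convention in degrees and the function names `doa_to_unit_vector` and `angular_error_deg` are assumptions made for illustration only.

```python
import math


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert a DOA given as (azimuth, elevation) in degrees to a 3D unit vector.

    Assumed convention (not taken from the paper): azimuth is measured in the
    horizontal plane, elevation is measured upward from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_error_deg(reference, estimate):
    """Return the angle in degrees between a reference DOA and an estimated DOA.

    Both arguments are (azimuth_deg, elevation_deg) tuples.
    """
    u = doa_to_unit_vector(*reference)
    v = doa_to_unit_vector(*estimate)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))


if __name__ == "__main__":
    # Hypothetical values: a reference label and a system estimate for one frame.
    reference_doa = (30.0, 10.0)
    estimated_doa = (42.0, 5.0)
    print(f"angular error: {angular_error_deg(reference_doa, estimated_doa):.2f} degrees")
```

Under this assumed measure, a lower value means the estimated direction lies closer to the labeled direction, which is the sense in which one system can be said to localize better than another.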

updated: Thu Jun 15 2023 13:37:14 GMT+0000 (UTC)

published: Thu Jun 15 2023 13:37:14 GMT+0000 (UTC)

arXiv
