Points2Sound: From mono to binaural audio using 3D point cloud scenes

Francesc Lluís; Vasileios Chatziioannou; Alex Hofmann

For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent studies have shown the possibility of using neural networks for synthesizing binaural audio from mono audio by using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network and an audio network. The vision network uses 3D sparse convolutions to extract a visual feature from the point cloud scene. Then, the visual feature conditions the audio network, which operates in the waveform domain, to synthesize the binaural version. Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. We also investigate how 3D point cloud attributes, learning objectives, different reverberant conditions, and several types of mono mixture signals affect the binaural audio synthesis performance of Points2Sound for the different numbers of sound sources present in the scene.
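The data flow described above (a 3D scene is reduced to a visual feature, which then steers how a mono waveform becomes two channels) can be illustrated with a deliberately simplified sketch. This is not the Points2Sound model: the learned sparse-convolution vision network is replaced here by the azimuth of the point-cloud centroid, and the learned waveform-domain audio network by constant-power panning. The function names `scene_feature` and `binauralize`, and the coordinate convention (listener at the origin, +y ahead, +x to the right), are assumptions made only for this illustration.

```python
import math


def scene_feature(points):
    """Toy stand-in for the vision network (hypothetical helper).

    Reduces a point cloud to one number: the azimuth, in radians, of the
    centroid of the points. Assumed convention: listener at the origin,
    +y straight ahead, +x to the right.
    """
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return math.atan2(cx, cy)  # 0 = ahead, +pi/2 = fully right


def binauralize(mono, azimuth):
    """Toy stand-in for the conditioned audio network (hypothetical helper).

    Works directly on waveform samples and uses the visual feature to set
    left/right gains via constant-power panning. The real model learns this
    mapping; nothing here is learned.
    """
    # Clamp to the frontal half-plane, then map [-pi/2, pi/2] to [0, pi/2].
    az = max(-math.pi / 2, min(math.pi / 2, azimuth))
    theta = (az + math.pi / 2) / 2
    gain_left, gain_right = math.cos(theta), math.sin(theta)
    left = [gain_left * s for s in mono]
    right = [gain_right * s for s in mono]
    return left, right


if __name__ == "__main__":
    mono = [0.0, 0.5, -0.5, 1.0]
    source_on_the_right = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
    left, right = binauralize(mono, scene_feature(source_on_the_right))
    print(left, right)
```

The point of the sketch is only the interface: the mono signal alone carries no spatial information, so the two output channels differ solely because of what was extracted from the 3D scene.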

updated: Fri May 19 2023 12:54:02 GMT+0000 (UTC)

published: Mon Apr 26 2021 10:44:01 GMT+0000 (UTC)

arXiv
