Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

Rishabh Garg; Ruohan Gao; Kristen Grauman

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the coherence of the visual stream with the positions of the sound source(s), and the consistency in geometry of the sounding objects over time. Furthermore, we introduce a new large video dataset with realistic binaural audio simulated for real-world scanned environments. On two datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.
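
The abstract names one main objective (binaural audio generation) and three geometric cues (room impulse response, visual coherence with the source positions, and geometric consistency over time). A minimal sketch of how such objectives might be combined into a single training loss is shown below. The weighted-sum form, the names `TaskWeights` and `multitask_loss`, and the default weights are all assumptions made for illustration; the abstract does not state how the tasks are actually combined or weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskWeights:
    """Hypothetical per-task weights; the abstract gives no values."""
    binaural: float = 1.0
    impulse_response: float = 1.0
    spatial_coherence: float = 1.0
    geometric_consistency: float = 1.0


def multitask_loss(binaural, impulse_response, spatial_coherence,
                   geometric_consistency, weights=TaskWeights()):
    """Return a weighted sum of the four per-task loss values.

    Each argument is a scalar loss already computed elsewhere
    (for example, one value per training batch).
    """
    return (weights.binaural * binaural
            + weights.impulse_response * impulse_response
            + weights.spatial_coherence * spatial_coherence
            + weights.geometric_consistency * geometric_consistency)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    total = multitask_loss(1.0, 2.0, 3.0, 4.0)
    print(total)
```

Setting a weight to zero drops the corresponding auxiliary task, which is one simple way to check how much each geometric cue contributes under this assumed formulation.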

updated: Sun Nov 21 2021 19:26:45 GMT+0000 (UTC)

published: Sun Nov 21 2021 19:26:45 GMT+0000 (UTC)

arXiv
