Visual Scene Graphs for Audio Source Separation

Moitreya Chatterjee; Jonathan Le Roux; Narendra Ahuja; Anoop Cherian

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose the Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually orthogonal subgraph embeddings of the visual graph using multi-head attention. These embeddings are used for conditioning an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds. In this paper, we also introduce an "in the wild" video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and provides a challenging, natural, and daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method against recent prior approaches.
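
Two ideas from the abstract can be illustrated with a small sketch: building a self-supervised training example by artificially mixing sounds, and measuring how far a set of embeddings is from being mutually orthogonal. The abstract does not give the model's actual loss or mixing procedure, so the function names `mix_sources` and `orthogonality_penalty`, the sample-wise addition, and the squared-cosine penalty below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def mix_sources(a, b):
    """Artificially mix two equal-length mono signals by sample-wise addition.

    In a mix-and-separate setup the original signals a and b can then serve
    as separation targets for the mixture (assumed setup, not from the paper).
    """
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return [x + y for x, y in zip(a, b)]


def orthogonality_penalty(embeddings):
    """Sum of squared cosine similarities over all distinct embedding pairs.

    The value is 0.0 when the embeddings are mutually orthogonal and grows
    as they align. This is one hypothetical way to encourage orthogonality.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("embedding must be non-zero")
        return [x / norm for x in v]

    units = [unit(v) for v in embeddings]
    penalty = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            cos = sum(p * q for p, q in zip(units[i], units[j]))
            penalty += cos * cos
    return penalty


if __name__ == "__main__":
    mixture = mix_sources([0.1, 0.2, 0.3], [0.3, 0.2, 0.1])
    print(len(mixture))
    print(orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))
```

In a full system the mixture would be converted to a spectrogram and the embeddings would come from the visual graph; both steps are omitted here.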

updated: Fri Sep 24 2021 13:40:51 GMT+0000 (UTC)

published: Fri Sep 24 2021 13:40:51 GMT+0000 (UTC)

arXiv
