Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

Lingyu Zhu; Esa Rahtu

The objective of this paper is to perform audio-visual sound source separation, i.e., to separate the component audio signals from a mixture based on videos of the sound sources. Moreover, we aim to pinpoint the source location in the input video sequence. Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type (e.g. humans playing instruments) and pre-trained motion detectors (e.g. keypoints or optical flow). However, this reliance limits the models to a particular application domain. In this paper, we address these limitations and make the following contributions: i) we propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise in appearance and motion cues, respectively. The entire system is trained in a self-supervised manner; ii) we introduce an Audio-Motion Embedding (AME) framework to explicitly represent the motions that are related to sound; iii) we propose an audio-motion transformer architecture for audio and motion feature fusion; iv) we demonstrate state-of-the-art performance on two challenging datasets (MUSIC-21 and AVE) despite the fact that we do not use any pre-trained keypoint detectors or optical flow estimators. Project page: https://ly-zhu.github.io/self-supervised-motion-representations
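
As a rough illustration of what fusing audio and motion features with attention can look like, the sketch below implements a single generic cross-attention step in NumPy, where audio features act as queries over motion features. It is not the paper's audio-motion transformer: the function name `cross_attention_fuse`, the feature shapes, the residual connection, and the random projection matrices are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(audio, motion, w_q, w_k, w_v):
    """Hypothetical fusion step: audio (Ta, d) attends over motion (Tm, d).

    Returns the fused features (Ta, d) and the attention weights (Ta, Tm).
    """
    q = audio @ w_q                            # queries from audio features
    k = motion @ w_k                           # keys from motion features
    v = motion @ w_v                           # values from motion features
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each audio step sums to 1 over motion steps
    fused = audio + weights @ v                # residual connection (an assumption)
    return fused, weights

# Toy inputs: 5 audio time steps, 3 motion time steps, feature size 8 (all made up).
rng = np.random.default_rng(0)
d = 8
audio = rng.standard_normal((5, d))
motion = rng.standard_normal((3, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

fused, weights = cross_attention_fuse(audio, motion, w_q, w_k, w_v)
```

In this toy setup each row of `weights` indicates how strongly one audio time step draws on each motion time step. A trained model would learn the projection matrices rather than sample them randomly.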

updated: Sat Apr 17 2021 10:09:15 GMT+0000 (UTC)

published: Sat Apr 17 2021 10:09:15 GMT+0000 (UTC)

arXiv
