Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention

Efthymios Tzinis; Scott Wisdom; Tal Remez; John R. Hershey

We introduce a state-of-the-art audio-visual on-screen sound separation system that learns to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous work on audio-visual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention and the poor convergence of the audio separation model. Our proposed model addresses these issues by using cross-modal and self-attention modules that capture audio-visual dependencies at a finer resolution over time, and by unsupervised pre-training of the audio separation model. These improvements allow the model to generalize to a much wider set of unseen videos. We also show a robust way to further improve the generalization capability of our models by calibrating the probabilities of our audio-visual on-screen classifier, using only a small number of in-domain videos labeled for their on-screen presence. For evaluation and semi-supervised training, we collected human annotations of on-screen audio from a large database of in-the-wild videos (YFCC100m). Our results show marked improvements in on-screen separation performance under more general conditions than previous methods.
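The abstract does not say which calibration method is applied to the on-screen classifier. As a rough illustration of the general idea only, the sketch below assumes Platt scaling: a two-parameter affine map on the classifier's logits, fit on a small labeled set. The function names, the toy probabilities, and the labels are all hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(p, eps=1e-6):
    # Inverse of the sigmoid, clipped away from 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def calibrate(p, a, b):
    # Platt scaling (an assumption here): affine map in logit space.
    return sigmoid(a * logit(p) + b)

def nll(probs, labels, a, b):
    # Mean binary cross-entropy of the calibrated probabilities.
    total = 0.0
    for p, y in zip(probs, labels):
        q = min(max(calibrate(p, a, b), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(probs)

def fit_platt(probs, labels, lr=0.1, steps=2000):
    # Plain gradient descent on the cross-entropy, starting from
    # the identity calibration (a=1, b=0).
    a, b = 1.0, 0.0
    zs = [logit(p) for p in probs]
    n = len(zs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, labels):
            err = sigmoid(a * z + b) - y
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical toy data: raw on-screen probabilities for eight source
# estimates, with 0/1 labels for on-screen presence. Made-up values.
raw_probs = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60, 0.75, 0.65]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

a, b = fit_platt(raw_probs, labels)
calibrated = [calibrate(p, a, b) for p in raw_probs]
loss_before = nll(raw_probs, labels, 1.0, 0.0)
loss_after = nll(raw_probs, labels, a, b)
```

In this toy setup the raw classifier is overconfident that sources are on-screen, so the fitted map lowers the cross-entropy on the labeled set. Only two scalars are learned, which is why a small number of labeled in-domain videos could be enough for this kind of adjustment.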

updated: Thu Oct 14 2021 04:05:54 GMT+0000 (UTC)

published: Thu Jun 17 2021 17:23:44 GMT+0000 (UTC)

arXiv
