CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

Hao-Wen Dong; Naoya Takahashi; Yuki Mitsufuji; Julian McAuley; Taylor Berg-Kirkpatrick

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
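The query mechanism described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation. The lookup table `TOY_EMBEDDINGS` is a hypothetical stand-in for the CLIP encoders, and the fixed `basis_masks`, the `sharpness` parameter, and the function names `encode_query` and `separate` are all assumptions made for illustration. In a real system the masks would be predicted from the mixture by a trained separation network. The sketch only shows why a model conditioned on image queries can plausibly be queried with text: both modalities land near each other in a shared embedding space, so the conditioning signal is nearly the same.

```python
import math

# Hypothetical stand-in for a joint language-image embedding such as CLIP:
# an image and a text describing the same concept map to nearby vectors.
TOY_EMBEDDINGS = {
    ("image", "dog_frame"): [0.9, 0.1],
    ("text", "a photo of a dog"): [1.0, 0.0],
    ("image", "guitar_frame"): [0.1, 0.9],
    ("text", "a photo of a guitar"): [0.0, 1.0],
}


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def encode_query(modality, key):
    """Return a unit-length query vector for an image or text query."""
    return normalize(TOY_EMBEDDINGS[(modality, key)])


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def separate(mixture, basis_masks, query, sharpness=8.0):
    """Apply a query-conditioned soft mask to a mixture magnitude spectrum.

    mixture:     list of magnitudes, one per frequency bin
    basis_masks: one list per query dimension, each scoring every bin
                 (fixed here; assumed to come from a trained network)
    query:       query vector whose entries weight the basis masks
    """
    separated = []
    for b, magnitude in enumerate(mixture):
        score = sum(q * mask[b] for q, mask in zip(query, basis_masks))
        separated.append(magnitude * sigmoid(sharpness * score))
    return separated


# Toy mixture with 4 frequency bins: bins 0-1 hold the dog, bins 2-3 the guitar.
mixture = [1.0, 1.0, 1.0, 1.0]
basis_masks = [
    [1.0, 1.0, -1.0, -1.0],  # responds to the dog bins
    [-1.0, -1.0, 1.0, 1.0],  # responds to the guitar bins
]

# Image query (the kind available during training) and
# text query (the kind used at test time) select the same bins.
from_image = separate(mixture, basis_masks, encode_query("image", "dog_frame"))
from_text = separate(mixture, basis_masks, encode_query("text", "a photo of a dog"))
```

In this toy setup both queries keep the dog bins and suppress the guitar bins, which mirrors the zero-shot switch from image queries to text queries that the abstract describes.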

updated: Fri Mar 03 2023 08:37:38 GMT+0000 (UTC)

published: Wed Dec 14 2022 07:21:45 GMT+0000 (UTC)

arXiv
