Learning Audio-Visual Correlations from Variational Cross-Modal Generation

Ye Zhu; Yu Wu; Hugo Latapie; Yi Yang; Yan Yan

People can easily imagine the sound an event is likely to make while watching it. This natural synchronization between audio and visual signals reveals their intrinsic correlations. We therefore propose to learn audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied to multiple downstream tasks such as audio-visual cross-modal localization and retrieval. To tackle the problem, we introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE), with an additional Wasserstein distance constraint. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE effectively learns the audio-visual correlations and can be readily applied to multiple audio-visual downstream tasks, achieving competitive performance even without any label information during training.
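The abstract does not spell out the training objective, so the following is only a minimal sketch of how a VAE loss with a Wasserstein constraint might be assembled. It assumes that the audio and visual encoders each output a diagonal Gaussian posterior and that the constraint is applied between those two posteriors. The function names, the weights `beta` and `lam`, and the use of the closed-form 2-Wasserstein distance are illustrative assumptions, not details taken from the paper.

```python
import math

def w2_squared(mu_a, sigma_a, mu_v, sigma_v):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances this has the closed form
    ||mu_a - mu_v||^2 + ||sigma_a - sigma_v||^2.
    """
    mean_term = sum((a - v) ** 2 for a, v in zip(mu_a, mu_v))
    std_term = sum((a - v) ** 2 for a, v in zip(sigma_a, sigma_v))
    return mean_term + std_term

def kl_to_standard_normal(mu, sigma):
    """KL divergence from N(mu, diag(sigma^2)) to the standard normal prior."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def ms_vae_loss(recon_err, mu_a, sigma_a, mu_v, sigma_v, beta=1.0, lam=1.0):
    """Hypothetical combined objective (an assumption, not the paper's exact loss).

    recon_err: reconstruction error of the shared decoder, computed elsewhere.
    beta, lam: illustrative weights for the KL and Wasserstein terms.
    """
    kl = kl_to_standard_normal(mu_a, sigma_a) + kl_to_standard_normal(mu_v, sigma_v)
    return recon_err + beta * kl + lam * w2_squared(mu_a, sigma_a, mu_v, sigma_v)
```

Under these assumptions, the Wasserstein term vanishes when the two encoders produce identical posteriors for a paired audio clip and video frame, which is one way a shared decoder could be encouraged to treat both latent codes interchangeably.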

updated: Fri Feb 05 2021 21:27:00 GMT+0000 (UTC)

published: Fri Feb 05 2021 21:27:00 GMT+0000 (UTC)

arXiv
