Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals

Shah Nawaz; Muhammad Kamran Janjua; Ignazio Gallo; Arif Mahmood; Alessandro Calefati

We propose a novel deep training algorithm for the joint representation of audio and visual information. It consists of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging class centers, which helps eliminate the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchmark audio-visual dataset, on a multitude of tasks including cross-modal verification, cross-modal matching, and cross-modal retrieval. State-of-the-art performance is achieved on cross-modal verification and matching, while comparable results are observed on the remaining applications. Our experiments demonstrate the effectiveness of the technique for cross-modal biometric applications.
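To illustrate how class centers can stand in for pairwise or triplet supervision, here is a minimal sketch of a generic center-style objective. It is not the loss proposed in the paper, whose exact form the abstract does not give. The function names `class_centers` and `center_loss`, the toy embeddings, and the choice of a plain mean as the center are all assumptions made for illustration. The idea shown is that face and voice embeddings of the same identity are pulled toward one shared center, so no explicit positive or negative pairs need to be mined.

```python
from typing import Dict, Hashable, List, Sequence


def class_centers(
    embeddings: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> Dict[Hashable, List[float]]:
    """Mean embedding per identity, pooled over all modalities (assumed form)."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def center_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    centers: Dict[Hashable, List[float]],
) -> float:
    """Average of half the squared distance from each embedding to its class center."""
    total = 0.0
    for vec, label in zip(embeddings, labels):
        center = centers[label]
        total += 0.5 * sum((v - c) ** 2 for v, c in zip(vec, center))
    return total / len(embeddings)


# Hypothetical 2-D embeddings: one face and one voice sample per identity.
face_and_voice = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [6.0, 4.0]]
identities = ["a", "a", "b", "b"]

centers = class_centers(face_and_voice, identities)
loss = center_loss(face_and_voice, identities, centers)
print(centers, loss)
```

Because each sample is compared only against its own class center, the cost per batch grows linearly with the number of samples, whereas pairwise and triplet schemes must select among quadratically or cubically many combinations.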

updated: Wed Sep 18 2019 20:18:44 GMT+0000 (UTC)

published: Wed Sep 18 2019 20:18:44 GMT+0000 (UTC)

arXiv
