Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

Muhammad Saad Saeed; Shah Nawaz; Muhammad Haris Khan; Sajid Javed; Muhammad Haroon Yousaf; Alessio Del Bue

Recent years have seen increased interest in associating the faces and voices of celebrities by leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable to the associated matching and verification tasks. Although they show some progress, such formulations are restrictive because they depend on a distance-dependent margin parameter, have poor run-time training complexity, and rely on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with effective yet efficient supervision is important for realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a lightweight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them by identity label via orthogonality constraints. We call the proposed mechanism fusion and orthogonal projection (FOP) and instantiate it in a two-stream network. The resulting framework is evaluated on the VoxCeleb1 and MAV-Celeb datasets across a multitude of tasks, including cross-modal verification and matching. Results reveal that our method performs favourably against current state-of-the-art methods and that our proposed formulation of supervision is more effective and efficient than those employed by contemporary methods. In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association. Code is available at https://github.com/msaadsaeed/FOP
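
The abstract names two ingredients: fusing face and voice embeddings, and clustering the fused embeddings by identity through an orthogonality constraint. The sketch below illustrates one plausible reading of that idea in plain Python. It is not the authors' implementation (see the linked repository for that). The function names `fuse` and `orthogonality_loss`, the fixed per-dimension gate, and the exact form of the loss are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(face, voice, gate):
    """Hypothetical gated fusion: a per-dimension convex blend of two embeddings.

    In a trained model the gate would be predicted from the inputs; here it is
    a fixed list of weights in [0, 1], which is an assumption of this sketch.
    """
    return [g * f + (1.0 - g) * v for f, v, g in zip(face, voice, gate)]


def orthogonality_loss(embeddings, labels):
    """Assumed orthogonality-style objective over a batch.

    Same-identity pairs are pushed toward cosine similarity 1, and
    different-identity pairs toward cosine similarity 0 (orthogonal).
    """
    unit = [l2_normalize(e) for e in embeddings]
    same, diff = [], []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            (same if labels[i] == labels[j] else diff).append(cos)
    mean_same = sum(same) / len(same) if same else 1.0
    mean_diff = sum(abs(c) for c in diff) / len(diff) if diff else 0.0
    return (1.0 - mean_same) + mean_diff


# Toy batch: two samples of identity "a" and one of identity "b".
faces = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
voices = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
gate = [0.5, 0.5]
fused = [fuse(f, v, gate) for f, v in zip(faces, voices)]
loss = orthogonality_loss(fused, ["a", "a", "b"])
print(loss)
```

Unlike a margin-based metric loss, an objective of this shape needs neither a margin hyperparameter nor mined negatives, since every pair in the batch contributes according to its labels. That matches the limitations of prior work listed in the abstract, although the authors' actual formulation may differ.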

updated: Mon Aug 22 2022 12:23:09 GMT+0000 (UTC)

published: Mon Aug 22 2022 12:23:09 GMT+0000 (UTC)

arXiv
