Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation

Bin Duan; Wei Wang; Hao Tang; Hugo Latapie; Yan Yan

Since infancy, we have intuitively developed the ability to correlate inputs from different cognitive sensors such as vision, audio, and text. In machine learning, however, cross-modal learning is a nontrivial task because different modalities do not share homogeneous properties. Previous work has found that bridges should exist between different modalities. From the perspective of neurology and psychology, humans have the capacity to link one modality with another, e.g., associating a picture of a bird with merely the sound of its singing, and vice versa. Is it possible for machine learning algorithms to recover the scene given the audio signal? In this paper, we propose a novel Cascade Attention-Guided Residue GAN (CAR-GAN) that aims to reconstruct scenes from their corresponding audio signals. In particular, we present a residue module that progressively mitigates the gap between modalities. Moreover, a cascade attention-guided network with a novel classification loss function is designed to tackle the cross-modal learning task. Our model maintains consistency in the high-level semantic label domain and is able to balance the two modalities. Experimental results demonstrate that our model achieves state-of-the-art cross-modal audio-visual generation on the challenging Sub-URMP dataset. Code will be available at https://github.com/tuffr5/CAR-GAN.

updated: Fri Dec 10 2021 18:52:25 GMT+0000 (UTC)

published: Wed Jul 03 2019 10:04:54 GMT+0000 (UTC)

arXiv
