Multimodal Fusion Refiner Networks

Sethuraman Sankaran; David Yang; Ser-Nam Lim

Tasks that rely on multimodal information typically include a fusion module that combines information from different modalities. In this work, we develop a Refiner Fusion Network (ReFNet) that enables fusion modules to combine strong unimodal representations with strong multimodal representations. ReFNet pairs the fusion network with a decoding/defusing module, which imposes a modality-centric responsibility condition. This approach addresses a significant gap in existing multimodal fusion frameworks by ensuring that both unimodal and fused representations are strongly encoded in the latent fusion space. We demonstrate that the Refiner Fusion Network can improve on the performance of powerful baseline fusion modules such as multimodal transformers. The refiner network induces graphical representations of the fused embeddings in the latent space, a property that we prove under certain conditions and that is supported by strong empirical results in the numerical experiments. These graph structures are further strengthened by combining ReFNet with a Multi-Similarity contrastive loss function. The modular nature of the Refiner Fusion Network allows it to be combined easily with different fusion architectures. In addition, the refiner step can be applied for pre-training on unlabeled datasets, thus leveraging unlabeled data to improve performance. We demonstrate the power of Refiner Fusion Networks on three datasets, and further show that they can maintain performance with only a small fraction of labeled data.
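The fuse-then-defuse idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear fusion and defusing layers, the mean-squared-error reconstruction penalty standing in for the responsibility condition, and all names (`fuse`, `refiner_loss`, `defuse_weights`, the toy `image` and `text` embeddings). The actual ReFNet architecture and loss may differ.

```python
import random


def linear(weights, vec):
    """Apply a bias-free dense layer: returns weights @ vec."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weight matrix (hypothetical initialisation)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def mse(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def fuse(unimodal, fusion_weights):
    """Concatenate the unimodal embeddings and project them to a fused latent."""
    concatenated = [x for emb in unimodal.values() for x in emb]
    return linear(fusion_weights, concatenated)


def refiner_loss(unimodal, fused, defuse_weights):
    """Decode ("defuse") the fused latent back to each modality.

    Returns one reconstruction error per modality, so every modality is
    individually accountable for being recoverable from the fused vector.
    """
    return {
        name: mse(linear(defuse_weights[name], fused), emb)
        for name, emb in unimodal.items()
    }


rng = random.Random(0)
unimodal = {"image": [0.2, -0.1, 0.4], "text": [0.5, 0.3]}  # toy embeddings
latent_dim = 4
input_dim = sum(len(emb) for emb in unimodal.values())

fusion_weights = random_matrix(latent_dim, input_dim, rng)
defuse_weights = {
    name: random_matrix(len(emb), latent_dim, rng)
    for name, emb in unimodal.items()
}

fused = fuse(unimodal, fusion_weights)
losses = refiner_loss(unimodal, fused, defuse_weights)
total_refiner_loss = sum(losses.values())
print(losses, total_refiner_loss)
```

Because the penalty in this sketch needs no labels, it could in principle be minimised on unlabeled pairs before a supervised task loss is added, which mirrors the pre-training use mentioned in the abstract.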

updated: Thu Apr 08 2021 00:02:01 GMT+0000 (UTC)

published: Thu Apr 08 2021 00:02:01 GMT+0000 (UTC)

arXiv
