Multi-modal Latent Diffusion

Mustapha Bounoua; Giulio Franzese; Pietro Michiardi

Multi-modal data-sets are ubiquitous in modern applications, and multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities. However, existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities, and vice versa. We discuss the limitations underlying the unsatisfactory performance of existing methods, to motivate the need for a different approach. We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders. Individual latent variables are concatenated into a common latent space, which is fed to a masked diffusion model to enable generative modeling. We also introduce a new multi-time training method to learn the conditional score network for multi-modal diffusion. Our methodology substantially outperforms competitors in both generation quality and coherence, as shown through an extensive experimental campaign.
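
To make the latent-space construction described above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes two hypothetical modalities whose latents have already been produced by separately trained autoencoders (random arrays stand in for them here), concatenates them, and perturbs only the modality to be generated while the conditioning modality stays clean. The function names, the latent sizes, and the simple variance-preserving noise rule are illustrative assumptions; the abstract does not specify any of them, and the score network and multi-time training procedure are not shown.

```python
import numpy as np


def concat_latents(latents):
    """Concatenate per-modality latent vectors along the feature axis."""
    return np.concatenate(latents, axis=-1)


def noise_selected(z, dims, noised, t, rng):
    """Perturb only the modalities flagged in `noised` (illustrative only).

    z      : (batch, sum(dims)) concatenated latents
    dims   : latent width of each modality
    noised : one bool per modality; False means "kept clean as conditioning"
    t      : noise level in [0, 1] (assumed variance-preserving schedule)
    """
    # Expand the per-modality flags into a per-feature mask.
    mask = np.concatenate([np.full(d, float(m)) for d, m in zip(dims, noised)])
    eps = rng.standard_normal(z.shape)
    z_noisy = np.sqrt(1.0 - t) * z + np.sqrt(t) * eps
    # Noised features come from z_noisy, conditioning features from z.
    return mask * z_noisy + (1.0 - mask) * z, mask


rng = np.random.default_rng(0)
dims = [8, 3]                               # hypothetical latent sizes
z_a = rng.standard_normal((4, dims[0]))     # stand-in for modality A latents
z_b = rng.standard_normal((4, dims[1]))     # stand-in for modality B latents

z = concat_latents([z_a, z_b])
# Generate modality A conditioned on modality B: only A is perturbed.
z_t, mask = noise_selected(z, dims, [True, False], t=0.5, rng=rng)
```

In a sketch like this, a single network operating on `z_t` together with `mask` could in principle serve both joint and conditional generation, since the mask tells it which coordinates are observed; whether this matches the paper's masked diffusion model in detail is not something the abstract confirms.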

updated: Wed Jun 07 2023 14:16:44 GMT+0000 (UTC)

published: Wed Jun 07 2023 14:16:44 GMT+0000 (UTC)

arXiv
