Multimodal Transformer for Parallel Concatenated Variational Autoencoders

Stephen D. Liang; Jerry M. Mendel

In this paper, we propose a multimodal transformer using a parallel concatenated architecture. Instead of using patches, we use column stripes of the images in the R, G, and B channels as the transformer input. The column stripes preserve the spatial relations of the original image. We combine the multimodal transformer with a variational autoencoder for synthetic cross-modal data generation. The multimodal transformer is designed using multiple compression matrices, and it serves as the encoders for Parallel Concatenated Variational AutoEncoders (PC-VAE). The PC-VAE consists of multiple encoders, one latent space, and two decoders. The encoders are based on random Gaussian matrices and do not need any training. We propose a new loss function based on the interaction information from partial information decomposition. The interaction information evaluates the input cross-modal information and the decoder output. The PC-VAE is trained by minimizing the loss function. Experiments are performed to validate the proposed multimodal transformer for PC-VAE.
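
As a rough illustration of the column-stripe input and the untrained random Gaussian encoders described above, the following sketch splits a toy RGB image into per-channel column stripes and projects each stripe with a fixed Gaussian matrix. The abstract does not give the stripe width, the matrix dimensions, or the variance scaling, so the one-pixel-wide stripes, the 2 x 4 matrix, the N(0, 1/rows) entries, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import random


def column_stripes(image):
    """Split an H x W x 3 image (nested lists) into per-channel column stripes.

    Returns 3 channels; each channel holds W stripes, and each stripe is the
    length-H column of pixel values, so vertical neighbors stay adjacent.
    Assumption: stripes are one pixel wide.
    """
    height, width = len(image), len(image[0])
    return [
        [[image[r][c][ch] for r in range(height)] for c in range(width)]
        for ch in range(3)
    ]


def gaussian_matrix(rows, cols, seed):
    """Fixed (never trained) random matrix with N(0, 1/rows) entries.

    Assumption: the 1/rows variance scaling is a common choice for random
    projections, not something stated in the abstract.
    """
    rng = random.Random(seed)
    scale = rows ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def compress(stripe, matrix):
    """Project one length-H stripe to a shorter vector: y = A x."""
    return [sum(a * x for a, x in zip(row, stripe)) for row in matrix]


# Toy 4 x 3 RGB image whose pixel values are derived from the indices.
image = [[[r, c, r + c] for c in range(3)] for r in range(4)]
stripes = column_stripes(image)

# One hypothetical compression matrix mapping 4-pixel stripes to 2 values.
A = gaussian_matrix(rows=2, cols=4, seed=0)
tokens = [[compress(s, A) for s in channel] for channel in stripes]
```

Here `tokens` holds one compressed vector per stripe and per channel. Because the matrix is drawn once from a seeded generator and never updated, the encoder side of this sketch involves no training, which matches the property the abstract attributes to the PC-VAE encoders.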

updated: Fri Oct 28 2022 14:45:32 GMT+0000 (UTC)

published: Fri Oct 28 2022 14:45:32 GMT+0000 (UTC)

arXiv
