Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Hyungyung Lee; Sungjin Park; Joonseok Lee; Edward Choi

Although deep generative models have attracted considerable attention, most existing work is designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design the Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, and with it we find that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn multimodal semantic correlations in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence; we then perform unconditional image-text pair generation with that code sequence. Extensive experiments on synthetic and real-world datasets show the correlation between the quantized joint space and the multimodal generation capability. In addition, we demonstrate the superiority of our approach over several baselines in both of these aspects. The source code is publicly available at https://github.com/ttumyche/MXQ-VAE.
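The two ingredients named in the abstract, input masking and vector quantization into a code sequence, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the masking scheme, the 2-dimensional "joint features", and the three-entry codebook are all assumptions made for illustration. Only the nearest-codebook lookup reflects the standard VQ-VAE quantization step; the Transformer encoder that would produce the joint features is omitted entirely.

```python
import random


def mask_tokens(tokens, mask_ratio, mask_value, rng):
    """Replace a random subset of positions with mask_value.

    Hypothetical masking scheme for illustration; the abstract does not
    specify how MXQ-VAE masks its inputs.
    """
    n_mask = int(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_value if i in masked_positions else t
            for i, t in enumerate(tokens)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry.

    This nearest-neighbour lookup is the usual VQ-VAE quantization step.
    """
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


# Toy codebook and toy "joint image-text features" (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
joint_features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

# The unified code sequence: one codebook index per joint feature vector.
codes = quantize(joint_features, codebook)

# Mask 30% of a toy token sequence; -1 stands in for a mask token.
masked = mask_tokens(list(range(10)), 0.3, -1, random.Random(0))
```

In this sketch `codes` plays the role of the unified code sequence; a generative model over such sequences would be trained separately and is not shown.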

updated: Fri Oct 14 2022 13:01:42 GMT+0000 (UTC)

published: Fri Apr 15 2022 16:29:55 GMT+0000 (UTC)

arXiv
