AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

Guy Yariv; Itai Gat; Lior Wolf; Yossi Adi; Idan Schwartz

In recent years, image generation has made a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This raises the question: "How can we adapt such models to be conditioned on other modalities?" In this paper, we propose a novel method that uses latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. This modeling paradigm requires only a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest that the proposed method is superior to the evaluated baseline methods on both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
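The idea of encoding audio into a single token that sits between the audio and text representations can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how audio features are pooled, what form the adaptation layer takes, or where the token is placed in the prompt. The mean pooling, the single linear layer (`LinearAdapter`), the placeholder position, and all dimensions below are assumptions chosen only to show why the number of trainable parameters can stay small when the audio encoder and the text-to-image model are kept frozen.

```python
import random


def mean_pool(frames):
    """Average per-frame feature vectors into one vector (assumed pooling)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


class LinearAdapter:
    """Hypothetical adaptation layer: audio-feature space -> text-embedding space.

    In this sketch it is the only component with trainable parameters.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def inject_audio_token(prompt_embeddings, placeholder_index, audio_token):
    """Return a copy of the prompt embeddings with one position replaced."""
    conditioned = [list(vec) for vec in prompt_embeddings]
    conditioned[placeholder_index] = list(audio_token)
    return conditioned


# Toy dimensions; real encoders use far larger ones.
AUDIO_DIM, TEXT_DIM = 8, 4
rng = random.Random(1)

# Stand-in for the output of a frozen, pre-trained audio encoder (5 frames).
audio_frames = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(5)]

# Stand-in for the embedded text prompt of a frozen text-to-image model,
# where position 1 is an assumed placeholder slot for the audio token.
prompt_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]

adapter = LinearAdapter(AUDIO_DIM, TEXT_DIM)
audio_token = adapter(mean_pool(audio_frames))
conditioned = inject_audio_token(prompt_embeddings, 1, audio_token)

print("trainable parameters:", adapter.num_parameters())
```

In this sketch the conditioned embedding sequence would then be passed to the frozen diffusion model in place of an ordinary text prompt, so only the adapter's weights would need to be optimized.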

updated: Mon May 22 2023 14:02:44 GMT+0000 (UTC)

published: Mon May 22 2023 14:02:44 GMT+0000 (UTC)

arXiv
