EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen; Yunhao Gou; Runhui Huang; Zhili Liu; Daxin Tan; Jing Xu; Chunwei Wang; Yi Zhu; Yihan Zeng; Kuo Yang; Dingdong Wang; Kun Xiang; Haoyuan Li; Haoli Bai; Jianhua Han; Xiaohui Li; Weike Jin; Nian Xie; Yu Zhang; James T. Kwok; Hengshuang Zhao; Xiaodan Liang; Dit-Yan Yeung; Xiao Chen; Zhenguo Li; Wei Zhang; Qun Liu; Jun Yao; Lanqing Hong; Lu Hou; Hang Xu

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, we propose a lightweight style module for flexible control of speech style (e.g., emotion and pitch). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.

updated: Tue Oct 29 2024 06:25:52 GMT+0000 (UTC)

published: Thu Sep 26 2024 16:44:02 GMT+0000 (UTC)

arXiv

