微软开源语音生成模型：vall-e(x)

We extend VALL-E to a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis, and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker’s voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates foreign accent problems, which can be controlled by a language ID.

This page is for research demonstration purposes only.

来自 <https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-x/>

Model Overview

VALL-E X can synthesize personalized speech in another language for a monolingual speaker. Taking the phoneme sequences derived from the source and target text, and the source acoustic tokens derived from an audio codec model as prompts, VALL-E X is able to produce the acoustic tokens in the target language, which can be then decompressed to the target speech waveform. Thanks to its powerful in-context learning capabilities, VALL-E X does not require cross-lingual speech data of the same speakers for training and can perform various zero-shot cross-lingual speech generation tasks, such as cross-lingual text-to-speech synthesis and speech-to-speech translation.

来自 <https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-x/>