VALL-E is a new text-to-speech AI model developed by Microsoft researchers. It can closely simulate a person’s voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and it attempts to preserve the speaker’s emotional tone. Microsoft calls VALL-E a “neural codec language model,” and it builds on a technology called EnCodec. Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, for speech editing (changing a recording of a person by editing its text transcript), and for audio content creation when combined with other generative AI models.
Microsoft trained VALL-E’s speech-synthesis capabilities on an audio library called LibriLight. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. Some VALL-E results sound computer-generated, but others could be mistaken for human speech, which is the goal of the model. I wonder if there’s a Singaporean accent in the library.
However, given the potential for misuse, Microsoft has not released VALL-E’s code for others to experiment with. The researchers are aware of the social harm this technology could cause and address it in the conclusion of their paper. OK, I guess that’s one less thing to worry about for now.