Microsoft’s new AI simulates anyone’s voice from a three-second sample


VALL-E is a new text-to-speech AI model developed by Microsoft researchers. Given a three-second audio sample, it can closely simulate a person’s voice. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and it attempts to preserve the speaker’s emotional tone. Microsoft calls VALL-E a “neural codec language model,” and it builds on a technology called EnCodec. Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, for speech editing in which a recording of a person is changed by editing its text transcript, and for audio content creation when combined with other generative AI models.

[Image: VALL-E. Image by Microsoft]

Microsoft trained VALL-E’s speech-synthesis capabilities on an audio library called LibriLight. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. Some VALL-E results sound computer-generated, but others could be mistaken for human speech, which is the goal of the model. I wonder if there’s a Singaporean accent in the library.

However, given the potential risks of misuse, Microsoft has not provided VALL-E code for others to experiment with. The researchers are aware of the potential social harm that this technology could bring and address it in the conclusion of their paper. OK, I guess that’s one less thing to worry about now.
