China’s Google equivalent learns to mimic human voice in less than one minute

Deep Voice 3 teaches machines to speak by imitating thousands of human voices from people across the globe

By Jean-Jacques DeLisle, contributing writer

Baidu’s Deep Voice 3 software can clone anyone’s voice. Image source: Pixabay.

Baidu, often described as China's equivalent of Google, recently released a breakthrough in digital voice emulation. The company claims that its new text-to-speech (TTS) system, known as Deep Voice 3, can learn to accurately replicate any human voice using less than one minute of audio. This advancement comes in the midst of a tech race to build more reliable TTS software, with heavy hitters like Google already in the running with its WaveNet project. Adobe is also in the race, having recently unveiled its prototype TTS software, Project VoCo, which can learn to mimic a voice from about 20 minutes of recorded speech.

Baidu's researchers, however, took a different route to the text-to-speech problem. The team built two techniques into its design: speaker adaptation and speaker encoding. The two can be applied separately on different devices or used together, but the bottom line is that they get the job done faster than the competition.

Speaker adaptation uses a backpropagation-based approach to fine-tune a multi-speaker generative model on the new voice, either across the whole model or only in its low-dimensional speaker embeddings. In other words, the program forms a model of the sound of your voice and then runs text-to-speech through that model, simulating with relative accuracy at least the frequency and tone of your voice. This could be used with simpler devices and other programs, letting you set your iHome or Siri to a custom voice.
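To make the speaker adaptation idea concrete, here is a minimal toy sketch in Python. It is not Baidu's code: the "generative model" is reduced to a frozen linear map, the target features are synthetic stand-ins for a voice sample, and every name (`synthesize`, `adapt_embedding`, `true_embedding`) is hypothetical. It only illustrates the principle of fitting a small speaker embedding by gradient descent while the rest of the model stays fixed.

```python
import random

def synthesize(weights, embedding):
    """Frozen toy 'generative model': a linear map from embedding to features."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def adapt_embedding(weights, target, dim, steps=3000, lr=0.02):
    """Fit only the speaker embedding by gradient descent; weights stay frozen."""
    embedding = [0.0] * dim
    for _ in range(steps):
        pred = synthesize(weights, embedding)
        err = [p - t for p, t in zip(pred, target)]
        # Gradient of 0.5 * sum(err^2) with respect to each embedding entry.
        grad = [sum(err[i] * weights[i][j] for i in range(len(weights)))
                for j in range(dim)]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

rng = random.Random(0)
DIM, FEATS = 4, 16  # assumed toy sizes, far smaller than a real system
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(FEATS)]

# Hypothetical "new speaker"; in practice this vector is unknown.
true_embedding = [0.5, -1.0, 0.25, 2.0]
# Stand-in for acoustic features extracted from the speaker's audio sample.
target = synthesize(weights, true_embedding)

learned = adapt_embedding(weights, target, DIM)
print([round(x, 3) for x in learned])
```

Because only a handful of numbers are being tuned, the fit needs very little data, which is the intuition behind cloning from a short sample.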

Speaker encoding works differently: it combines the multi-speaker generative model with a separate model that generates a new speaker embedding directly from the cloning audio. This approach dramatically reduces cloning time to just a few seconds and has very few parameters, meaning that it could be achieved relatively cheaply and then easily deployed to existing devices. Such a form of voice simulation can replicate accents, tones, and subtle nuances in speech, creating a very convincing replication.
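The speaker encoding path can be sketched in the same toy style. This is an illustration under loud assumptions, not Baidu's architecture: the encoder here is just mean pooling over audio frames followed by a linear projection, its weights are random placeholders rather than trained values, and the names `encode_speaker`, `projection`, and `frames` are hypothetical. The point it shows is that the embedding comes from a single forward pass, with no per-speaker training loop.

```python
import random

def encode_speaker(frames, projection):
    """One forward pass: mean-pool frames over time, then project to an embedding."""
    n_feats = len(frames[0])
    pooled = [sum(frame[i] for frame in frames) / len(frames)
              for i in range(n_feats)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in projection]

rng = random.Random(0)
FEATS, DIM = 16, 4  # assumed toy sizes

# Hypothetical encoder weights; a real encoder would learn these from many speakers.
projection = [[rng.uniform(-1, 1) for _ in range(FEATS)] for _ in range(DIM)]
# Stand-in for per-frame acoustic features from a few seconds of audio.
frames = [[rng.gauss(0, 1) for _ in range(FEATS)] for _ in range(200)]

embedding = encode_speaker(frames, projection)
print(len(embedding))
```

Since nothing is optimized at cloning time, the cost is one pass through a small model, which is consistent with the article's point about speed and cheap deployment.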

So what are the implications of this kind of voice cloning? Baidu hopes that it will be useful for all manner of devices, such as iHome or Siri, smartphones, GPS units, and more. Hearing the voice of a loved one, or even your own, guide you through traffic would be much more pleasing than the computerized voice we hear now. But are the applications really that innocent? Wouldn't this technology significantly lower the effectiveness of voice verification security? Could celebrities or politicians have their voices "stolen" and then used for malicious broadcasts or for spreading misinformation? Could someone steal your voice and use it to threaten someone or commit some other crime in your name? Every new technology that we create has both positive and negative applications, and this new TTS technology is no different.
