
By Heather Hamilton, contributing writer
 
In a study that has not yet undergone peer review, Google recently described Tacotron 2, a text-to-speech system that it claims comes close to human quality when turning text into audio. The system, now in its second generation, uses two deep neural networks: the first translates text into a spectrogram, and the second, WaveNet (a system from DeepMind, an Alphabet research lab), interprets that spectrogram and generates the audio for playback.
“The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time domain waveforms from those spectrograms,” explains the research publication. “Our model achieves a mean opinion score (MOS) of 4.53, comparable to an MOS of 4.58 for professionally recorded speech.”
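For readers unfamiliar with the metric, a mean opinion score is simply the average of listener ratings. The short Python sketch below illustrates the arithmetic, assuming a 1-to-5 rating scale; the function name and the ratings are hypothetical and are not data from the paper.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings (assumed to be on a 1-to-5 scale)."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return mean(ratings)


# Hypothetical ratings for one audio sample, invented for illustration only.
example_ratings = [5, 4.5, 4, 5, 4.5, 4]
print(mean_opinion_score(example_ratings))
```

A published MOS such as 4.53 would be the same kind of average, taken over many listeners and many samples.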
The quest to generate authentic-sounding speech is ongoing, and Tacotron 2 is the closest that Google has come so far. TechCrunch writes that Google is arguably in the lead now, while noting that many companies are pursuing the same goal.
Quartz offers samples side by side, one generated by AI and the other recorded by an actual human. The samples are not labeled, but Quartz notes that, in the page source on Google’s research website, one file is named “gen” and is likely the generated sample. Both voices sound vaguely machine-like and vaguely human, making them difficult to tell apart by ear.
Tacotron 2 generally handles names and hard-to-pronounce words well and responds to punctuation much as human speakers do. When a word is capitalized, for example, the AI adds emphasis.
Google Assistant first used WaveNet in 2016, and Tacotron 2 is an improved system that could make the assistant more powerful. WaveNet is currently used in the Japanese and English Google Assistants and is distinctive in that it doesn’t require a large database of pre-recorded sounds; it generates its own audio from spectrograms. For now, the system can reproduce only one female voice, and Google would need to train it further to add male or other female voices.
The researchers acknowledge that the system occasionally generates odd noises and struggles to pronounce words that don’t follow regular pronunciation rules, such as “Merlot” and “decorum.” Tacotron 2 doesn’t control tone, either, but, as TechCrunch points out, “accents and other subtleties can be baked in as they could be with WaveNet.”
Sources: “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” TechCrunch, Quartz, Google
Image Source: Pixabay
 




