
Ever wonder how Alexa and Siri work? Here’s voice control explained

We can talk to our gadgets now, but how exactly do they work?

Voice-driven computing on our gadgets now gets a significant amount of attention. We can ask, “What song is playing?” or “Call Mom,” and modern technology delivers a response from our smartphones. Apple, Amazon, Microsoft, and Google are among the top companies that offer a way to speak to your electronics. By now, we know who they are: Siri, Alexa, Cortana, and the nameless “Ok, Google” responder.


But this raises a question: How does a device take spoken words and translate them into commands it can understand? In a nutshell, it’s done by pattern matching and making predictions based on those patterns. More specifically, voice recognition is a multifaceted task built on Acoustic Modeling and Language Modeling, both powered by machine-learning algorithms.

Acoustic Modeling is the process of taking a waveform of speech and analyzing it with statistical models. Hidden Markov Modeling is the most common method used in Acoustic Modeling; using what’s referred to as pronunciation modeling, the technique breaks speech down into component parts called phones (note: not the actual phone devices). A Hidden Markov Model is a predictive statistical model in which the hidden states, here the phones being spoken, are inferred from the observable output, the audio signal.


For example, a MATLAB voice recognition model compares each part of the waveform against what comes before and after it, and against a dictionary of waveforms, to decipher what is being said. Breaking it down further, if you make a “th” sound, the model checks that sound against the sounds that most probably come before or after it. This means it would most likely check against the “e” sound or the “at” sound first, and then so on until a match is found.
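To make that concrete, here is a minimal Hidden Markov Model sketch in Python. The phone set, frame labels, and probabilities are all invented for illustration; a real acoustic model learns these values from huge amounts of recorded speech.

```python
# Toy Hidden Markov Model: hidden states are phones, observations are acoustic frames.
# All probabilities below are hand-picked for illustration, not trained values.

phones = ["th", "e", "at"]

# P(next phone | current phone): "th" is usually followed by "e" or "at"
transition = {
    "th": {"th": 0.05, "e": 0.60, "at": 0.35},
    "e":  {"th": 0.30, "e": 0.40, "at": 0.30},
    "at": {"th": 0.45, "e": 0.35, "at": 0.20},
}

# P(observed frame | phone): how well each chunk of audio matches each phone
emission = {
    "th": {"frame_hiss": 0.7, "frame_vowel": 0.1, "frame_stop": 0.2},
    "e":  {"frame_hiss": 0.1, "frame_vowel": 0.8, "frame_stop": 0.1},
    "at": {"frame_hiss": 0.2, "frame_vowel": 0.4, "frame_stop": 0.4},
}

initial = {"th": 0.5, "e": 0.25, "at": 0.25}


def viterbi(observations):
    """Return the most probable phone sequence for a list of acoustic frames."""
    # best[t][phone] = (probability of the best path ending in phone at time t, previous phone)
    best = [{p: (initial[p] * emission[p][observations[0]], None) for p in phones}]
    for frame in observations[1:]:
        step = {}
        for p in phones:
            prob, prev = max(
                (best[-1][q][0] * transition[q][p] * emission[p][frame], q)
                for q in phones
            )
            step[p] = (prob, prev)
        best.append(step)

    # Trace the highest-probability path backwards through the table
    last = max(phones, key=lambda p: best[-1][p][0])
    path = [last]
    for step in reversed(best[1:]):
        path.append(step[path[-1]][1])
    return list(reversed(path))


print(viterbi(["frame_hiss", "frame_vowel", "frame_stop"]))
# -> ['th', 'e', 'at']
```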

While Acoustic Modeling is helpful in getting the computer to understand you, what about homonyms or regional variations in pronunciation? The answer is Language Modeling. Google has invested a considerable amount of research in this, in particular through the use of N-gram Modeling. When Google Voice interprets your speech, it uses models derived from its immense bank of Voice Search and YouTube transcriptions. The tech giant also used GOOG-411 to collect data on how people speak.

N-gram Modeling is based on probabilities, but it also uses an existing dictionary of words to narrow the realm of possibilities. The method eliminates much of the uncertainty left over from Hidden Markov Modeling. Its strength comes from having such a large dictionary of words and their usage, not just raw sounds. That gives the program the ability to differentiate between homophones, such as “beat” and “beet.” It’s also contextual, meaning that when you’re talking about last night’s game score, the program won’t bring up something entirely unrelated.
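To see the idea in miniature, here is a toy bigram (two-word N-gram) model in Python. The four-sentence corpus and the counts derived from it are made up purely for illustration and are nothing like the scale Google works at, but they show how word context alone separates “beat” from “beet.”

```python
# Toy bigram language model: count how often each word follows each other word,
# then use those counts to pick between homophones the acoustic model can't separate.

from collections import defaultdict

# Tiny hand-written corpus standing in for a real, massive text collection
corpus = [
    "they beat the spread in last night's game",
    "the team beat the record",
    "she sliced a beet for the salad",
    "roast the beet before serving",
]

bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1
        unigram_counts[prev] += 1


def bigram_prob(prev, word):
    """P(word | prev) estimated from the toy corpus, with add-one smoothing."""
    vocab = {w for s in corpus for w in s.split()}
    return (bigram_counts[prev][word] + 1) / (unigram_counts[prev] + len(vocab))


for context in ["team", "a"]:
    for candidate in ["beat", "beet"]:
        print(f"P({candidate!r} after {context!r}) = {bigram_prob(context, candidate):.2f}")
# After "team", "beat" scores higher; after "a" (as in "sliced a ..."), "beet" wins.
```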


Anyone who has used Siri over a slow network connection knows the frustration. Commands are sent over the network and decoded on Apple’s servers. Cortana on Windows Phone likewise requires a network connection to function. Amazon’s Echo is no exception: without an Internet connection it is little more than a simple Bluetooth speaker, because it relies on the cloud-based Alexa Voice Service to interpret commands.

The reason is that Siri, Cortana, and Alexa all rely on heavy-duty servers to decode your speech. Sure, the decoding could be done on your tablet or smartphone, but that would kill performance and battery life in the process. From a technical standpoint, it makes more sense to offload the work to dedicated machines.
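As a rough sketch of that division of labor, the snippet below has the device do nothing but record and upload audio while a remote service returns the transcript. The endpoint URL and the plain-text response here are assumptions invented for this example; they do not represent Apple’s, Microsoft’s, or Amazon’s actual APIs.

```python
# Sketch of offloading speech recognition: the device only uploads audio;
# a (hypothetical) cloud endpoint does the heavy decoding and returns text.

import urllib.request


def recognize_remotely(wav_bytes: bytes) -> str:
    """Send raw audio to a hypothetical speech-to-text service and return the transcript."""
    request = urllib.request.Request(
        "https://speech.example.com/v1/recognize",  # placeholder endpoint, not a real API
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    # Without a network connection this call fails, which is exactly why Siri,
    # Cortana, and Alexa feel broken when the link is slow or down.
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


# Usage: read a short recording from disk and print the transcript
# with open("command.wav", "rb") as f:
#     print(recognize_remotely(f.read()))
```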

Now that we understand the fundamental concepts and models of voice recognition, it’s time to play around with our devices. You can set your Mac up to use voice commands, try voice typing in Google Docs, or set up voice ordering on your Amazon Echo.

Source: MakeUseOf
