
Scientists create lip-reading software with 93.4% accuracy, nearly twice that of experienced humans

LipNet applies machine-learning principles to boost lip-reading accuracy far beyond human performance


As critical a role as lip reading plays in understanding human speech, humans are surprisingly bad at it: even trained lip-readers score no better than about 52% accuracy, and the average person fares far worse. A machine-learning system called LipNet upends conventional lip-reading techniques, achieving an impressive 93.4% accuracy.

Developed by researchers in the University of Oxford’s Department of Computer Science, LipNet isn’t the first software to take a crack at the communication problem by analyzing lip movement, but it is the first to apply deep learning end-to-end to entire sentences.

Why are humans so bad at lip reading when the McGurk effect shows how essential visual cues are to understanding speech? The McGurk effect is a perceptual phenomenon that demonstrates an interaction between hearing and vision in speech perception: seeing a person speak changes the way we hear sound, or think we hear it. In the classic demonstration, audio of someone saying “ba” dubbed over video of them mouthing “ga” is usually heard as “da.” Sight overrides sound.

Previous solutions aiming to decode text from the movement of a speaker’s mouth used a two-stage approach: first designing and learning the visual features, then making the prediction one word at a time.

“LipNet is a neural network architecture for lip-reading that maps variable-length sequences of video frames to text sequences and is trained end-to-end. Advancements in automatic speech recognition (ASR) owe their success to contemporary deep learning, many of whose advances occurred in the context of ASR. The connectionist temporal classification (CTC) loss drove the movement from deep learning as a component of ASR to deep ASR systems trained end-to-end. Much lip-reading progress has mirrored early progress in ASR, but has stopped short of sequence prediction.”

Translation: rather than analyzing footage word by word, LipNet takes the entire spoken sentence into consideration and uses deep learning to decode each word from full-sentence context, a sensible move given that human lip-reading performance also increases for longer words. No lip-reading technique up to this point has performed sentence-level sequence prediction.
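The paper’s recipe, stated plainly, is spatiotemporal convolutions over the video frames, a recurrent network that reads the whole frame sequence, and the CTC loss to align per-frame predictions with the target sentence. The sketch below, written in PyTorch, is a minimal illustration of that recipe rather than the authors’ implementation; the layer sizes, the 50x100-pixel mouth crops, and the 28-character vocabulary are all assumptions chosen for demonstration.

import torch
import torch.nn as nn

class LipReadingSketch(nn.Module):
    def __init__(self, vocab_size=28):  # 26 letters + space + CTC blank (index 0)
        super().__init__()
        # 3D convolutions mix space and time, so each feature already
        # encodes short-range lip motion, not just a static mouth shape.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),  # pool spatially, preserve the time axis
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # A bidirectional GRU reads the whole frame sequence in both
        # directions, which is what provides sentence-level context.
        self.gru = nn.GRU(64 * 12 * 25, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, vocab_size)  # per-frame character logits

    def forward(self, video):  # video: (batch, channels=3, frames, 50, 100)
        f = self.conv(video)                  # (B, C, T, H=12, W=25)
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(f)                  # (B, T, 512)
        return self.fc(out)                   # (B, T, vocab_size)

model = LipReadingSketch()
video = torch.randn(2, 3, 75, 50, 100)        # two 75-frame mouth-crop clips
log_probs = model(video).log_softmax(-1).permute(1, 0, 2)  # CTC wants (T, B, V)

# CTC aligns per-frame predictions with the target character sequence
# without any frame-level labels, which is the key to training the whole
# pipeline end-to-end on (video, sentence) pairs.
targets = torch.randint(1, 28, (2, 30))       # dummy character indices
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 75, dtype=torch.long),
    target_lengths=torch.full((2,), 30, dtype=torch.long),
)
loss.backward()

The CTC loss is what makes end-to-end training possible here: nobody has to label which video frame corresponds to which letter, because the loss sums over every plausible alignment and lets the network learn the timing on its own.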

What impact does this have for the rest of us outside academia? Running the software on a smartphone, or on a later-to-come wearable like Google Glass linked to a body camera, could give hearing-impaired individuals another tool for understanding speech. Regardless of how competent someone is at lip-reading, offloading some of that concentration onto a machine can only improve accuracy.

Readers familiar with George Orwell’s frightening novel 1984 may reach a slightly different conclusion about LipNet’s practicality: surveillance. One of the novel’s most unsettling elements was the telescreen, a two-way television-like device that could watch and listen. In the wrong hands, software like LipNet could read lips nearly without error, leaving no secret unheard.

To quote the novel:

“There was, of course, no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate, they could plug in your wire whenever they wanted to. You had to live – did live, from habit that became instinct – in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized.”

Source: arXiv.org
