Developers have a new kind of voice interface option for their consumer products and apps, one that can also put on a face. Speech and vision technology company Sensory just announced a chatbot feature for its speech recognition platform, TrulyNatural. Simply put, the system can animate a face for an AI assistant, with mouth movements synchronized to its speech, and it runs entirely on-device with no need for a live internet connection.
TrulyNatural can give consumer products and applications a voice-driven interface with a more conversational style, according to a statement released by Sensory. The new chatbot support adds dialog management and scripting and is designed to dynamically shape a digital avatar's mouth movements to match the words being spoken. According to Sensory CEO Todd Mozer, the new features let developers create a new kind of visual voice interface for consumer products and apps.
The avatar interface uses a non-linear morphing technology that makes facial and mouth movements between visemes (visual representations of phonemes) look realistic, even though they're completely automated. Behind the chatbot's face, Sensory uses fairly conventional approaches to speech recognition, but ones that do not require a cloud connection. Mozer said his team has proprietary approaches to collapse model size and make robust speech recognition fit in a smaller footprint, allowing it to be embedded in a stand-alone system. “The traditional approaches we deploy include machine-learning techniques, statistical language modeling, a variety of natural language approaches (form filling, language parsing, bag of words, garbage modeling), hidden Markov modeling, deep-learning noise models, and deep-learning acoustic models,” Mozer said.
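To make the viseme idea concrete, here is a minimal illustrative sketch (not Sensory's code or API) of how a phoneme-to-viseme lookup and a non-linear blend between mouth shapes might work. The phoneme table, blend-shape names, and easing curve are all assumptions chosen for clarity.

```python
# Illustrative sketch only: map phonemes to visemes and blend between
# mouth shapes with a non-linear easing curve, so the transition looks
# less mechanical than a straight linear morph.

# Hypothetical phoneme-to-viseme table; real systems use larger inventories.
PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "IY": "wide",       # as in "see"
    "UW": "rounded",    # as in "boot"
    "M":  "closed",     # as in "map"
    "F":  "lip_teeth",  # as in "fan"
}

# Each viseme is a set of blend-shape weights for the avatar's mouth rig.
VISEME_SHAPES = {
    "open":      {"jaw_open": 0.9, "lip_round": 0.1},
    "wide":      {"jaw_open": 0.3, "lip_round": 0.0},
    "rounded":   {"jaw_open": 0.4, "lip_round": 0.9},
    "closed":    {"jaw_open": 0.0, "lip_round": 0.2},
    "lip_teeth": {"jaw_open": 0.1, "lip_round": 0.0},
}

def ease(t: float) -> float:
    """Smoothstep easing: a simple non-linear curve from 0 to 1."""
    return t * t * (3.0 - 2.0 * t)

def blend(current: str, target: str, t: float) -> dict:
    """Interpolate mouth blend-shape weights from one viseme to the next."""
    a, b, w = VISEME_SHAPES[current], VISEME_SHAPES[target], ease(t)
    return {key: a[key] + (b[key] - a[key]) * w for key in a}

# Example: halfway through the transition from "M" (closed) to "AA" (open).
frame = blend(PHONEME_TO_VISEME["M"], PHONEME_TO_VISEME["AA"], 0.5)
print(frame)  # {'jaw_open': 0.45, 'lip_round': 0.15}
```

The non-linear easing is the point of the sketch: rather than snapping linearly between mouth positions, the curve slows the motion at the start and end of each transition, which is one simple way to make automated lip movement read as more natural.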
While cloud-based speech recognition is all the hype right now, Mozer told Electronic Products in an interview, staying off the cloud provides a number of advantages: faster response, consistent availability, lower system cost, and lower bandwidth cost. “And usage data is kept private,” said Mozer. The self-contained system cannot recognize as many words as a cloud system, he added, but “Sensory is not trying to make a general-purpose assistant. It’s a domain-specific approach that can be applied to household products or a kiosk.”
One goal of the chatbot, which targets consumer devices and mobile apps, is to streamline business transactions. For example, a fast-food chain could have an avatar that takes and confirms orders. According to Mozer, this approach could be less prone to errors, and, unlike a human employee, the avatar wouldn’t get tired. For the consumer, it could mean shorter waits in line.
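The fast-food example lines up with the “form filling” approach Mozer mentions: a small, domain-specific grammar fills order slots and confirms them. The sketch below is a hypothetical illustration of that idea, not Sensory's dialog-scripting API; the menu, slot names, and prompts are invented for the example.

```python
# Hypothetical sketch of domain-specific slot filling for an order-taking
# avatar. Keyword spotting over a small menu stands in for full speech
# recognition; the dialog simply asks for whatever slot is still missing.

MENU = {"burger", "fries", "shake"}
SIZES = {"small", "medium", "large"}

def fill_order(utterance: str, order: dict) -> dict:
    """Fill item/size slots from a recognized utterance."""
    for word in utterance.lower().split():
        if word in MENU:
            order["item"] = word
        elif word in SIZES:
            order["size"] = word
    return order

def next_prompt(order: dict) -> str:
    """Prompt for the next missing slot, then confirm the order."""
    if "item" not in order:
        return "What would you like to order?"
    if "size" not in order:
        return f"What size {order['item']} would you like?"
    return f"Confirming one {order['size']} {order['item']}. Is that right?"

order = {}
order = fill_order("I'll take a large fries please", order)
print(next_prompt(order))  # Confirming one large fries. Is that right?
```

Because the vocabulary is limited to a single domain, a grammar like this can stay small enough to run entirely on-device, which is the trade-off Mozer describes against a general-purpose cloud assistant.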
Pairing a face with a voice is only the beginning, though. The future of AI, Mozer said, will take many forms. Some products will offer just a talking assistant, while other digital assistants will have faces and personalities. Sometimes the experience will be embedded; other times, it will happen in the cloud. There isn’t any one right approach, because different situations present different needs, and the technology will continue to evolve.
“Ironically, Sensory saw the concept of a wake word as a niche area 10 years ago,” said Mozer. “Everyone was hitting buttons to call up recognizers and we came up with an approach where you could just call the device. I think Sensory actually developed the first Siri and Google triggers.” Now wake words are commonplace, and perhaps one day chatbot avatars will be, too.
However it evolves, the future of AI looks promising.