This blog is focused more on conversational software (personal assistants, virtual agents, chatbots) than on speech / voice recognition technologies. But I thought it would be interesting to explore a product that helps you voice-enable an existing app or website, so that users can interact with it by talking or typing in natural language sentences.
There are lots of choices out there for adding voice control to apps. I looked at Ask Ziggy’s Application Programming Interface (API) offering to see just how it works. Back in early 2012, Ask Ziggy launched a personal assistant for the Windows Phone that was touted as a rival to Apple’s Siri. The Ask Ziggy app is still available on the Windows Phone Marketplace.
In addition to the personal assistant, Ask Ziggy now provides a natural language understanding (NLU) API and development portal. In fact, it looks like the focus of their business has shifted to providing the NLU API and associated platform. The API lets developers build voice-enabled apps that run locally, in the cloud, or in a hybrid of the two.
At the core of the NLU technology is an AT&T Speech Recognition API. There’s a very informative demo video of the NLU API portal on the Ask Ziggy website. Developers make REST-based calls using natural language as input. They receive “action-entity” JSON name/value pairs as output. Developers can then utilize the action-entity data to execute the appropriate function in their app.
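To make the action-entity idea concrete, here's a minimal sketch of handling that kind of JSON output. The payload format and field names ("action", "entities") are my own assumptions for illustration; Ask Ziggy's actual response schema may differ.

```python
import json

# Hypothetical action-entity response for "Play some rock music by Queen".
# The field names here are assumed, not taken from Ask Ziggy's docs.
sample_response = """
{
    "action": "Play",
    "entities": {
        "genre": "rock",
        "artist": "Queen"
    }
}
"""

def parse_nlu_response(raw_json):
    """Pull the action name and the entity name/value pairs out of a response."""
    data = json.loads(raw_json)
    return data["action"], data["entities"]

action, entities = parse_nlu_response(sample_response)
print(action, entities)
```

The app would then use the returned action to decide which function to call, and the entities as that function's arguments.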
Let’s dig into some of the specifics. The first thing you do in the NLU API portal is set up your entities. Entities are the things you want people to be able to search for using voice commands. In the demo, the sample app is a music app and the entities are things like songs, genres, artists, and bands. The API also comes with predefined entities for important data such as location and date/time. These predefined entities are populated by default, so the developer doesn’t have to worry about coding them by hand.
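A music app's entity setup might look something like the sketch below. The entity names, example values, and the split between custom and predefined entities are my own illustration of the concept, not Ask Ziggy's actual configuration format.

```python
# Hypothetical custom entities a music app might register in the portal.
custom_entities = {
    "genre":  ["rock", "jazz", "classical"],
    "artist": ["Queen", "Miles Davis"],
    "song":   ["Bohemian Rhapsody", "So What"],
}

# Predefined entities ship with the platform, so the developer
# only has to declare the app-specific ones above.
predefined_entities = ["location", "datetime"]

def all_entity_types():
    """All entity types the model can recognize, custom plus predefined."""
    return sorted(predefined_entities + list(custom_entities))

print(all_entity_types())
```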
After setting up your entities, your next step is to create Developer Actions. Actions are key functions that the developer will need to perform based on the user requests. Examples of actions for the music app are Play, Skip, and Previous Song. For each action, you need to create some sample sentences that act as patterns for what users are likely to say. Sample sentences for the Play action might include: “Play a rock song” and “I want to hear some jazz.” The portal automatically tags the words Play and Hear as actions and the words Rock and Jazz as genre entities.
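The auto-tagging step can be illustrated with a toy version: scan each sample sentence and label known action words and genre entities. The word lists and the matching logic are a deliberate simplification of mine, not how Ask Ziggy's portal actually works.

```python
# Hypothetical word lists; the real platform learns these from
# the entities and actions the developer has configured.
ACTION_WORDS = {"play", "hear", "skip"}
GENRES = {"rock", "jazz", "classical"}

def tag_sentence(sentence):
    """Return (word, label) pairs for recognized actions and genre entities."""
    tags = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        if word in ACTION_WORDS:
            tags.append((word, "action"))
        elif word in GENRES:
            tags.append((word, "genre"))
    return tags

print(tag_sentence("Play a rock song"))
print(tag_sentence("I want to hear some jazz"))
```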
Once you’ve set up your entities and actions, created your sample sentences, and uploaded your data files, training runs in the background. Training lets the platform work through the samples, compare them against live data, and verify that voice understanding is functional and reliable. The more sample sentences you add, the more reliable the NLU model. The portal also includes a test console feature for building and executing tests: you can input sentences and see what output the model returns. The portal supports versioning as well.
Once the NLU model is working, the API should correctly interpret spoken user input into JSON output. The JSON output contains the entities and actions that the developer needs to execute the correct application function. In the music example, if someone says “Play some rock music by Queen,” the API will return enough information to instruct the application to find a rock song by Queen and start playing it.
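On the app side, that last step is essentially a dispatch table: map each action name to a handler and pass it the entities. A minimal sketch, with handler names and entity keys that are my own assumptions:

```python
# Hypothetical handlers for a music app; in a real app these would
# drive the audio player rather than return strings.
def play(entities):
    genre = entities.get("genre", "any")
    artist = entities.get("artist", "any artist")
    return f"Playing a {genre} song by {artist}"

def skip(entities):
    return "Skipping to the next song"

HANDLERS = {"Play": play, "Skip": skip}

def dispatch(action, entities):
    """Route an action-entity result from the NLU API to the right handler."""
    handler = HANDLERS.get(action)
    if handler is None:
        return "Sorry, I didn't understand that"
    return handler(entities)

print(dispatch("Play", {"genre": "rock", "artist": "Queen"}))
```

New voice commands then mostly mean adding a handler and registering it in the table, which keeps the NLU layer cleanly separated from the app logic.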
I may have oversimplified the description of the Ask Ziggy NLU API, but it seems like a viable tool for developers to use to quickly add voice understanding to their apps. I can even imagine the possibility of combining a more traditional chatbot application with this API. That way, apart from just carrying on a conversation based on pattern matching, the chatbot could interpret commands that are related to entities and actions and use this information to execute specific tasks.