How Automatic Speech Recognition Works and Learns From You (Infographic)

The development of automatic speech recognition (ASR) technology and its complementary system, interactive voice response (IVR), has been a major milestone in improving how our machines communicate with us and vice versa.

These technologies are used in voice recognition systems for computers (particularly for people with disabilities), in more efficient text editing software and, most notably, in many modern smartphones and other mobile devices.

Samsung S-Voice and Apple Siri
photo credit: Mike Lau

A famous example of this last use of ASR and IVR is the Siri interface found in newer-model Apple iPhones.

Now that we know what we’re talking about (and you’ve almost certainly seen these technologies in use at least at some point), let’s see just how the dual marvels of ASR and IVR manage to work so accurately when we speak to them.

Also, if you want even more information and better context on these fascinating technologies, check out this excellent infographic from the people at West Interactive.

How Automatic Speech Recognition (ASR) works - infographic

The Basics of ASR Technology

The core process by which automatic speech recognition works out what we’re saying in a meaningful, responsive way follows these fairly straightforward steps:

  1. You speak to an ASR-enabled device.
  2. The device creates a waveform from the sounds you make.
  3. The ASR software then cleans up background noise and normalizes sound volume.
  4. The resulting filtered waveform (a clean sound sequence of what you said) is broken down into what are called phonemes. (These are the basic units of sound that distinguish one word from another; English has about 44 of them.)
  5. Each phoneme is like a single link in a chain, and by analyzing them in sequence, your device deduces complete words and then whole sentences that it “understands”.
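
Steps 3 to 5 can be sketched in a few lines of Python. This is a toy illustration only: the function names, the noise threshold, and the tiny pronunciation dictionary (written with ARPAbet-style phoneme labels) are all assumptions made for the example, and real ASR systems rely on statistical models rather than a simple lookup.

```python
# Toy sketch of steps 3-5. All names, thresholds, and dictionary entries
# below are illustrative assumptions, not part of any real ASR product.

def normalize_peak(samples, target=1.0):
    """Scale samples so the loudest one has magnitude `target`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    return [s * target / peak for s in samples]

def noise_gate(samples, threshold=0.05):
    """Zero out samples quieter than `threshold` (a crude stand-in for noise removal)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

# Hypothetical pronunciation dictionary: phoneme sequences -> words.
PRONUNCIATIONS = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("HH", "AY"): "hi",
}

def decode_phonemes(phonemes):
    """Greedy longest-match decoding of a phoneme sequence into words."""
    longest = max(len(key) for key in PRONUNCIATIONS)
    words, i = [], 0
    while i < len(phonemes):
        for size in range(min(longest, len(phonemes) - i), 0, -1):
            word = PRONUNCIATIONS.get(tuple(phonemes[i:i + size]))
            if word is not None:
                words.append(word)
                i += size
                break
        else:
            i += 1  # skip a phoneme that does not start any known word
    return words

cleaned = noise_gate(normalize_peak([0.01, 0.25, -0.5, 0.02]))
sentence = decode_phonemes(["HH", "AH", "L", "OW", "W", "ER", "L", "D"])
print(cleaned)
print(sentence)
```

The greedy lookup makes the “links in a chain” idea concrete: the decoder walks the phoneme sequence in order and emits a word each time a run of phonemes matches a known pronunciation.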

Examples of How ASR is Applied

There are all sorts of uses for ASR technology, but its two primary subdivisions can be labelled “directed dialogue conversations” and “natural language conversations”.

Directed dialogue conversations: These represent a simpler form of ASR/IVR technology and are found in situations where a computer voice asks you to select from a limited menu of word choices that it understands. A familiar example is your typical automated online banking menu.
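
A directed dialogue can be approximated as matching the caller's words against a small, fixed menu. The sketch below assumes a hypothetical banking menu; the option names and their keywords are invented for illustration.

```python
# Hypothetical menu for a directed dialogue; options and keywords are made up.
MENU = {
    "balance": {"balance", "funds"},
    "transfer": {"transfer", "send", "move"},
    "agent": {"agent", "operator", "representative"},
}

def match_menu_choice(transcript):
    """Return the single menu option whose keywords appear in the transcript, or None."""
    words = set(transcript.lower().split())
    matches = [option for option, keywords in MENU.items() if words & keywords]
    # Accept only an unambiguous match; otherwise the system should re-prompt.
    return matches[0] if len(matches) == 1 else None

print(match_menu_choice("Check my balance please"))
print(match_menu_choice("tell me a joke"))
```

Because the system only has to tell a handful of options apart, it can simply ask again whenever the input matches none of them or more than one.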

Natural language conversations: These are a considerably more sophisticated form of ASR and represent systems with which we more openly interact in what is a basic type of conversation. Siri from the iPhone is an excellent example of natural language chats at work.

So How Does Natural Language Work?

Making natural language conversation work effectively is very hard. A typical 60,000-word vocabulary of a natural language ASR program allows as many as 216 trillion possible three-word combinations (60,000 × 60,000 × 60,000)!
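
The 216 trillion figure matches the number of ordered three-word sequences that can be drawn from a 60,000-word vocabulary. The snippet below checks that arithmetic; treating the figure as a count of three-word sequences is an assumption about how it was derived.

```python
vocabulary_size = 60_000
sequence_length = 3  # assumption: the figure counts ordered three-word sequences

combinations = vocabulary_size ** sequence_length
print(f"{combinations:,}")  # 216,000,000,000,000
```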

Thus, to work out what you’re trying to say, an ASR program reacts to a preselected list of tagged keywords that give it context for the gist of your request. For example, if you say the word “forecast”, it will deduce that you’re likely also saying “weather” instead of “whether” and thus want a weather forecast.
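
The “forecast” example can be sketched as a small voting scheme in which each tagged keyword points toward one spelling of a homophone. The keyword table and function name here are hypothetical, and a production system would weigh context probabilistically rather than count votes.

```python
# Hypothetical keyword tags: each context keyword votes for one spelling.
HOMOPHONES = {
    frozenset({"weather", "whether"}): {
        "forecast": "weather",
        "rain": "weather",
        "or": "whether",
    },
}

def resolve_homophone(candidates, context_words):
    """Pick the candidate spelling that the surrounding keywords vote for most."""
    tags = HOMOPHONES.get(frozenset(candidates), {})
    votes = {candidate: 0 for candidate in candidates}
    for word in context_words:
        choice = tags.get(word.lower())
        if choice in votes:
            votes[choice] += 1
    best = max(candidates, key=lambda c: votes[c])
    return best if votes[best] > 0 else None  # None means no keyword gave context

print(resolve_homophone(["weather", "whether"], ["what", "is", "the", "forecast"]))
```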

This is the essential algorithm of natural language ASR at work and it becomes more complicated with larger word vocabularies and keyword lists, thus requiring more training.

The Tuning Test: How ASR is trained to “learn” from you

Any ASR system can either be “tuned” (trained) by humans or can be made to learn on its own on the fly through what is called active learning.

Human Tuning: This consists of human programmers manually reviewing the conversation and word logs of an ASR system to identify which new words and phrases are being used most often, and then adding them to its dictionary as a means of “teaching” the ASR.
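
The log-review part of human tuning can be supported by a simple frequency count that surfaces words missing from the dictionary. This is a minimal sketch: the sample logs, the starting dictionary, and the `min_count` threshold are invented, and the final decision to add a word is left to the human reviewer.

```python
from collections import Counter

def suggest_new_words(logged_utterances, dictionary, min_count=3):
    """List out-of-dictionary words seen at least `min_count` times, most frequent first."""
    counts = Counter(
        word
        for utterance in logged_utterances
        for word in utterance.lower().split()
        if word not in dictionary
    )
    return [word for word, count in counts.most_common() if count >= min_count]

# Hypothetical dictionary and conversation logs.
dictionary = {"call", "play", "my", "the", "a"}
logs = ["play my podcast", "play the podcast", "call a rideshare", "play podcast"]

candidates = suggest_new_words(logs, dictionary, min_count=3)
dictionary.update(candidates)  # stands in for the reviewer approving the additions
print(candidates)
```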

Active Learning: This is a more sophisticated learning process in which your device’s ASR/IVR system is programmed to store and analyze data from past conversations and apply it to new verbal exchanges. Thus, your ASR learns to adapt to your specific speech patterns and interpret them contextually. For example, the system might see that you repeatedly cancel the auto-correct on a certain word and thus learn to interpret that word as “correct” in future conversations.
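
The auto-correct example can be sketched as a counter that stops correcting a word once the user has rejected the correction often enough. The class name, the sample correction table, and the rejection threshold are all assumptions made for illustration, not a description of how any particular assistant works.

```python
from collections import defaultdict

class CorrectionLearner:
    """Stops auto-correcting a word once the user has rejected the correction enough times."""

    def __init__(self, corrections, reject_limit=3):
        self.corrections = dict(corrections)  # e.g. {"lol": "lot"}
        self.reject_limit = reject_limit      # hypothetical threshold
        self.rejections = defaultdict(int)
        self.learned = set()                  # words now accepted as-is

    def interpret(self, word):
        """Return the word as the system currently understands it."""
        if word in self.learned:
            return word
        return self.corrections.get(word, word)

    def record_rejection(self, word):
        """Note that the user cancelled the auto-correct for this word."""
        self.rejections[word] += 1
        if self.rejections[word] >= self.reject_limit:
            self.learned.add(word)

learner = CorrectionLearner({"lol": "lot"}, reject_limit=2)
print(learner.interpret("lol"))   # still corrected
learner.record_rejection("lol")
learner.record_rejection("lol")
print(learner.interpret("lol"))   # now kept as typed
```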