
Speech Technology and Apps
The role of speech technology on mobile phones
The WIMP interface (Window, Icon, Menu, and Pointing device), the familiar Graphical User Interface, has proved effective on PCs and in Web browsers. Its most powerful aspect is familiarity: one can usually find a feature or piece of data eventually by navigating through windows or pages of information. “Eventually” and “usually” are increasingly the operative words, however, as the number of features and the amount of information we encounter grow exponentially. This PC paradigm is unlikely to transfer well to small mobile devices, where the screen is more like a porthole than a window.
Nevertheless, a WIMP interface complements speech, allowing data to be delivered as text or graphically, and providing a fallback solution when the environment is not conducive to speaking. The Mobile Voice philosophy is an alternative, and can serve as the primary model for the user: “Say or type what you want.”
Audio search (audio mining, speech analytics): Searching an audio source or sources for specific content (e.g., keywords or subjects). Speech analytics goes beyond looking for single occurrences of phrases to a broader analysis of the context of a search phrase, and may use metadata (text sources that label the file or its location, for example).
Dialog (dialogue): In an automated speech context, a turn-taking conversation between a person and an automated system to move toward a goal.
Grammars (defined grammars): In a speech recognition system, a specification, compiled by a designer/developer, of the range of possible responses by a speaker in a particular context, e.g., in response to a prompt such as, “What is your account number?”
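As a rough illustration of a defined grammar, the sketch below accepts only responses that fit a narrow pattern for the account-number prompt. The six-digit length, the word list, and the function name parse_account_number are assumptions made for this example. A production system would express the grammar in its recognizer's own format instead of in Python.

```python
from typing import Optional

# Spoken digit words the grammar allows (an assumed, illustrative vocabulary).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}
ACCOUNT_LENGTH = 6  # assumed length, chosen only for this example


def parse_account_number(utterance: str) -> Optional[str]:
    """Return the digit string if the utterance fits the grammar, else None."""
    words = utterance.lower().split()
    if len(words) != ACCOUNT_LENGTH:
        return None  # wrong number of words: outside the grammar
    if any(word not in DIGIT_WORDS for word in words):
        return None  # contains a word the grammar does not allow
    return "".join(DIGIT_WORDS[word] for word in words)
```

Anything outside the grammar, such as "my number is four", is rejected instead of being guessed at, which is the trade-off a tightly defined grammar makes.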
Hidden Markov Models (HMM): A statistical method at the heart of much of today’s speech recognition technology.
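To give a sense of what this statistical method computes, here is a minimal sketch of the forward algorithm, which scores how likely a sequence of observations is under an HMM. The two hidden states, the "quiet"/"loud" observations, and all probabilities are invented for illustration. Real recognizers use far larger models over acoustic features.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of a non-empty observation sequence under the HMM."""
    # alpha[s] = P(observations so far, and the current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    # Sum over the final hidden state to get the total sequence probability.
    return sum(alpha.values())


# Toy model with made-up numbers.
STATES = ("silence", "speech")
START_P = {"silence": 0.6, "speech": 0.4}
TRANS_P = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
EMIT_P = {
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.2, "loud": 0.8},
}

likelihood = forward_probability(["quiet", "loud"], STATES, START_P, TRANS_P, EMIT_P)
```

The hidden states are never observed directly. The algorithm sums over every path through them, which is what makes the model "hidden".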
Interactive Voice Response (IVR): A telephony platform for dialog systems, with touch-tone interaction and recorded voice response at a minimum, and with speech recognition and text-to-speech as increasingly common options.
IP Telephony: A growing trend toward “convergence” of telephony and computer/Internet standards that makes telephony less of a specialized technology. The trend doesn’t specifically imply speech technologies, but makes them easier to add.
Multimodal (multi-modal): In the context of speech technology, mixing speech input or output with other modes of user input or system output, e.g., keyboards or a stylus for user input or text display for system output. Most telephone applications are multimodal if one includes the keypad (touch-tone interaction), but this adjective usually refers to modalities that include more than voice and touch-tone.
Speaker verification (speaker authentication, speaker recognition, Voice ID, voiceprints): A biometric identification method based on the characteristics of a person’s voice, sometimes supplemented by requiring content (such as a password or account number) known only to the person. Normally, the speaker makes a claim of identity (e.g., through an account number), and the system verifies that claim. However, speaker recognition can also refer to determining which of a number of potential speakers a voice belongs to, which often requires a different technical approach.
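The difference between verifying a claimed identity and picking one speaker out of many can be sketched as follows. Everything here is hypothetical: the three-number "voiceprints", the account IDs, the cosine-similarity comparison, and the threshold are stand-ins for whatever features and scoring a real system uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical enrolled voiceprints, keyed by account ID.
ENROLLED = {
    "acct-1001": [0.9, 0.1, 0.3],
    "acct-2002": [0.1, 0.8, 0.5],
}
THRESHOLD = 0.95  # assumed acceptance threshold


def verify(claimed_id, sample):
    """One-to-one check: does the sample match the claimed identity?"""
    voiceprint = ENROLLED.get(claimed_id)
    if voiceprint is None:
        return False
    return cosine_similarity(voiceprint, sample) >= THRESHOLD


def identify(sample):
    """One-to-many search: which enrolled speaker is the closest match?"""
    return max(ENROLLED, key=lambda sid: cosine_similarity(ENROLLED[sid], sample))
```

Verification compares the sample against a single enrolled voiceprint, while identification must compare it against all of them, which is one reason the two tasks can call for different approaches.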
Speech Recognition (Automated Speech Recognition, ASR, speech-to-text, voice recognition): Automated recognition of the content of speech for the purposes of representing it as text or taking an appropriate action.
Spoken language identification: Determining the language being spoken, either in an interactive application or as batch processing.
Statistical Language Model (SLM): A specification of what the user may speak that is less constrained than defined grammars. Typically, an SLM is created from a text database of typical responses or finished text, and generalizes those examples.
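One simple way to build a model from example text that generalizes beyond the examples is a bigram model with add-one smoothing, sketched below. This is only one possible form of SLM. The class name, the training sentences, and the smoothing choice are assumptions for illustration.

```python
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each word is a context
        self.bigram_counts = Counter()   # how often each word pair occurs
        self.vocab = set()               # words the model can predict
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.vocab.update(tokens[1:])
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def probability(self, prev, word):
        """Smoothed P(word | prev); never zero for words in the vocabulary."""
        return (self.bigram_counts[(prev, word)] + 1) / (
            self.context_counts[prev] + len(self.vocab)
        )

    def score(self, sentence):
        """Probability of the whole sentence under the model."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        result = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            result *= self.probability(prev, word)
        return result


# Hypothetical "typical responses" used as training text.
model = BigramModel(["check my balance", "check my account", "pay my bill"])
```

A sentence such as "pay my balance" never appears in the training text, yet it receives a non-zero score, and "check my bill" scores higher than the scrambled "bill my check". A defined grammar, by contrast, would accept only what the designer listed.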
Text-To-Speech (TTS, text-to-speech synthesis): Given text, automatically speaking that text in a synthetic voice (typically using a phonetic dictionary and letter-to-sound rules for words not in the dictionary).
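The dictionary-plus-fallback step can be sketched as a lookup followed by letter-to-sound rules for unknown words. The two-word lexicon, the phoneme symbols (loosely ARPAbet-style), and the deliberately naive one-letter-at-a-time rules are all assumptions for illustration. Real letter-to-sound rules take context into account.

```python
# Hypothetical mini-lexicon of known pronunciations.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "phone": ["F", "OW", "N"],
}

# Naive fallback rules: each letter maps to a fixed sound (illustrative only).
LETTER_TO_SOUND = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def to_phonemes(word):
    """Look the word up in the lexicon; fall back to letter-to-sound rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes = []
    for letter in word:
        # Characters without a rule (digits, punctuation) are skipped.
        phonemes.extend(LETTER_TO_SOUND.get(letter, "").split())
    return phonemes
```

With these rules, "phone" comes from the lexicon as F OW N, whereas spelling it out letter by letter would give a wrong pronunciation. That is why the dictionary is consulted first and the rules serve only as a fallback.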
Voice Search: In its broadest use, a user interface philosophy, summarized roughly as taking advantage of advances in speech technology to reduce the amount of dialog required to achieve a task. The alternative is a deep hierarchy, which simplifies the speech recognition problem at the expense of making navigation slow and/or non-intuitive. In a narrower use, the term can refer to initiating a Web search by speaking the search terms rather than typing them.
Voice User Interface (VUI): The collective term for using speech technology to interact with a user.
VoiceXML (Voice eXtensible Markup Language): A standard for speech dialog systems, particularly oriented toward telephony.