Dr. Dobb's Journal September 1997
As humans, we see, hear, feel, and smell. Human interaction is enriched by the concomitant redundancy introduced by multimodal communication. In contrast, computer interfaces have relied primarily on visual interaction -- today's interfaces are like the silent movies of the past. As we approach the turn of the century, computers now have the ability to talk, listen, and perhaps even understand. Integrating new modalities such as speech into human-computer interaction requires rethinking how applications are designed in today's world of visual computing.
In this article, I'll describe Emacspeak -- a full-fledged speech interface that provides a powerful audio desktop. Emacspeak was originally designed to provide visually impaired users with productive access to the wealth of network computing resources available on UNIX platforms.
However, the user-interface innovations introduced by Emacspeak have a wider relevance to user-interface design for information appliances of the future. The key innovation is the speech-enabling approach used in implementing Emacspeak -- a technique that allows the separation of the computational component of an application from its user interface.
Screen readers, software that lets a visually impaired user read the contents of a visual display, have been available for more than a decade. I used traditional screen-reading applications for five years (1990-1995). Screen readers are separate from the user application. Consequently, they have little or no contextual information about the contents of the display.
I implemented the speech-enabling approach in Emacspeak to overcome many of the shortcomings encountered with traditional screen readers. Screen readers allow users to listen to the contents appearing in different parts of the display, but users are entirely responsible for building a mental model of the visual display to interpret what an application is trying to convey.
Emacspeak, on the other hand, does not speak the screen: It provides speech feedback from within the application, allowing the application to render information in a way that is both visually and aurally pleasing. Emacspeak audio-formats the text (analogous to visual formatting) and augments the spoken text with nonspeech auditory icons to succinctly convey information other than the text itself (such as events triggered by the application). This vastly improves the quality of the spoken feedback.
Emacspeak is available and described in detail at http://cs.cornell.edu/home/raman/emacspeak/. The design and implementation of high-quality auditory interfaces is covered in my book Auditory User Interfaces: Toward The Speaking Computer (Kluwer Academic Publishers, August 1997). In this article, I'll focus on the design and implementation of Emacspeak with an emphasis on the software techniques used in implementing the system. A salient feature of this implementation is the extensive reuse of existing functionality within emacs.
Emacspeak provides core speech services along with application-specific extensions. These extensions expose the functionality provided by the application via an auditory interface. Speech-enabling extensions are designed to be independent of the underlying implementation of the application.
The Emacspeak design is motivated by the following goals:
Figure 1 provides an overview of the Emacspeak architecture. Emacspeak is implemented as a collection of layers, where each layer depends on the layers beneath it. All but the lowest layers of the system are device independent. The implementation includes a speech server (currently written in Tcl) for communicating with the speech device. In this architecture, Emacspeak is a speech client that relies on the speech server for providing a device-independent interface to the speech output device.
The advantages of this architecture are:
In the case of a hardware speech device, a device-specific script acts as a device driver. In the case of a software speech synthesizer, speech API calls can be directly exposed through the Tcl shell. Speech clients, such as Emacspeak, function without being aware of how the speech is being produced. Since the speech clients do not depend on the scripting language used to expose the speech API, languages other than Tcl can be used to implement the speech server.
Speech-synthesis systems typically allow client applications to queue text to be spoken, interrupt and flush ongoing speech, and set speech-synthesis parameters that influence voice characteristics and pronunciation. Emacspeak can interface to either a hardware speech synthesizer or a software speech-synthesis library, since both are encapsulated by the speech server. When interfacing with a hardware synthesizer, the device-specific file implements speech server commands that send appropriate control codes to the device. When interfacing to a software speech-synthesis library, these same server commands are provided by implementing Tcl wrappers around the speech library functions. In either case, Emacspeak remains unaware of the distinction.
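The client/server split described above can be sketched in a few lines. This is only an illustration in Python, not the Tcl server that ships with Emacspeak; the class names, method names, and control strings (`HardwareBackend`, `SoftwareBackend`, `[ctrl:say]`, and so on) are all hypothetical. The point is that the client code is identical for both kinds of device.

```python
class HardwareBackend:
    """Stand-in for a device driver that sends control codes to a synthesizer."""

    def __init__(self):
        self.log = []  # records what would be written to the device

    def say(self, text):
        self.log.append(f"[ctrl:say]{text}")  # made-up control code

    def stop(self):
        self.log.append("[ctrl:flush]")       # made-up control code

    def set_rate(self, words_per_minute):
        self.log.append(f"[ctrl:rate={words_per_minute}]")


class SoftwareBackend:
    """Stand-in for wrappers around a software speech-synthesis library."""

    def __init__(self):
        self.log = []  # records the library calls that would be made

    def say(self, text):
        self.log.append(("synthesize", text))

    def stop(self):
        self.log.append(("flush",))

    def set_rate(self, words_per_minute):
        self.log.append(("set_rate", words_per_minute))


class SpeechServer:
    """Device-independent commands: queue text, dispatch it, stop, set rate."""

    def __init__(self, backend):
        self.backend = backend
        self.pending = []

    def queue(self, text):
        self.pending.append(text)

    def dispatch(self):
        for text in self.pending:
            self.backend.say(text)
        self.pending.clear()

    def stop(self):
        self.pending.clear()  # drop queued text, then silence the device
        self.backend.stop()

    def set_rate(self, words_per_minute):
        self.backend.set_rate(words_per_minute)


def client_speak(server, chunks):
    """A speech client: flush ongoing speech, then queue and speak new chunks."""
    server.stop()
    for chunk in chunks:
        server.queue(chunk)
    server.dispatch()
```

Calling `client_speak(SpeechServer(HardwareBackend()), ["hello"])` and `client_speak(SpeechServer(SoftwareBackend()), ["hello"])` exercises the same client code; only the backend's log differs.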
The core services provided by Emacspeak have proven adequate in implementing the audio desktop. Designing a good set of core services that pervade the environment has encouraged code reuse. This was of paramount importance, since I developed Emacspeak entirely in my spare time. Use of these core services by the rest of the audio desktop has also ensured a consistent sound and feel throughout the Emacspeak environment.
The Emacspeak desktop is composed of a collection of modules that can be classified as:
Here is a brief overview of the basic speech and nonspeech audio services available on the Emacspeak desktop. These services help isolate the rest of the system from all device dependencies.
Spoken Output. The basic speech services implement functions that can speak logical chunks of information (words, lines, or paragraphs of text). By default, all these functions flush ongoing speech immediately; as a result, clients can expect a prompt response.
Functions that speak larger units of information first split the text to be spoken into logical chunks before queuing it to the speech device. These logical chunks are determined by clause boundaries, and higher-level applications can flexibly set these clause boundaries based on context. Thus, the techniques used to split up a paragraph of English prose can be significantly different from those used to split up a piece of C program code. This helps Emacspeak produce a high level of intonational structure in the spoken output.
Custom Pronunciations. Most speech-synthesis systems allow users to provide a custom pronunciation dictionary that is used to tailor the way specific words or phrases are pronounced. However, such user-defined custom dictionaries are global in effect; once defined, a pronunciation remains active until changed.
In practice, this behavior is undesirable. For example, when speaking a date, the correct pronunciation for "Jan" is "January." However, defining this as a custom pronunciation would result in all occurrences of "Jan" being spoken as "January" -- clearly, an undesired effect. Simple cases such as speaking dates can be treated as a special case; but the example just mentioned illustrates a more general problem: Pronunciation rules are tightly coupled to the context of an utterance. A custom pronunciation dictionary that is globally applicable is therefore inappropriate for the audio desktop.
Since all speech-enabled applications on the Emacspeak desktop have complete access to the current context of the utterance to be synthesized, the quality of spoken output is vastly improved by enabling these applications to supply their own custom dictionaries.
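As a rough illustration of context-specific pronunciations, the following Python sketch applies a dictionary only when the caller supplies one for the current context. The function and dictionary names are hypothetical and are not taken from Emacspeak; the real system is written in Emacs Lisp.

```python
import re

# Hypothetical dictionary that a date-aware application might supply.
DATE_PRONUNCIATIONS = {"Jan": "January", "Feb": "February"}


def apply_pronunciations(text, dictionary):
    """Replace whole words that appear in the given context dictionary."""
    if not dictionary:
        return text
    # \b anchors keep "Jan" from matching inside "Janet" or "January".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in dictionary) + r")\b"
    )
    return pattern.sub(lambda match: dictionary[match.group(1)], text)


def text_to_speak(text, context_dictionary=None):
    """Return the string to send to the synthesizer for this context."""
    return apply_pronunciations(text, context_dictionary or {})
```

With this sketch, `text_to_speak("Jan 5", DATE_PRONUNCIATIONS)` yields `"January 5"`, while `text_to_speak("Jan called")` leaves the name untouched because no date dictionary is in effect.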
Audio Formatting. Audio formatting is analogous to visual formatting. Visual formatting uses the various features of a display to present information in a manner that can be efficiently processed by the eye. Audio formatting uses features of an auditory display to communicate information effectively and succinctly (see my Ph.D. thesis, "Audio System for Technical Readings," Cornell University, May 1994, for details).
Text to be spoken by Emacspeak can have one or more aural properties applied to it. A text string along with the attached properties can be thought of as a conceptual aural display list. The system provides a simple API for applications to attach such properties to text; when such text is later spoken, the underlying system examines any properties that may have been attached to a given utterance and produces the specified variations in voice characteristic. Allowing applications to attach aural properties to text and thereby produce a conceptual aural display list that is consistently handled by the underlying speech interface has many advantages. Different parts of the system can attach aural properties as appropriate, and have the cumulative effect correctly rendered.
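One way to picture the conceptual aural display list is as a string plus a set of property spans that are merged when the text is spoken. The Python sketch below is an assumption-laden illustration: the class `AuralText`, its methods, and the property names (`voice`, `pitch`, `rate`) are hypothetical, and Emacspeak itself attaches properties to emacs text rather than using a structure like this.

```python
class AuralText:
    """Text with aural properties attached to character ranges."""

    def __init__(self, text):
        self.text = text
        self.spans = []  # list of (start, end, properties) with end exclusive

    def put_property(self, start, end, **properties):
        """Attach aural properties to text[start:end]."""
        self.spans.append((start, end, properties))

    def display_list(self):
        """Split the text into runs, each with the cumulative properties."""
        cuts = sorted(
            {0, len(self.text), *(p for s, e, _ in self.spans for p in (s, e))}
        )
        runs = []
        for start, end in zip(cuts, cuts[1:]):
            merged = {}
            for span_start, span_end, properties in self.spans:
                # A span contributes if it covers the whole run.
                if span_start <= start and end <= span_end:
                    merged.update(properties)
            runs.append((self.text[start:end], merged))
        return runs
```

Two independent callers can each attach their own property to overlapping ranges; `display_list()` then reports a run that carries both, which is the cumulative effect described above.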
Auditory Icons. Speech-enabled applications running under Emacspeak use auditory icons -- short snippets of sound varying between 0.1 and 2 seconds in length -- to augment speech feedback. As in the case of predefined voices, Emacspeak defines a standard set of auditory icons along with conventions for when and where these are used in the interface. This enables applications to give effective and expected auditory feedback to the experienced user, and to reinforce the consistent sound and feel of the audio desktop.
Auditory icons in Emacspeak can be categorized as sounds that perform the following services:
1. Alert the user to specific events in the interface (the arrival of new mail, for instance).
2. Confirm user action (saving a file, for example).
3. Indicate changes in state resulting from user actions (selecting and deselecting objects).
4. Cue the user to specific kinds of objects (paragraphs, for example).
5. Augment prompts (asking a "yes" or "no" question, for instance).
6. Indicate ongoing activity (progress monitors).
Application-specific extensions are organized into separate modules. This allows you to maintain and update parts of Emacspeak without disturbing the rest of the system. In the last two years, a large number of application-specific extensions have been added with little if any change to the Emacspeak core (Table 1 lists some of the extensions). These extensions are mutually independent -- in fact, an extension is loaded if and only if the corresponding application is invoked. Loading extensions on demand helps reduce the amount of system resources used; it also ensures that different extensions do not inadvertently depend on one another.
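The on-demand loading described above might be pictured as a registry that runs an extension's loader the first time its application is invoked. This Python sketch is illustrative only; `ExtensionRegistry` and its methods are hypothetical names, and Emacspeak relies on emacs's own loading mechanisms rather than a class like this.

```python
class ExtensionRegistry:
    """Loads a speech extension only when its application is first invoked."""

    def __init__(self):
        self._loaders = {}  # application name -> function that installs the extension
        self.loaded = []    # applications whose extensions have been loaded

    def register(self, application, loader):
        self._loaders[application] = loader

    def on_invoke(self, application):
        """Call this when an application starts; loads its extension at most once."""
        if application in self._loaders and application not in self.loaded:
            self._loaders[application]()
            self.loaded.append(application)
```

Because each loader runs in isolation and only when needed, an extension that is never loaded cannot be a hidden dependency of another.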
Typically, these extensions provide a few additional commands for obtaining spoken feedback about the current application context. Extensions provide a significant chunk of the functionality by advising key parts of the original application to provide context-specific feedback. These fragments of advice and the new user-level commands rely on the basic services of the Emacspeak desktop. At the outset, I designed the Emacspeak core by implementing a few key extensions, then factoring out common code into the core. Newer extensions have contributed minimally to the core.
Emacspeak was designed as an extension to emacs, a large and evolving software system. It was therefore crucial to ensure that Emacspeak itself did not require modifying the emacs code base in any way, since doing so would have made Emacspeak dependent on a specific version of emacs.
Emacspeak currently works with no changes to the emacs distribution. In fact, Emacspeak works "out of the box" with both GNU Emacs and XEmacs. This is achieved by implementing all Emacspeak extensions via the Lisp advice facility.
Advice is a powerful mechanism for code extension and reuse. Given a function y = f(x1, x2, ..., xn), the advice facility can be used to attach additional functionality to function f without changing function f itself. Such advice fragments may be one of three types: before, around, or after advice.
All advice fragments have access to the arguments passed to f. In addition, after and around advice fragments have access to the result computed by f and may modify that value before it is returned. Thus, using advice, given function y = f(x1, x2, ..., xn), it is possible to synthesize a new function f' such that y' = f'(x1, x2, ..., xn) without modifying the source code of function f. Figure 2 illustrates the calling sequence for a function f that has before, around, and after advice.
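The calling sequence can be mimicked outside of Lisp. The following Python sketch is not the emacs advice facility; it is a small stand-in, with the hypothetical class name `Advised`, that shows how before, around, and after fragments combine to form f' while f itself stays unchanged.

```python
class Advised:
    """Wraps a function f so that advice can be attached without editing f."""

    def __init__(self, func):
        self.func = func
        self.before = []  # called as advice(*args); return value ignored
        self.around = []  # called as advice(inner, *args); decides how to call inner
        self.after = []   # called as advice(result, *args); returns the new result

    @staticmethod
    def _wrap(inner, advice):
        return lambda *args: advice(inner, *args)

    def __call__(self, *args):
        for advice in self.before:
            advice(*args)
        call = self.func
        for advice in self.around:
            call = self._wrap(call, advice)  # later advice wraps earlier advice
        result = call(*args)
        for advice in self.after:
            result = advice(result, *args)   # after advice may modify the result
        return result
```

For example, wrapping an addition function, then attaching an around fragment that doubles the inner result and an after fragment that adds 1, turns a call with arguments 2 and 3 into (2 + 3) * 2 + 1 = 11, all without touching the original function.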
All user-level functionality in the emacs environment is implemented as libraries in emacs Lisp. Any and all of this functionality can be extended by using the advice facility. Emacspeak takes advantage of this feature to speech-enable applications by advising them how to speak. Speech-enabling extensions are a small fraction of the size of the application being speech-enabled; see Table 2.
Example 1, from the Emacspeak core, illustrates how the advice facility can be used. Function next-line is a user-level emacs Lisp command that is executed to move the cursor to the next line. Typically, it is invoked by the user pressing the down arrow key. Emacspeak speech-enables command next-line to speak the current line. This is achieved by providing an after advice to function next-line. As an after advice, this fragment of code gets called after every call to function next-line. It first checks to see if the function was invoked due to a user action, then calls function speak-line to speak the current line. Function speak-line is provided by the Emacspeak core.
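The behavior just described can be approximated in Python for readers who do not have the Lisp listing at hand. This is a sketch under stated assumptions, not the code of Example 1: the `Buffer` class, the `interactive` flag, and the helper `add_after_advice` are hypothetical stand-ins for emacs's buffer, its notion of an interactive call, and its advice facility.

```python
class Buffer:
    """Minimal stand-in for an editor buffer with a cursor row."""

    def __init__(self, lines):
        self.lines = lines
        self.row = 0


spoken = []  # stands in for the speech device: records what would be spoken


def speak_line(buffer):
    """Speak the line the cursor is on."""
    spoken.append(buffer.lines[buffer.row])


def next_line(buffer):
    """The original command: move the cursor down one line."""
    buffer.row = min(buffer.row + 1, len(buffer.lines) - 1)


def add_after_advice(command, advice):
    """Return a command that runs the original, then the advice fragment."""
    def advised(buffer, interactive=False):
        result = command(buffer)
        advice(buffer, interactive)
        return result
    return advised


def speak_after_move(buffer, interactive):
    # Only speak when the user pressed a key, not when other code moved the cursor.
    if interactive:
        speak_line(buffer)


# The name next_line now refers to the speech-enabled command.
next_line = add_after_advice(next_line, speak_after_move)
```

Calling `next_line(buffer, interactive=True)` moves the cursor and records the new line in `spoken`; calling `next_line(buffer)` from program code moves the cursor silently.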
Function speak-line extracts the text on the current line and hands it to function speak. Function speak preprocesses the text and passes it to the speech server, where the preprocessed text is queued for conversion to audible output. Function speak's preprocessing involves breaking the text into appropriate chunks based on the content of the text being spoken. Thus, when speaking the line "Urgent: Please reply to this message," while in the context of reading e-mail, the text to be spoken is queued to the speech server as two separate chunks: "Urgent" and "Please reply to this message." Based on the rules of English grammar, Emacspeak makes each clause a separate chunk. Queuing speech in clauses produces good intonational structure in the output.
In contrast, the rules of English grammar do not apply to speaking C code. When speaking the line "context->speak(text);", function speak breaks the text at the various C operators to produce the correct intonation.
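A simplified way to see context-dependent chunking is to pick the splitting rule from the current mode. The Python sketch below is an assumption, not Emacspeak's implementation: the mode names and regular expressions are made up for illustration, and the English rule is deliberately naive (it would also split abbreviations and decimal numbers).

```python
import re

# Hypothetical clause-boundary rules, keyed by the kind of text being spoken.
CLAUSE_BOUNDARIES = {
    "english": r"[:;,.!?]+\s*",  # punctuation that ends a clause
    "c": r"->|[();.,]",          # a few C operators and delimiters
}


def split_into_chunks(text, mode):
    """Split text into the chunks that would be queued to the speech server."""
    pieces = re.split(CLAUSE_BOUNDARIES[mode], text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Under these rules, the e-mail line from the text splits into "Urgent" and "Please reply to this message", and the C line `context->speak(text);` splits into `context`, `speak`, and `text`.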
The speech server is itself run by Emacspeak as a subprocess of emacs. Emacs communicates with this server via a bidirectional pipe. After all preprocessing is complete, the various chunks of speech are queued to the speech server by calling the built-in emacs function process-send-string. At this point, the speech server receives the corresponding request on its standard input and takes over.
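The subprocess-and-pipe arrangement can be demonstrated with a toy server. In this Python sketch, a child process stands in for the speech server and simply echoes each command it reads on standard input; the command text and the function names are hypothetical and do not reflect the actual Emacspeak server protocol.

```python
import subprocess
import sys

# Toy stand-in for the speech server: acknowledges each line read from stdin.
TOY_SERVER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('server got: ' + line.strip(), flush=True)\n"
)


def start_server():
    """Launch the server as a subprocess connected by a bidirectional pipe."""
    return subprocess.Popen(
        [sys.executable, "-c", TOY_SERVER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def send_command(server, command):
    """Write one command to the server's stdin and read its one-line reply."""
    server.stdin.write(command + "\n")
    server.stdin.flush()
    return server.stdout.readline().strip()


def stop_server(server):
    """Close the pipe so the server sees end of input, then wait for it to exit."""
    server.stdin.close()
    server.wait()
    server.stdout.close()
```

The client's role ends once the string has been written to the pipe; everything after that point happens inside the server process.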
Advice fragments provided by Emacspeak range in simplicity from Example 1 to more complex advice for applications such as the spell checker described in my book. However, all advice provided by Emacspeak follows the same pattern. The typical advice fragment does the following:
Emacspeak has successfully speech-enabled all of emacs. The implementation is a model of code reuse; Emacspeak's code size is less than 1 percent of the code in GNU Emacs. The speech functionality enabled by Emacspeak is also far superior to that provided by traditional screen-reading systems because speech is generated by the application itself, rather than by a separate program that reads and speaks the display. Emacspeak is able to provide full context-sensitive feedback. This rich environment has enabled the investigation of innovative speech interfaces to a variety of versatile user applications ranging from web browsers to spreadsheets.
Gibbs, Wayt. "Envisioning Speech," Scientific American, September 1996. http://www.sciam.com/0996issue/0996profile.html.
Mynatt, E.D., W.K. Edwards, and K. Stockton. "Providing Access to Graphical User Interfaces," Proceedings of The First Annual ACM Conference on Assistive Technologies, November 1994.
Raman, T.V. "Audio System for Technical Readings," Ph.D. thesis, Cornell University, May 1994. http://cs.cornell.edu/home/raman/.
Raman, T.V. "Emacspeak: Direct Speech Access," Proceedings of The Second Annual ACM Conference on Assistive Technologies, April 1996.
Raman, T.V. "Emacspeak: A Speech Interface," Proceedings of CHI96, April 1996. http://cs.cornell.edu/home/raman/publications.
Raman, T.V. Auditory User Interfaces: Toward The Speaking Computer, Kluwer Academic Publishers, August 1997.
Raman, T.V. "Netsurfing Without a Monitor," Scientific American, March 1997. http://www.sciam.com/0397issue/0397raman.html.
Thatcher, James. "Screen Reader/2: Access to OS/2 and the Graphical User Interface," Proceedings of The First Annual ACM Conference on Assistive Technologies, November 1994.