Voice-Driven User Interfaces

Following up on my first post on my Interfaces and Interactivity exam, here is my answer to the second essay question, which read: ‘So-called ‘voice-driven information services’ have been a notable recent innovation in smart phones, particularly the Siri app on the iPhone. Give an account of the state of the art in this interaction technology, explaining what it is capable of, what its limitations are, and how it works in generic terms.’

Mobile devices have become far more pervasive in recent years, and users rely to a large extent on their smart phones for online queries and task management. For instance, it is predicted that the majority of Internet connections in eight years will be via a mobile device (Rainie, Anderson, & Fox, 2008), illustrating the ubiquity of information availability. As keyboard entry may be cumbersome while on the move, voice-driven user interfaces (VUIs) – using speech as input and output for user interaction – have recently been explored and implemented as a natural next step towards enhanced usability. This essay will first describe the operations and merits of VUIs within natural language user interfaces in general. It will then describe and evaluate two examples of smart phone VUIs, the Google Mobile App and the Siri app on the iPhone 4S, as these are commercial, state-of-the-art implementations with marked popularity. Finally, usability challenges and considerations for further VUI development will be discussed.

A VUI is a form of natural language user interface, whereby the system engages in two-way communication with the user through speech recognition and synthesis (Rogers, Sharp, & Preece, 2011). The system recognises speech by capturing and analysing acoustic signals using various methods, such as statistical language models (Oviatt, 2008). Synthesis occurs, for example, by combining voice sounds from a recorded database according to a computational algorithm, or by applying phonological rules to text input, which is subsequently passed through a synthesiser (Oviatt, 2008). While the former method may sound more natural, the latter is more intelligible to the listener. Until just a few years ago, VUIs were highly limited, constraining tasks such as automated phone transactions to tedious and one-sided interactions; however, they have recently become more sophisticated in their expression, i.e. less artificial, and more accurate in their recognition (Rogers et al., 2011).
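To make the role of a statistical language model more concrete, here is a minimal sketch of how such a model might choose between two recognition hypotheses that sound alike. It is an illustration only, not how GMA or Siri actually work: the toy corpus, the function names and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter

# Toy training corpus (hypothetical; a real system would use far more data).
CORPUS = [
    "recognise speech with a phone",
    "recognise speech on the move",
    "a nice beach near me",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with <s> marking the start of a sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a word sequence."""
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        # Unseen word pairs get a small but non-zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def best_hypothesis(hypotheses, unigrams, bigrams):
    """Pick the hypothesis the language model considers most likely."""
    return max(hypotheses, key=lambda h: log_prob(h, unigrams, bigrams))


unigrams, bigrams = train_bigrams(CORPUS)
candidates = ["recognise speech", "wreck a nice beach"]
print(best_hypothesis(candidates, unigrams, bigrams))
```

In a real recogniser the language model score would be combined with an acoustic score for each hypothesis; this sketch leaves the acoustic side out entirely.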

This advancement in VUI technology makes it a promising method of addressing the limitations of requiring human visuomotor resources for mobile device input. As mobile devices have evolved from clumsy and effortful mechanical interaction, e.g. the plastic keyboards of earlier mobile phones, towards a more seamless, intuitive and direct interaction style via touch interfaces in today’s smart phones, voice-driven interfaces seem to be the next natural step. Speech is a faster and more natural mode of human communication than typing, and users already have a well-established conceptual model of it. Speech may therefore allow users, particularly novices, to interact with a system in an already familiar manner (Rogers et al., 2011). It is important, however, to avoid speech-only interfaces by combining speech with visual information: speech is transient and often ungrammatical, which may be problematic for lengthy, complex interactions. Multimodality should therefore be implemented to ensure information permanence and disambiguation, as well as to reduce the time and cognitive effort involved in unimodal interaction (Rogers et al., 2011).

The Google Mobile Application (GMA) incorporates Google Voice Search, which is an example of a multimodal VUI for smart phones that has achieved wide use and success. To use GMA, the user opens the app on their smart phone, either raises the phone to the ear (a gesture detected by the phone’s accelerometer) or taps the microphone button on the interface, waits for a beep, and speaks (Google Voice Search, 2011). GMA’s Voice Search aims to recognise any spoken query and handle anything the desktop version of Google Search is capable of (Schalkwyk et al., 2010). A key advantage of GMA, differentiating it from earlier mobile phone VUIs, is that voice search is entirely user-initiated, adhering to usability principles of flexibility and user control (Shneiderman et al., 2009). Moreover, it is available on a wide range of platforms, including iPhone, BlackBerry and Android phones, which may increase interface familiarity. Google user studies have revealed search patterns indicating that Voice Search is used more often than text search for on-the-go topics, such as local food and drink establishments, and less often for potentially sensitive topics, e.g. health, illustrating VUIs’ limited usability in public spaces with regard to privacy. These user studies have the substantial advantage of leveraging Google’s constant access to data from real users of its services, and its investments in vast computational resources, to drive innovation.

However, the GMA has several challenging UI design issues that remain to be resolved given the novelty of mobile device VUIs. An example is signalling the speech endpoint: GMA on Android automatically ends recording at points of silence, whereas GMA on BlackBerry utilises the phone’s physical buttons by having the user manually signal the speech endpoint (Schalkwyk et al., 2010). Arguably, the former is convenient for short voice input, e.g. voice commands, whereas the latter tolerates longer breaks in speech input; both should therefore be offered on the same platform. Further, the gesture-based speech trigger implemented in GMA for iPhone is found to be used in one third of voice searches: while this signals that users enjoy the feature, many are not aware of it (Schalkwyk et al., 2010). An internal user study at Google revealed that even those who are aware of the feature prefer receiving the visual information from the screen as well, which cannot be done while gesturing. The user study also revealed that the option to display alternative recognition hypotheses by selecting the drop-down menu on the results screen is rarely used (Schalkwyk et al., 2010). It is possible that switching modalities from voice to manual tapping is too cumbersome, and that users would rather repeat the voice query. Designers face the dilemma of making the alternatives more visible while avoiding interface clutter. Finally, interaction with GMA does not entirely mimic natural language, as optimal performance relies on using particular command shortcuts, e.g. ‘call’ and ‘directions to’.
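The trade-off between automatic and manual endpointing can be illustrated with a small sketch of silence-based endpoint detection. This is a simplified illustration under my own assumptions, not GMA’s actual algorithm: the function name, the energy threshold and the per-frame energy values are all hypothetical.

```python
def find_endpoint(frame_energies, silence_threshold=0.1, max_silent_frames=3):
    """Return the index of the frame at which recording would stop.

    Recording stops once `max_silent_frames` consecutive frames fall below
    `silence_threshold` after speech has started. Returns None if no
    endpoint is detected within the given frames.
    """
    silent_run = 0
    speech_started = False
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # A loud frame counts as speech and resets the silence counter.
            speech_started = True
            silent_run = 0
        elif speech_started:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None


# Hypothetical per-frame energies: leading silence, speech with one short
# pause at frame 4, then a longer pause before the speaker continues.
energies = [0.0, 0.0, 0.5, 0.6, 0.05, 0.7, 0.02, 0.01, 0.03, 0.4]

print(find_endpoint(energies, max_silent_frames=3))  # tolerates the short pause
print(find_endpoint(energies, max_silent_frames=1))  # cuts off at the first pause
```

A low tolerance for silence suits short commands but cuts the user off at the first hesitation, whereas a high tolerance makes short queries feel sluggish. A manual button press, as on BlackBerry, sidesteps the choice at the cost of an extra action.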

The Siri software introduced with the iPhone 4S in 2011, on the other hand, attempts to loosen the constraints on natural language interaction while offering increased power and functionality. While GMA does provide personalised speech recognition by recording previous voice searches for a more accurate speech model, Siri takes this a step further by harnessing knowledge about the user and their context, providing responses with both user personalisation and inherent personality. While GMA focuses on Internet search, Siri emphasises further functionality, such as creating reminders. It was found, for instance, that it takes 3 seconds to create a reminder with Siri, compared to the 10 seconds it takes to type an event into the iPhone calendar (Wired, 2011). Further advantages include the automatic activation of dictation by holding the phone up to one’s ear, similar to GMA’s gesture-based trigger, and its flexible capabilities, e.g. responding with nearby mental health organisations when a user expresses suicidal intent (Wired, 2011). Siri does have its shortcomings, however, for instance not understanding commands such as ‘lower screen brightness’ and not integrating reminders into the calendar application, requiring the two applications to be open simultaneously (Wired, 2011). Moreover, a study found that novice users take longer to learn Siri than GMA and similar search-focused VUIs (VanDuyn, 2010). This study was limited, however, in that it only included two users and an early version of Siri.

As interface design should exclude as few users as possible, it is fortunate that VUIs may enhance the mobile device experience for disabled users and users in developing world regions. For instance, users with visual or mobility impairments could greatly benefit from voice interaction (Wired, 2011). A caveat is that development towards VUIs as a primary mode of mobile device interaction may alienate speech- and hearing-impaired users. This emphasises the importance of maintaining multimodality in the VUI. Language and literacy are often barriers to accessing information services in developing world regions, in which mobile phones are the most widespread form of ICT, making this a promising arena for mobile device VUIs (Botha, Calteaux, Herselman, Grover, & Barnard, 2012). Challenges to address include the wide range of experience with mobile and voice technology, and the varying degrees of literacy, among the user population. Moreover, such regions have large variability in local languages and accents, which may be difficult to support in VUIs. For both disabled users and users in developing countries, early and continuous involvement of real users in development is key.

In conclusion, VUIs provide a promising new arena for mobile device interaction. Key advantages include intuitive learning for novice users, who can draw on an established conceptual model, and the freeing up of cognitive resources and information processing modalities for other tasks via multimodality. The GMA and Siri are two commercially successful implementations of VUIs for smart phones, found to enhance the user experience by providing flexibility and user control, and easy access to on-the-go information. As VUIs are still a relatively novel domain within human-computer interaction, however, UI design challenges remain to be resolved, such as how to collect user utterances and how to improve learnability. Nevertheless, mobile device VUIs may have large benefits for a wide user population, including disabled users and users from developing countries.


Rainie, L., Anderson, J.Q., & Fox, S. (2008). The Evolution of Mobile Internet Communications. In The Future of the Internet, v3. Cambria Press.

Rogers, Y., Sharp, H., & Preece, J. (2011). Interaction Design: Beyond Human-Computer Interaction (3rd Ed.). Wiley.

Oviatt, S. (2008). Multimodal interfaces. In J.A. Jacko & A. Sears (Eds.), Handbook of Human-Computer Interaction, 286-304. L. Erlbaum Associates Inc.

Google Voice Search (2011). Retrieved from http://www.google.com/mobile/voice-search/ on 2 May 2011.

Schalkwyk, J., Beeferman, D., Byrne, B., Chelba, C., Cohen, M., Garret, M., & Strope, B. (2010). Google Search by Voice: A case study. In A. Neustein (Ed.), Work, 1-35. Springer.

Shneiderman, B., Plaisant, C., Cohen, M., & Jacobs, S. (2009). Designing the User Interface: Strategies for Effective Human-Computer Interaction (5th Ed.). Addison Wesley.

Wired (2011). With Siri, the iPhone Finds Its Voice. Retrieved from http://www.wired.com/reviews/2011/10/iphone4s on 2 May 2011.

VanDuyn, I. (2010). Comparison of Voice Search Applications on iOS. Retrieved from http://www.isaacvanduyn.com/voice_project.shtml on 2 May 2011.

Botha, A., Calteaux, K., Herselman, M., Grover, A. S., & Barnard, E. (2012). Mobile User Experience for Voice Services: A Theoretical Framework. In V. Kumar & J. Svensson (Eds.), M4D2012, 1-10. New Delhi: Karlstad University.

