Why Voice Is Just A Bridge

by Srikar Kalvakolanu

Voice UI is meaningful, but is it really “disruptive”?

Voice as an alternative UI has gained a great deal of traction in the past three years. In fact, many people have called 2017 the year of voice. This has led to massive investment in voice as a platform (Alexa, Google Home, Siri, Bixby, etc.). However, amid all of this investment and creation, it’s important to understand the context of what voice UI is and what it is replacing.

Voice As An Input

Voice is primarily an input mechanism. As with a mathematical function, you take in an input (a command, text, voice, etc.), put it through some sort of mechanism that interprets and uses that information, and eventually get the output you want. In this case, voice is often just a simple input replacement mechanism. We are shifting from clicking a button or typing to speaking, with technologies such as Natural Language Processing turning the speech into an equivalent input.

Voice is primarily an input function that replaces current modes (text, button clicking, etc.).

There’s no denying that voice is often easier than manual input, or at least has less “friction,” in some cases. For example, think about the weather. If you have an Alexa, you can simply ask, “Alexa, what’s the weather like today?” Manually, you’d have to take out your phone and either find the Weather app or Google it. That may seem like a small difference, but voice is often easier. Consider the whole picture, however: the output is much the same. One is visual and the other auditory, but the same information is served up both times.
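To make the “same function, different input” point concrete, here is a minimal Python sketch of the weather example. Everything in it is an assumption for illustration: the `Intent` type, the function names, the hard-coded `FORECASTS` table, and the toy keyword matching that stands in for real Natural Language Processing. It is not how Alexa or any actual assistant works; it only shows that a tap and a spoken request can collapse into the same input before the function runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """The normalized input the function actually consumes."""
    name: str
    city: str


# Hypothetical stand-in for a real weather service.
FORECASTS = {"Chicago": "sunny, 72F"}


def intent_from_tap(selected_city: str) -> Intent:
    """Manual input: the user opens an app and taps a city."""
    return Intent("get_weather", selected_city)


def intent_from_transcript(transcript: str, home_city: str) -> Intent:
    """Voice input: a transcript is mapped to the same intent.

    Toy keyword matching only; real language understanding is far
    more involved than this.
    """
    if "weather" not in transcript.lower():
        raise ValueError("unrecognized request")
    return Intent("get_weather", home_city)


def get_weather(intent: Intent) -> str:
    """The function: identical no matter how the input arrived."""
    return FORECASTS[intent.city]


def render_visual(forecast: str) -> str:
    return f"[screen] {forecast}"


def render_spoken(forecast: str) -> str:
    return f"[speaker] Today's weather is {forecast}"


if __name__ == "__main__":
    tapped = intent_from_tap("Chicago")
    spoken = intent_from_transcript(
        "Alexa, what's the weather like today?", home_city="Chicago"
    )
    print(render_visual(get_weather(tapped)))
    print(render_spoken(get_weather(spoken)))
```

Both paths produce an equal `Intent`, so `get_weather` cannot tell them apart. Only the first and last steps differ, which is the sense in which voice replaces the input rather than the function.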

Voice As A Disrupter

The word “disruption” is thrown around a lot. It’s almost one of those buzzwords like “AI” (that’s a whole article to itself) that seemingly everyone likes to say, but that few people actually do anything meaningful with. I see disruption as something that alters the input, function, or output mechanism of a process so fundamentally that it pushes other modes out entirely. Now, yes, voice is doing some of that, but it doesn’t do it with the sort of ubiquity and sense of shift that other truly disruptive technologies have.

I’ll go further and argue that it potentially never will. Voice simply isn’t the optimization of this “input > function > output” mechanism (not to say that there truly is an “optimal” state). By and large, the “optimal” state of this traditional mechanism isn’t to make the input itself easier (which voice often doesn’t or can’t do), but to reduce the amount of input needed or eliminate the input altogether.

Voice still operates within the same mechanic as chat bots, SMS assistants, and other interfaces that are highly input-based. The true disruption is one that understands context and requires no input whatsoever. That stage is coming in the form of AI, Machine Learning, and Coaching Networks, but is still fairly nascent today.

A World of “No Input” and Disruption

Now, the input-less world is unique in the sense that it requires the function itself to 1) find relevant data; 2) contextualize the relevant data; and 3) use the relevant data to create a meaningful output. This shifts the model a bit: the input isn’t eliminated per se, but is gathered automatically, without human operation.

A new paradigm would adapt the function to glean its own inputs.

For example, a computer may need to order your breakfast for you at McDonald’s so that it is ready by the time you get there. It may have to detect that you’re about to leave your apartment, and know that an Egg McMuffin takes 3.2 minutes to make, that your commute is 5.1 minutes, and that you also like your coffee in the morning. It gathers all of this information, contextualizes the data, and sends the order to McDonald’s so your breakfast is made right on time.
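The arithmetic in that example is simple enough to sketch. The Python below is a hedged illustration, not a real ordering integration: `Signals`, `OrderPlan`, and `plan_order` are hypothetical names, the signals are handed in directly rather than sensed, and the coffee prep time is an assumed placeholder (only the 3.2-minute Egg McMuffin and the 5.1-minute commute come from the example above).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Prep times in minutes. The Egg McMuffin figure is from the example
# above; the coffee figure is an assumption for illustration.
PREP_MINUTES = {"Egg McMuffin": 3.2, "coffee": 1.0}


@dataclass(frozen=True)
class Signals:
    """Step 1: relevant data gathered without the user typing or speaking."""
    leaving_home: bool
    commute_minutes: float
    usual_order: Tuple[str, ...]


@dataclass(frozen=True)
class OrderPlan:
    """Step 3: the meaningful output sent to the restaurant."""
    items: Tuple[str, ...]
    send_after_minutes: float


def plan_order(signals: Signals) -> Optional[OrderPlan]:
    """Step 2: contextualize the data into a timed order, or do nothing."""
    if not signals.leaving_home:
        return None
    # The slowest item decides how early the kitchen has to start.
    prep = max(PREP_MINUTES[item] for item in signals.usual_order)
    # Wait until the remaining commute equals the prep time.
    delay = max(0.0, signals.commute_minutes - prep)
    return OrderPlan(signals.usual_order, round(delay, 1))


if __name__ == "__main__":
    signals = Signals(
        leaving_home=True,
        commute_minutes=5.1,
        usual_order=("Egg McMuffin", "coffee"),
    )
    print(plan_order(signals))
```

With a 5.1-minute commute and a 3.2-minute sandwich, the order goes out about 1.9 minutes after you leave. The hard part in practice is not this subtraction but reliably producing the `Signals` in the first place.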

Now, you could also tell your phone that you’re leaving and input the McDonald’s address and what you want, but the true magic is in the machine understanding all of that already.

We are clearly a ways away from being able to create applications and UIs that work in this way, but that is the true optimal stage. Thinking about this raises the question of whether voice really has a role in that equation. It might still be the stop-gap between today and full automation (Iron Man’s JARVIS has convinced me that it would be pretty awesome), but in the long game, it’s about balancing voice and automation to get ready for the future.

For most companies investing in voice, this calls for discipline: keep working on the voice platform, but balance that work with efforts to remove inputs and get ready for the AI/automation world. For those who go all in on voice, the moment may pass by quickly, so it’s worth building your product strategy and roadmap around the long-term and more sustainable future that is automation.