Digital assistants – why is being smart so hard?

Johannes Stiehler, CTO, ayfie

Interacting with computers has always been clunky and indirect, and today’s digital assistants are no different. They misunderstand, misinterpret and malfunction. But why is it so hard to make this most promising way of interacting with a machine work?

The answer lies in the interaction between the different modules required for a smart assistant. When we unravel the steps Siri, Alexa, Cortana & Co. need to take to execute a simple command, we find that the ease with which our brain solves problems or responds to commands is not so readily replicated. There are three difficult problems at the root of many issues within natural language processing (NLP) and the digital assistants that depend on this technology: variance, ambiguity, and context.

Digital assistants aim to remove the confusion of how to get a computer to execute a specific task. They do this by making the interaction between human and machine as similar as possible to the interaction between two humans. That is accomplished with natural language.

But the challenges underpinning this technology are evident in a simple request: “Schedule a meeting with Peter on Friday.” In order for an assistant to act on this request, three things need to happen:

  1. Speech to text: The audio signal needs to be recorded, digitized and then transformed into a text representation or sequence of words.
  2. Text to semantics: The words need to be analyzed for syntactical structures and meaning and transformed into some form of semantic representation.
  3. Semantics to action: The semantic representation must be interpreted as a sequence of operations to be performed.
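The three steps can be pictured as a chain of functions, each consuming the output of the previous one. The following is only a toy sketch: the function names, the `Intent` structure and the operation strings are invented for illustration, and the speech recognizer is replaced by a placeholder that returns a fixed sentence.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    action: str
    person: str
    day: str


def speech_to_text(audio: bytes) -> str:
    # Placeholder (assumption): a real recognizer would decode the audio signal.
    return "schedule a meeting with Peter on Friday"


def text_to_semantics(text: str) -> Intent:
    words = text.split()
    # Toy parsing: assumes the fixed pattern "... with <person> on <day>".
    person = words[words.index("with") + 1]
    day = words[words.index("on") + 1]
    return Intent(action="schedule_meeting", person=person, day=day)


def semantics_to_action(intent: Intent) -> list[str]:
    # Hypothetical operation names, listed in the order they would run.
    return [
        f"lookup_contact({intent.person})",
        f"find_free_slot({intent.day})",
        "create_calendar_entry()",
    ]


def handle(audio: bytes) -> list[str]:
    return semantics_to_action(text_to_semantics(speech_to_text(audio)))
```

Because each function blindly trusts its input, an error in the first step is carried through all the others.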

A mistake at any step in this process will cause the digital assistant to misunderstand the user’s intent and do the wrong thing. A human is able to make semantic predictions about the speaker’s intention and fill in the gaps. The digital assistant can make no such semantic predictions or assumptions. For the first step, it relies solely on math and a large database of speech and corresponding text representations, using machine learning algorithms to build a model for recognizing words from sounds. It uses statistics (not semantics and context) to predict words, so it knows that “schedule a” is much more likely to be followed by “meeting” than by “meat thing.”
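A rough way to picture this kind of statistical prediction is to count which word follows which in a body of transcribed text. The sketch below uses a tiny invented corpus and simple word-pair counts. Production recognizers use far larger data sets and more sophisticated models, so this illustrates the idea only.

```python
from collections import Counter

# Tiny invented corpus of transcribed phrases (assumption, for illustration).
corpus = [
    "schedule a meeting with peter",
    "schedule a meeting on friday",
    "schedule a call with anna",
    "cancel a meeting on monday",
]

# Count how often each word is directly followed by each other word.
pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    pair_counts.update(zip(words, words[1:]))


def next_word_probability(previous: str, candidate: str) -> float:
    """Estimate P(candidate | previous) from the word-pair counts."""
    total = sum(n for (first, _), n in pair_counts.items() if first == previous)
    return pair_counts[(previous, candidate)] / total if total else 0.0


print(next_word_probability("a", "meeting"))  # 0.75
print(next_word_probability("a", "meat"))     # 0.0
```

The model prefers “meeting” only because it has seen that word pair more often, not because it understands what a meeting is.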

The digital assistant is left with a sequence of words that, at this point, have no inherent relationship or meaning. It has no idea what the sentence actually means. And that is where the challenges facing NLP become abundantly clear:


Variance

Speech to text is a process that comes so naturally to humans that it can be difficult to understand why digital assistants have not yet mastered it. But consider variations in speech. Each individual uses slightly different vocabulary and expressions, with a unique pronunciation. There are a hundred ways to express the desire to set an appointment with a person: “I want to have a meeting …,” “schedule an appointment…,” “put something on my calendar.” All of these phrases share the same semantics and should be interpreted the same way.
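The simplest conceivable way to handle this is a list of known phrasings that all map to the same intent. The sketch below does exactly that. The phrase list and the intent name are made up for illustration, and real assistants learn such mappings from data rather than from a hand-written list.

```python
# Hypothetical trigger phrases (assumption); a real system would learn these.
SCHEDULE_PHRASES = (
    "i want to have a meeting",
    "schedule an appointment",
    "schedule a meeting",
    "put something on my calendar",
)


def detect_intent(utterance: str) -> str:
    """Map an utterance to an intent label by substring matching."""
    text = utterance.lower()
    if any(phrase in text for phrase in SCHEDULE_PHRASES):
        return "schedule_meeting"
    return "unknown"


print(detect_intent("I want to have a meeting with Peter"))  # schedule_meeting
print(detect_intent("Let's sit down with Peter on Friday"))  # unknown
```

The second call shows the weakness of the approach: any phrasing that is not on the list goes unrecognized, even though a human would understand it immediately.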


Ambiguity

While one semantic fact can be expressed with hundreds of simple sentences, one expression can refer to millions of semantic entities. Which Peter should the meeting be scheduled with? If there is only one Peter in the user’s electronic address book, perfect. But most ambiguity is not so easily resolved unless you ask the user.
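The address-book case can be sketched in a few lines: a unique match is used directly, and anything else produces a question for the user. The function name, the return convention and the sample names are assumptions made for this example.

```python
def resolve_contact(first_name: str, address_book: list[str]):
    """Return (contact, None) on a unique match, else (None, question)."""
    matches = [
        contact
        for contact in address_book
        if contact.split()[0].lower() == first_name.lower()
    ]
    if len(matches) == 1:
        return matches[0], None
    if not matches:
        return None, f"I could not find anyone named {first_name}."
    # More than one candidate: the only safe option is to ask the user.
    return None, f"Which {first_name} do you mean: " + ", ".join(matches) + "?"


# Invented sample data.
book = ["Peter Miller", "Peter Schmidt", "Anna Weber"]
print(resolve_contact("Anna", book))   # ('Anna Weber', None)
print(resolve_contact("Peter", book))
# (None, 'Which Peter do you mean: Peter Miller, Peter Schmidt?')
```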


Context

Since context comes so naturally to us, we often don’t recognize it as something that requires clarification. We are not aware that at any given moment we are integrating an unimaginable number of nested and independent contexts into how we interpret what others are saying. A smart assistant lacks this capacity for context, and thus its interpretations often diverge from our intention.

These challenges are magnified with broader application. Smart assistants were not designed just to schedule meetings. People expect them to do a thousand things. This is no easy feat. “What can I cook this evening?”, “When did Lou Reed die?” and “Do you want to marry me?” are all questions that require completely different, and sometimes very complex, sets of actions.

So, while you can have an intelligent conversation with your human assistant, that is not yet the case with your digital assistant. Full language comprehension is far more difficult than the narrow interactions that digital assistants are capable of today. But even this narrow interaction proves more useful each day: think of looking for a recipe with greasy fingers, or trying to read a text message with your hands on the steering wheel. It is in fact reassuring that, in a world where everything needs to be a disruptive breakthrough, digital assistants will continue to make an impact through steady, incremental progress.

About Johannes Stiehler

Johannes Stiehler has worked with advanced linguistics and information retrieval for nearly 20 years. Stiehler holds a master’s degree in computational linguistics, computer science and Romance languages from the renowned Ludwig-Maximilians-University in Munich. He currently serves as the CTO at ayfie.