CHI ’24 Paper: ‘Apple’s Knowledge Navigator: Why Doesn’t That Conversational Agent Exist Yet?’

I’m a massive fan of Apple’s 1987 Knowledge Navigator concept video. Like other tech nerds, I often filter technology advancements through the lens of that vision: How close are we to that future?

Much of what it anticipates has come to pass in the nearly four decades since—video streaming, touchscreens, globally connected computers, wireless networking, and more.

Even some portions of the most fantastical and oft-discussed aspect of the video—the human-like digital assistant, Phil—are possible today; for example, Phil’s ability to summarize vast amounts of data, understand the spoken word, or speak in a voice that’s virtually indistinguishable from a human’s.

However, the core of the video—where a professor has a human-like conversation with his digital assistant, which can anticipate needs and act autonomously on the professor’s behalf—well, we’re not quite there yet.

This fascinating research paper (PDF, video summary) attempts to answer the questions I’ve often asked myself: Why aren’t we there yet? What’s preventing us from having a “conversational agent” like Phil? Is it purely technological limitations, or are there other issues at play?

What struck me the most about this paper was the systematic approach the authors took to identify the nature of the interactions between the professor and Phil: What is Phil’s role at any given moment? Is it proactive, interruptive, collaborative, or passive?

The researchers looked at every verbal exchange between the professor and his digital assistant, then identified what those exchanges represent and how various “constraints” might prevent or delay the implementation and adoption of conversational agents today.

The authors applied three frameworks to analyze the interactions between the professor and Phil, and using these frameworks, they captured “dialogue, actions, and agent capabilities” and identified “events” that were:

[…] feasible and common today, feasible and not common today, or not feasible today. Feasibility was determined by comparing the demonstrated agent capabilities to those of widely adopted agents like Apple’s Siri and to current trends in HCI [Human Computer Interaction] research and development. These characterizations were then used to consider why the Phil agent differs from today’s personal digital assistants.

From this effort, they identified

[…] a list of 26 agent capabilities, such as “Knowledge of contacts and relationships” (e.g., Mike’s mother) and “Can accurately extract data from a publication” (e.g., Phil summarizes the results of an academic paper using a graph).

Those 26 agent capabilities were condensed into nine broad capabilities—knowledge of user history, knowledge of the user, advanced analytic skills, and so on. For each of those, they focused on two actionable categories (“currently feasible but not common today” and “not currently feasible”).

For me, these “agent capabilities” and their feasibility were the most intriguing part of the study. When Apple announced Apple Intelligence last June, I did a very naïve version of this with their demos. I wish I had been familiar then with the frameworks and methodology this paper used!

Back to the paper… The nine broad capabilities were then:

[…] tagged with constraints that restrict their adoption or development […]. Some were based on the user, such as trust or privacy, and some were based on available technology itself. The authors used categories similar to those used in previous studies of barriers to technology adoption to group the constraints into three user-centered categories (privacy, social and situational, trust and perceived reliability), and one technology category.

Those “constraints” are effectively reasons why it may be difficult—or impossible—to develop and deploy a “conversational agent” today. A few reasons, from my perspective:

  • Technology. With today’s available systems, two-way conversation with a digital assistant is just barely possible, and the ability for those systems to interrupt a conversation to provide answers to unasked questions is nonexistent. We’re barely at a level where questions we ask are answered correctly. While tools enabling autonomous behavior are coming online, they remain highly experimental and limited in scope.
  • Privacy. Phil seemingly knows everything about Professor Mike, and is actively listening at all times. There remains an inherent distrust of systems that are “always on.” Many consumers today already (wrongly) believe their Siri, Google, or Alexa devices are “spying” on them. It will take a concerted effort to convince customers to allow the kind of constant, active listening that’s demonstrated in the video.
  • Trust and Reliability. The professor implicitly trusts Phil to accomplish assigned tasks, without requiring confirmation, and without concerns about Phil’s ability to successfully complete those tasks. Even with an extremely competent human assistant, such trust can take significant time to establish, and a single mistake can erode that trust completely. Meanwhile, we’ve all had that experience where “Call mom” becomes “Call Ron”, or our dictated text message shows up as unintelligible on the other side. We’re currently a long way from trust and reliability.
  • Social and Situational. Phil and the professor primarily communicate verbally. While they appear to be in a large, private (home) office, making these interactions feasible, it would limit functionality in most other spaces, including traditional offices, libraries, and classrooms. Even with headphones, talking out loud “to yourself”—even if others are doing the same thing—might carry a stigma that’s difficult to overcome. Apple introduced “Type to Siri” in iOS 18, partly as a way to allow non-verbal users to benefit from Siri, but also with an explicit acknowledgment that “out loud” communications are frowned upon in some environments.

My takeaway from the paper is that while (much) improved technology is a necessary component to enable conversational agents, it is not sufficient. Overcoming the technical hurdles does not immediately bring us the kind of human/digital assistant engagement we see in Knowledge Navigator. Even if there’s an unanticipated technological leap forward, the other three constraints remain significant barriers to the introduction and eventual adoption of a Phil-level agent.

Technology, it seems, is the easy part.

⚙︎