It’s very telling that this subtle sense of when to pause and when to yield, when to start new threads and cut old threads, is something in many cases excluded from bot conversations.

Decision Problems

It is this ballet and negotiation of timing that linguists and programmers alike have kept out of their models of language, and it is precisely this dimension of dialogue in which words like “uh” and “um” play a role. “Speakers can use these announcements,” linguists Clark and Fox Tree write, “to implicate, for example, that they are searching for a word, are deciding what to say next, want to keep the floor, or want to cede the floor.”

We are told by speaking coaches, teachers, parents, and the like just to hold our tongue. The fact of the matter is, however, filling pauses in speech with sound is not simply a tic, or an error: it’s a signal that we’re about to speak. (Consider, as an analogue, your computer turning its pointer into an hourglass before freezing for a second.) A big part of the skill it takes to be a Jeopardy! contestant is the ability to buzz in not when you know the answer, but as soon as you know you know the answer. That buzz means, roughly, “Oh! Uh …,” and its successful deployment is part of what separates champions from average players. (By the way, this is part of what has been giving IBM researchers such a hard time preparing their supercomputer Watson for serious competition against humans, especially for short questions that only take Alex Trebek a second or two to read.)

In 2000, MIT researchers Nikko Ström and Stephanie Seneff presented at Interspeech a paper which found that leveraging the word “um” could make spoken-dialogue telephone menus much easier and more intuitive to use. At the 2009 Interspeech conference in Brighton, a group of four researchers from Kyoto University presented findings at the poster session to the effect that in a number of situations, the timing of human speech offers a computer system more information than the content.

In part, computer programs’ history of not dealing well with questions of timing goes back to the original theories about what computer programs are. As formalized in the early days of computability theory, programs performed what are known as “decision problems.” The idea was that you would feed a program a given input, specifying where the input began and ended. The program would then process that input, taking however much time it happened to need, at the end of which it would output a clearly defined result.

This model of language, this presumed one-to-one mapping of “inputs” to “outputs,” leads to all sorts of terrific opportunities for mischief making, I discovered, when it comes to conversing with chatbots. ELIZA’s pattern matching, for instance, isn’t sophisticated enough to consider whether the pattern it’s trying to match is one utterance or many. When I begin a conversation as follows—

I am Brian.

Do you enjoy being Brian?

—and reply “I, er, guess so,” I get back the noncommittal response “I see.” But if I—as many instant-message users do—break that response up across multiple carriage returns, chaos ensues:



guess so

Tell me more …

Tell me more …

Tell me more …
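
The mechanics behind this exchange can be sketched in a few lines of Python. This is a toy, not ELIZA’s actual script: the two patterns, the fallback reply, and the names `RULES`, `respond`, and `converse` are all assumptions made up for illustration. What it shows is the decision-problem shape described above, in which each carriage return is treated as one complete input and mapped to exactly one output, so that an utterance split across three lines draws three replies.

```python
import re

# Hypothetical rules for illustration only; ELIZA's real script is far larger.
RULES = [
    (re.compile(r"^i am (\w+)", re.IGNORECASE), "Do you enjoy being {0}?"),
    (re.compile(r"^i\b.*\bguess so", re.IGNORECASE), "I see."),
]
FALLBACK = "Tell me more ..."


def respond(line: str) -> str:
    """Map one complete input line to exactly one output line."""
    text = line.strip()
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK


def converse(lines):
    """Every carriage return counts as a full utterance, however small."""
    return [respond(line) for line in lines]


if __name__ == "__main__":
    print(converse(["I, er, guess so"]))       # one utterance, one reply
    print(converse(["I", "er", "guess so"]))   # same words, three replies
```

Nothing in the sketch has any notion of a speaker who is still mid-thought; the only boundary it can see is the end of a line.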

Surprisingly, the rigidity of the input-output binding is even stronger in many of the newer chatbots. If you talk to Cleverbot, 2005 and 2006 Loebner Prize winner, at its website, each time you hit the enter key, the input box locks and cannot be clicked or typed into until Cleverbot has finished responding:

oh hi

How are you?


Is that good or bad?



If you talk to Elbot, the 2008 Loebner Prize winner, online, each time you hit the enter key the screen refreshes, and one of Elbot’s responses is visible:


I’m starting to wonder: Who convinced you to come visit me?



um …

That would be a great name for a robot!

With a number of customer-representative-type bots, the input screen locks into a “Please wait …” message after each carriage return:


That’s ok, what else can I help you with?


I’m not sure how to answer that. Please try rewording your question.


Children traveling alone are welcome on Alaska Airlines and Horizon Air. Here’s information regarding our Unaccompanied Minor service.

It’s fascinating that for all the trenchant questioning that many of these programs can deal with, mere mumbling destroys them.

Leveraging the Medium

I often judge a work of art by how necessary it is for that work to be in that medium. In part, that means how much it leverages the medium’s unique capacities: a symphony, of course, can do things that an essay can’t, and vice versa.

Different written media, for instance, leverage their unique characteristics with time to produce different brands of intimacy, different communicative possibilities: the quick-moving text message saying one was just now being thought of and producing the empathic delight of a moment shared across space between two minds; the slow-moving postal letter or handmade gift saying that one was being thought of at length without knowing it, imbuing future days with the sheen of possibility.

In some sense the story of instant messaging—the medium of the Turing test—is the story of the telegram, accelerated to the breaking point.

I discover, though, that the protocol the Loebner Prize was using in 2009 was unlike emails, text messages, and standard instant-messaging systems in a very crucial way. The Loebner Prize chat protocols transmitted typing keystroke by keystroke. You’re watching each other type, typos and backspacing and all.

Part of what I needed to figure out was how exactly to use the Loebner Prize competition’s unusual “live typing” medium. What did this enable and disable, compared to the standard walkie-talkie, telegram-like, turn-taking style?

To make typing visible also, then, makes typing’s “negative space” visible: hesitation. In a chat conversation where text is transmitted with every carriage return, only egregiously long pauses are taken to be “part” of the interaction. With more fluid and immediate feedback, silence acquires meaning. Failing to quickly answer a question in a face-to-face conversation, for instance, is tantamount in many cases to answering it. I recall asking a friend of mine how things were going with the girl he’d started seeing; the tiny pause and “um” in which he searched for the right words made it clear they were having problems. In other words, it wasn’t his choice of words, but the choosing, that told the tale.
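
To make the difference between the two media concrete, here is a rough Python sketch. It is not the Loebner Prize protocol, whose details I am not reproducing here; the event format (a timestamp in seconds paired with a key, with `"\b"` standing for backspace), the one-second threshold, and the function names are all assumptions for illustration. The first function shows what a send-on-return chat delivers, the second what a live-typing chat additionally exposes.

```python
from typing import List, Tuple

# Assumed event format: (seconds since start, key); "\b" means backspace.
Keystroke = Tuple[float, str]


def line_at_a_time(events: List[Keystroke]) -> str:
    """What a send-on-return chat shows: only the finished text."""
    buffer: List[str] = []
    for _, key in events:
        if key == "\b":
            if buffer:
                buffer.pop()
        else:
            buffer.append(key)
    return "".join(buffer)


def hesitations(events: List[Keystroke], threshold: float = 1.0):
    """What live typing also shows: each pause longer than `threshold`,
    paired with the text that was on screen when the pause happened."""
    pauses = []
    shown: List[str] = []
    previous_time = None
    for time, key in events:
        if previous_time is not None and time - previous_time > threshold:
            pauses.append((round(time - previous_time, 2), "".join(shown)))
        if key == "\b":
            if shown:
                shown.pop()
        else:
            shown.append(key)
        previous_time = time
    return pauses


if __name__ == "__main__":
    # Types "gr", deletes it, waits two seconds, then types "fine".
    events = [(0.0, "g"), (0.2, "r"), (0.4, "\b"), (0.5, "\b"),
              (2.5, "f"), (2.7, "i"), (2.9, "n"), (3.1, "e")]
    print(line_at_a_time(events))
    print(hesitations(events))
```

In the sample, the line-at-a-time view reports only the word “fine,” while the keystroke view also records the false start and the two-second gap that preceded it.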

I remember some Internet chat programs back in the ’90s trying out this character-at-a-time approach, but people for the most part rejected it. It was too
, was the feeling: what people like about writing is the time and space to compose and edit before sharing it with the other person. The advantage of the character-at-a-time transmission, though, is that it starts to approach much more closely the condition of speech, with its fluidity of turn-taking and its choppy grammar, where what’s lost in eloquence is made up for in agility.

Long missives weren’t going to work, as they have in certain years, where programs were able to steamroller the judges by eating up the clock and delivering ridiculously prolix answers. If two parties are taking strict turns under time pressure, they are putting themselves at the mercy of the length of the other’s turns—here, for instance, is one program from 1996:

How about telling me a joke?


Here’s my favorite joke … A dog limps into a bar with a bloody bandage on his foot. He’s wearing a 10-gallon hat, six-shooters in his holsters and chaps. He growled, “I’m a-lookin’ for the man that shot my paw.”

“The joke’s not funny …,” the judge writes, seemingly prompting the program to tell another one—which it does (“A knotty, worn-out old string walks into a bar …”). Meanwhile, three full minutes have elapsed.

Confederates, too, sometimes err on the epistolary side, as did a confederate in 1992, speaking on the topic of “Health”:

well i generally eat various garbage that happens my way. to say i have a program would be really pushin it. Doritos are far and above my favorite feel good and groovy chow.

You have to be careful with garbage. That brings out another point. Natural health is about intuition to a great extent. When you eat too much garbage, you have to be careful of maintaining balance by your own intuition because sometimes, if you don’t control it at this initial stage, you don’t know until you see the results in your health that you have been harmed. High cholesterol is perfect example of this, and we all know its results; clogged arteries that are then very difficult to remedy, and there is controversy, over whether it is reversible.

This judge voted her a computer.

If a computer (or confederate) started rambling on too long in a “barge-in-able” test, the judge would just cut them off.
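
A back-of-the-envelope sketch in Python shows why this matters for the clock. The typing rate, the judge’s patience, and the function name `deliver` are assumptions I am introducing for illustration, not measurements from any contest. Under strict turns the judge must sit through the whole reply; when barging in is possible, the cost of a ramble is capped by the judge’s patience.

```python
from typing import Optional, Tuple


def deliver(reply: str, chars_per_second: float,
            patience_seconds: Optional[float] = None) -> Tuple[str, float]:
    """Return (text the judge sees, seconds of clock consumed).

    With patience_seconds=None the exchange is strict turn-taking, so the
    whole reply arrives. Otherwise the judge cuts in once patience runs out.
    """
    full_time = len(reply) / chars_per_second
    if patience_seconds is None or full_time <= patience_seconds:
        return reply, full_time
    seen = reply[: int(patience_seconds * chars_per_second)]
    return seen, patience_seconds


if __name__ == "__main__":
    ramble = "a" * 600  # stand-in for a 600-character answer
    print(deliver(ramble, chars_per_second=5.0)[1])                         # strict turns
    print(deliver(ramble, chars_per_second=5.0, patience_seconds=10.0)[1])  # barge-in
```

At an assumed five characters per second, the 600-character answer costs two full minutes under strict turns and only ten seconds when the judge can interrupt.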

I realized something else about the character-by-character transmission, and what that might allow. Sometimes spoken dialogue becomes slightly nonlinear, as in, “I went to the store and bought milk and eggs, and on my way home I ran into Shelby, oh, and bread too,” where we understand that bread goes with “bought” and not “ran into.” (This is part of the function of “oh,” another one of those words that traditional linguistics has had no truck with.) For the most part, though, there is so little lag time between the participants, and between the composition of a sentence in their minds and their speaking it out loud, that the subject matter rarely branches entirely into two parallel threads. In an instant-message conversation, the small window of time in which one person is typing, but the other cannot see what’s being typed, is frequently enough to send the conversation in two directions at once:

how was your trip?

oh, and did you get to see the volcano?

good! how’ve things been back at the homestead?

oh, you know, the usual

yes we did get to see it!

Here the conversation starts to develop separate and parallel threads, such that each person’s remark isn’t necessarily about the most recent remark. It’s possible that watching each other type eliminates the lag that creates this branching, although I had reason to believe it would do something else altogether …

Talking simultaneously for extended periods simply doesn’t work, as our voice, emanating just inches away from our ears, mixes confusingly with our interlocutor’s in the air and makes it hard to hear what they are saying. I was fascinated to learn that the deaf don’t encounter this problem: they can comfortably sign while watching someone else sign. In large groups it still makes sense to have one “speaker” at a time, because people cannot look in more than one direction at a time, but conversations between pairs of signers, as Rochester Institute of Technology researcher Jonathan Schull observed, “involve more continuous simultaneous and overlapping signing among interlocutors” than spoken conversations. Signers, in other words, talk and listen at the same time. Schull and his collaborators conclude that turn-taking, even turn negotiation, far from being an essential and necessary property of communication, “is a reluctant accommodation to channel-contingent constraints.”

One major difference between the Loebner protocols and traditional instant messaging is that, because the text is being created without any obvious ordering that would enable it to be arranged on the screen, each user’s typing appears in a separate area of the screen. Like sign language, this makes group conversation rather difficult, but offers fascinating possibilities for two-person exchange.

Another piece of my confederate strategy fell into place. I would treat the Turing test’s strange and unfamiliar textual medium more like spoken and signed, and less like written, English. I would attempt to disrupt the turn-taking “wait and parse” pattern that computers understand and create a single, flowing duet of verbal behavior, emphasizing timing: whatever little computers understand about verbal “harmony,” it still dwarfs what they understand about rhythm.

I would talk in a way that would, like a Ferneyhough piece, force satisficing over optimization. If nothing was happening on my screen, whether or not it was my turn, I’d elaborate a little on my answer, or add a parenthetical, or throw a question back at the judge—just as we offer and/or fill audible silence when we talk out loud. If the judge took too long considering his next question, I’d keep talking. I’m the one (unlike the bots) with something to prove. If I understood what the judges were writing, I’d spare them the keystrokes or seconds and jump in.

