During the opening keynote of Google I/O, Google Duplex demonstrated making calls to book a salon appointment and a restaurant reservation. (Image source: Google / Google AI)
Before you read any further, listen to the audio clip below:
If it sounds like an assistant making a salon appointment for her boss, you're right. But only one person in that conversation, the salon receptionist, is a human. The woman making the appointment is actually an AI assistant created by Google.
In 1950, Alan Turing proposed a test of a machine's ability to exhibit intelligent behavior indistinguishable from a human's. The idea, put simply, is to have a machine hold a conversation with a human in another room. If the human cannot tell that he or she is talking to a machine, the machine passes the test.
Based on the demonstration given at Google's I/O developer conference this week, it looks like Google's new AI, Google Duplex, passes the Turing Test with flying colors. Not only does it emulate a realistic-sounding human voice; it also captures vocal inflections—the uhmms and aaahs that are present in natural conversations.
During his keynote presentation at Google I/O, Google CEO Sundar Pichai also showed how the AI can adapt when conversations go awry because of a misunderstanding or language barrier:
Duplex isn't capable of carrying out general conversations, however. Rather than venture into such broad territory, where so much can go wrong, Google researchers opted to confine Duplex to the narrow domain of appointment scheduling for now.
Conducting natural conversations comes with a number of challenges that Google engineers needed to solve. “When people talk to each other, they use more complex sentences than when talking to computers. They often correct themselves mid-sentence, are more verbose than necessary, or omit words and rely on context instead; they also express a wide range of intents, sometimes in the same sentence...” wrote Yaniv Leviathan, Principal Engineer, and Yossi Matias, Vice President, Engineering at Google, in a blog post about Duplex's development. They added that people also generally talk to each other faster than they talk to a machine, which increases error rates. Conversations grow increasingly complex the longer they go on. Factor in that phone conversations can have all sorts of background noise and sound quality issues, and it's clear how deep the challenges can go.
Google engineers solved these problems by combining a recurrent neural network (RNN) with two text-to-speech (TTS) engines: a concatenative engine and a synthesis engine. RNNs are popular for natural language processing applications. They function by means of a sort of built-in memory that allows them to make predictions about data based on what came before. For text and speech, this means the neural network, rather than inferring the meaning of each word in a sentence independently, can interpret each word in the context of what was said before it.
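That "built-in memory" is just a hidden state that gets updated at every step. The minimal sketch below (a vanilla RNN step in NumPy, with random weights purely for illustration; it is not Duplex's architecture) shows how the state after the last word depends on every word that came before it:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state mixes the
    current input with a running summary of everything seen so far."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_xh = rng.standard_normal((input_dim, hidden_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

# Run a short "sentence" of five word vectors through the network.
h = np.zeros(hidden_dim)
for x_t in rng.standard_normal((5, input_dim)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

# h now encodes the whole sequence in order, not just the last word.
```

Because `h` is fed back into each step, changing an early word changes the final state, which is exactly the property that lets the network read a word in context.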
In the context of making an appointment, let's say a receptionist says, “Okay, you're confirmed for five.” Five could mean the size of your party, the time of the appointment, or even something else like the number of items you're coming to pick up. RNNs are able to pick up on context and figure out the actual meaning of the number. “To obtain its high precision, we trained Duplex’s RNN on a corpus of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, the history of the conversation, the parameters of the conversation (e.g., the desired service for an appointment or the current time of day), and more. We trained our understanding model separately for each task, but leveraged the shared corpus across tasks,” the Google blog says.
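Duplex learns this kind of disambiguation from training data, but the idea can be illustrated with a toy rule-based stand-in. The function below (entirely hypothetical, not Google's model) uses the previous utterance in the conversation history to decide what a bare "five" refers to:

```python
def interpret_bare_number(token, history):
    """Toy stand-in for context-aware disambiguation: decide what a
    bare number like "five" refers to by looking at the previous
    utterance. (Duplex learns this from data; these rules are
    purely illustrative.)"""
    last = history[-1].lower() if history else ""
    if "what time" in last or "when" in last:
        return "appointment_time"
    if "how many" in last or "party" in last:
        return "party_size"
    return "unknown"

# The same word resolves differently depending on what came before.
a = interpret_bare_number("five", ["What time works for you?"])   # appointment_time
b = interpret_bare_number("five", ["How many in your party?"])    # party_size
```

The real system replaces these hand-written rules with features learned from the conversation history, ASR output, and call parameters, but the input signal is the same: context, not the word alone.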
This is combined with a more sophisticated version of TTS technology. If you've used Google Assistant (said "Okay, Google" into your phone), you've already experienced a version of this. TTS samples voice recordings from a human model and recombines them to form natural sentences (at least from a grammatical standpoint). Less complex versions of TTS can sound rather stilted and crude (think HAL from 2001: A Space Odyssey), but newer models are able to learn inflections and pauses from human speech as well.
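The "recombining recordings" step at the heart of concatenative TTS can be sketched in a few lines. The example below joins prerecorded audio units with a linear crossfade, the most basic version of the splice; production systems select and smooth units far more carefully, so treat this as illustration only:

```python
import numpy as np

def crossfade_concat(units, overlap):
    """Join prerecorded audio units with a linear crossfade, the
    basic move in concatenative TTS. Each junction blends the tail
    of one unit into the head of the next."""
    out = units[0].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for u in units[1:]:
        u = u.astype(float)
        tail = out[-overlap:] * (1 - fade) + u[:overlap] * fade
        out = np.concatenate([out[:-overlap], tail, u[overlap:]])
    return out

# Three fake 100-sample "units" joined with a 20-sample overlap.
units = [np.ones(100), np.full(100, 0.5), np.zeros(100)]
audio = crossfade_concat(units, overlap=20)
```

Each junction costs `overlap` samples, so three 100-sample units yield a 260-sample result; the crossfade is what keeps the splices from clicking audibly.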
Mix sophisticated RNNs and TTS models, and you get a system which not only knows what it's hearing and how to respond, but also how to do so while sounding as naturalistic as possible.
According to the Google blog:
“The system also sounds more natural, thanks to the incorporation of speech disfluencies (e.g., “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.”
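The disfluency trick described in the quote can be sketched as a post-processing pass over the words the system is about to speak. The snippet below (a toy version; Duplex inserts fillers at unit boundaries and synthetic waits, not at random) sprinkles filler tokens between words:

```python
import random

def add_disfluencies(words, p=0.15, fillers=("hmm,", "uh,"), seed=7):
    """Sprinkle filler tokens between words so synthesized speech
    signals "still thinking" the way people do. A toy illustration:
    the real system ties fillers to sound-unit junctions and
    processing pauses rather than inserting them at random."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if out and rng.random() < p:
            out.append(rng.choice(fillers))
        out.append(w)
    return " ".join(out)

line = add_disfluencies("the next opening we have is at five".split())
```

The fixed seed makes the output repeatable; in practice the placement would be driven by the TTS engine's own timing, as the blog describes.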
According to Google, Duplex was trained using the company's TensorFlow platform and runs on Google's Tensor Processing Units (TPUs)—proprietary processors designed specifically for machine learning tasks. Right now, the system can carry out many conversations on its own, but it also self-monitors and will alert a user when it cannot complete a task autonomously. To train the system on a new domain (i.e., hair appointments versus restaurant reservations), Google supervises the system's training in real time, with a human operator overseeing it. “By monitoring the system as it makes phone calls in a new domain, they can affect the behavior of the system in real time as needed. This continues until the system performs at the desired quality level, at which point the supervision stops and the system can make calls autonomously,” the Google blog says.
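That self-monitoring handoff can be sketched as a confidence gate on each turn of the call. Everything in the snippet below—the threshold, the scoring function, the return values—is an assumption for illustration, not Google's actual mechanism:

```python
def run_call(turns, confidence_fn, threshold=0.8):
    """Sketch of self-monitoring with human handoff: handle each
    caller turn autonomously while confidence stays high, otherwise
    flag the call for a human operator. (Threshold and scoring are
    illustrative assumptions, not Google's real design.)"""
    for turn in turns:
        if confidence_fn(turn) < threshold:
            return ("escalate_to_human", turn)
    return ("completed_autonomously", None)

# A toy scorer that gets confused by an unexpected question.
score = lambda t: 0.3 if "?" in t and "closed" in t else 0.95
result = run_call(
    ["Hello?", "Sure, what time?", "Are you closed Friday?"], score
)
# -> ("escalate_to_human", "Are you closed Friday?")
```

The same gate is what makes supervised training in a new domain workable: while quality is below the bar, every call effectively escalates to the human operator watching it.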
Pichai told the Google I/O audience that Google believes Duplex, done correctly, will create significant time savings for customers and businesses. However, Duplex isn't ready for an immediate, widespread rollout. In their blog, Leviathan and Matias said Google plans to begin testing Google Duplex within Google Assistant as soon as this summer.
Pichai also outlined a more immediate use case in his keynote. “For example, every single day, we get a lot of queries into Google wondering the opening and closing hours of businesses. But it gets tricky during holidays,” Pichai said. “So as Google can make just one phone call and update the information for millions of users and it'll save a small business a countless number of calls...this is going to be rolling out as an experiment in the coming weeks.”
Watch the full Google Duplex demo from Google I/O below: