Thanks to everyday artificial intelligence like Alexa, Siri, and Google Home, users have come to expect to use simple sentences and queries to communicate with their devices. But while you can give your smart home device a simple statement to find out about weather, movie times, or other benign things, that same ease of communication hasn't translated into the industrial space, where more and more collaborative robots (cobots) are working alongside humans.
In a new paper, presented at the International Joint Conference on Artificial Intelligence (IJCAI) in Australia in August, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) described their work developing a system called “commands in context” (ComText) that allows robots to understand a wide range of commands that require contextual knowledge about the world around them.
In their study, the researchers loaded ComText onto a Rethink Robotics Baxter robot and had it perform a variety of tasks after receiving verbal cues and commands. The researchers would place objects (like a box of Cheez-Its) onto a table and tell the robot, “This is my box.” When later told, “Pick up my box,” the robot was able to find the correct object and pick it up.
The CSAIL team released a video of some of their tests:
According to their paper, the robot was able to perform the correct task 92.5% of the time by combining prior knowledge from verbal statements with visual observations of its workspace. What's more, the robot is able to accrue knowledge over time and further adjust its actions. Telling the robot, “The box and the can are my snack,” for example, will result in it picking up both objects when it is told, “Pick up my snack.”
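To make the idea of accrued knowledge concrete, here is a deliberately simplified sketch in Python. It is not the ComText model or anything taken from the paper; the class name `ToyContextMemory`, its methods, and the object labels are all hypothetical. It assumes that some perception system already reports the objects on the table by name, and it shows only how an earlier statement can change what a later command refers to.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyContextMemory:
    """Toy store of declared facts (hypothetical; not the ComText model)."""

    # Maps a user-defined phrase (e.g. "my snack") to the object labels it names.
    groups: Dict[str, Set[str]] = field(default_factory=dict)

    def declare(self, phrase: str, objects: List[str]) -> None:
        """Record a statement such as 'The box and the can are my snack.'"""
        self.groups.setdefault(phrase, set()).update(objects)

    def resolve(self, phrase: str, visible: Set[str]) -> List[str]:
        """Return the currently visible objects a phrase refers to, sorted by name."""
        return sorted(self.groups.get(phrase, set()) & visible)


memory = ToyContextMemory()
memory.declare("my snack", ["box", "can"])

# Objects an assumed perception system currently reports on the table.
visible_now = {"box", "can", "cup"}
targets = memory.resolve("my snack", visible_now)
print(targets)  # ['box', 'can']
```

A lookup table like this captures only the bookkeeping. The system described in the paper must also work out from camera data which physical object each word refers to, which is the harder part of the problem.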
“Where humans understand the world as a collection of objects and people and abstract concepts, machines view it as pixels, point-clouds, and 3D maps generated from sensors,” Rohan Paul, a CSAIL postdoc and one of the lead authors of the paper, told MIT. “This semantic gap means that, for robots to understand what we want them to do, they need a much richer representation of what we do and say.”
The difference, according to Paul and his team, lies in understanding that language is a “grounded” experience: the meanings of words can change depending on context or sensory-motor cues. You can look up a term like “heavy” in the dictionary to understand what it means, but “heavy” means entirely different things to, say, the average person versus a professional CrossFit athlete. In other words, we have to interact with the world to really grasp what certain words mean. “Grounding a command necessitates a representation for past observations and interactions; however, maintaining the full context consisting of all possible observed objects, attributes, spatial relations, actions, etc., over time is intractable,” the paper said.
The researchers found a solution in a time-based model, called Temporal Grounding Graphs, that is able to understand the context of verbal utterances. A key limitation of current models is that they assume the world is static -- a not entirely unreasonable assumption given