Thanks to everyday artificial intelligence like Alexa, Siri, and Google Home, users have come to expect to use simple sentences and queries to communicate with their devices. But while you can give your smart home device a simple statement to find out about weather, movie times, or other benign things, that same ease of communication hasn't translated into the industrial space, where more and more collaborative robots (cobots) are working alongside humans.
In a new paper, presented at the International Joint Conference on Artificial Intelligence (IJCAI) in Australia in August, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) described their work developing a system called “commands in context” (ComText) that allows robots to understand a wide range of commands that require contextual knowledge about the world around them.
In their study, the researchers loaded ComText onto a Rethink Robotics Baxter robot and had it perform a variety of tasks after receiving verbal cues and commands. The researchers would place objects (like a box of Cheez-Its) onto a table and tell the robot, “This is my box.” From there, when told, “Pick up my box,” the robot is able to find the correct object and pick it up.
The CSAIL team also released a video of some of their tests.
According to their paper, the robot was able to perform the correct task 92.5% of the time by combining prior knowledge from verbal statements with visual observations of its workspace. What's more, the robot is able to accrue knowledge over time and further adjust its actions. Telling the robot, “The box and the can are my snack,” for example, will result in it picking up both objects when it is told, “Pick up my snack.”
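To make the idea of accrued knowledge concrete, here is a minimal Python sketch of a toy lookup memory. It is only an illustration and not the ComText system, which the paper describes as a probabilistic model working from real sensor data. The class `ToyContextMemory` and its methods `declare` and `resolve` are hypothetical names invented for this example, and objects are assumed to arrive as plain text labels.

```python
from dataclasses import dataclass, field


@dataclass
class ToyContextMemory:
    """Hypothetical toy memory: maps a declared phrase to a set of object labels."""

    groundings: dict[str, set[str]] = field(default_factory=dict)

    def declare(self, phrase: str, objects: list[str]) -> None:
        """Record a statement such as 'The box and the can are my snack.'"""
        resolved: set[str] = set()
        for obj in objects:
            # If obj is itself an earlier declared phrase, expand it;
            # otherwise treat it as a plain object label.
            resolved |= self.groundings.get(obj, {obj})
        self.groundings[phrase] = resolved

    def resolve(self, phrase: str, visible: set[str]) -> set[str]:
        """Return the declared objects for phrase that are currently visible."""
        return self.groundings.get(phrase, set()) & visible


memory = ToyContextMemory()
memory.declare("my box", ["box"])           # "This is my box."
memory.declare("my snack", ["box", "can"])  # "The box and the can are my snack."

workspace = {"box", "can", "cup"}           # assumed output of a vision system
print(sorted(memory.resolve("my snack", workspace)))  # ['box', 'can']
```

The intersection with the visible workspace is the toy stand-in for combining what was said earlier with what the robot currently sees: a phrase that was never declared, or an object that is out of view, resolves to nothing.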
“Where humans understand the world as a collection of objects and people and abstract concepts, machines view it as pixels, point-clouds, and 3D maps generated from sensors,” Rohan Paul, a CSAIL postdoc and one of the lead authors of the paper, told MIT. “This semantic gap means that, for robots to understand what we want them to do, they need a much richer representation of what we do and say.”
The difference, according to Paul and his team, lies in understanding that language is a “grounded” experience; that is, the meanings of words can change given context or sensory-motor cues. You can look up a term like “heavy” in the dictionary to understand what it means, but “heavy” means entirely different things to, say, the average person versus a professional CrossFit athlete. In other words, we have to interact with the world to really grasp what certain words mean. “Grounding a command necessitates a representation for past observations and interactions; however, maintaining the full context consisting of all possible observed objects, attributes, spatial relations, actions, etc., over time is intractable,” the paper said.
The researchers found a solution in a time-based model, called Temporal Grounding Graphs, that is able to understand the context of verbal utterances. A key limitation of current models is that they assume the world is static, a not entirely unreasonable assumption given that many robots are, in fact, locked into static environments in which they perform the same repetitive tasks over and over. However, greater collaboration with humans will require robots to have a better contextual understanding. Right now a human worker could tell a cobot, “Hand me my wrench,” but unless that wrench is in a designated place, matches what the robot already knows of a wrench, and is not surrounded by similar wrenches, the robot will have a good deal of trouble with the task. “Even a simple statement such as 'The fruit I placed on the table is my snack' followed by the command 'Pack up my snack' requires reasoning about previous observations of the actions of an agent,” the paper said.
Using ComText, a Baxter robot was able to identify and interact with objects based on verbal declarations and visual observations. (Image source: MIT CSAIL)
Granted, this demonstration does not represent a fully seamless verbal interaction with a robot. Ideally, a system would allow a robot to infer things based solely on visual cues and actions around objects. Your coworker can discern that a tool belongs to you simply by watching you use it; there's no need for a verbal declaration. The team from CSAIL is hoping further iterations of its model will achieve results like this as well as more complex interactions.
“We intend to extend this model to ground to sequences of actions and collections of objects, to engage in dialog while executing multi-step actions, to keep track of and infer the locations of partially observed objects, and to serve as a basis for a grounded and embodied model of language acquisition,” the paper said.