Computer vision systems can sometimes make inferences about a scene that defy logic. Suppose a robot is processing a dinner table scene: it may wholly ignore a bowl that is visible to any human observer, estimate that a plate is floating above the table, or misinterpret a fork as penetrating a bowl rather than leaning against it.
When such a system is applied to a self-driving car, the stakes become much higher; systems like these, for example, have failed to detect emergency vehicles and pedestrians crossing the street.
To address these flaws, MIT researchers have created a framework that allows machines to see the world more like humans do. Their new scene-analysis artificial intelligence system learns to perceive real-world objects from just a few images and interprets scenes in terms of these learned objects.
The researchers created the framework using probabilistic programming, an AI approach that allows the system to cross-check detected objects against input data to determine whether the images captured by a camera are a likely match to any candidate scene. The system can use probabilistic inference to determine whether mismatches are caused by noise or by errors in scene interpretation that must be corrected with additional processing.
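The cross-checking idea can be illustrated with a minimal sketch, which is my own simplification rather than the paper's implementation: each candidate scene predicts the depth readings a camera should produce, a Gaussian noise model scores how well the observation matches each prediction, and a per-pixel outlier term lets the system attribute large mismatches to sensor noise. All names and numbers below are illustrative assumptions.

```python
import math

# Toy 1-D "depth image": each entry is a depth reading from the camera.
# Each candidate scene predicts the depths the camera should observe.
# We score candidates under a Gaussian pixel-noise model and pick the
# most probable explanation of the data (all values are illustrative).

def log_likelihood(observed, predicted, sigma=0.05, outlier_logp=math.log(1e-3)):
    """Per pixel: either Gaussian noise around the prediction, or an
    'outlier' (sensor noise) with a small constant probability."""
    total = 0.0
    for obs, pred in zip(observed, predicted):
        gauss = (-0.5 * ((obs - pred) / sigma) ** 2
                 - math.log(sigma * math.sqrt(2 * math.pi)))
        total += max(gauss, outlier_logp)  # robust: a mismatch may be noise
    return total

observed = [1.00, 1.01, 0.99, 0.70, 1.00]   # one reading is noisy
scene_a  = [1.00, 1.00, 1.00, 1.00, 1.00]   # hypothesis: plate on table
scene_b  = [1.20, 1.20, 1.20, 1.20, 1.20]   # hypothesis: plate floating

best = max([("on_table", scene_a), ("floating", scene_b)],
           key=lambda s: log_likelihood(observed, s[1]))
print(best[0])  # prints "on_table"
```

The outlier term is what lets inference decide that the single mismatched pixel is noise rather than evidence for a different scene.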
This common-sense safeguard enables the system to detect and correct the kinds of errors that plague the “deep-learning” approaches widely used in computer vision. Probabilistic programming also allows the system to infer probable contact relationships between objects in the scene and use common-sense reasoning about those contacts to infer much more accurate object positions.
“If you don’t know about the contact relationships, you could say an object is floating above the table; that would be a valid explanation. As humans, it’s obvious to us that this is physically impossible, and that resting on top of the table is a more likely pose for the object. Because our reasoning system is aware of this type of information, it can infer more accurate poses. That is a key insight of this work,” says lead author Nishad Gothoskar, a PhD student in electrical engineering and computer science (EECS) with the Probabilistic Computing Project.
This research could improve the performance of computer perception systems that must interpret complex arrangements of objects, such as a robot tasked with cleaning a cluttered kitchen, and could enhance the safety of self-driving cars.
Gothoskar’s co-authors include recent EECS PhD graduate Marco Cusumano-Towner; research engineer Ben Zinberg; visiting student Matin Ghavamizadeh; Falk Pollok, a software engineer in the MIT-IBM Watson AI Lab; recent EECS master’s graduate Austin Garrett; Dan Gutfreund, a principal investigator in the MIT-IBM Watson AI Lab; Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor; and senior author Vikash Mansinghka, who leads the Probabilistic Computing Project. The findings will be presented in December at the Conference on Neural Information Processing Systems.
A flashback to the past
To create the system, dubbed “3D Scene Perception via Probabilistic Programming (3DP3),” the researchers drew on an early AI concept: computer vision can be thought of as the “inverse” of computer graphics.
Computer graphics is concerned with the generation of images based on the representation of a scene; computer vision is the inverse of this process. Gothoskar and his colleagues improved the learnability and scalability of this technique by incorporating it into a framework built with probabilistic programming.
“Probabilistic programming allows us to write down our knowledge about some aspects of the world in a way that a computer can understand, while also expressing what we don’t know: the uncertainty. As a result, the system can automatically learn from data and detect when the rules are broken,” Cusumano-Towner explains.
In this case, the model is pre-programmed with knowledge of 3D scenes. For example, 3DP3 “knows” that scenes are made up of various objects and that these objects frequently lie flat on top of each other, though not always. This allows the model to reason about a scene more naturally.
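A toy generative prior can convey this "frequently, but not always" knowledge. The sketch below is my own illustration, not the paper's model: most objects are sampled resting on a support surface, but with small probability an object is placed elsewhere, so unusual configurations remain possible but improbable. All names and probabilities are assumptions.

```python
import random

# Toy generative prior over object heights (illustrative only):
# objects usually rest on the table top, but with small probability
# they sit elevated, so the prior does not rule such scenes out.

def sample_scene(objects, table_top=0.75, p_contact=0.95, rng=random):
    scene = {}
    for name in objects:
        if rng.random() < p_contact:
            z = table_top                        # resting on the table
        else:
            z = table_top + rng.uniform(0.0, 0.3)  # rare: elevated object
        scene[name] = z
    return scene

rng = random.Random(0)  # seeded for reproducibility
scene = sample_scene(["plate", "bowl", "fork"], rng=rng)
print(scene)
```

Inference against such a prior naturally prefers the "resting on the table" explanation unless the image strongly supports something else.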
Recognizing shapes and scenes
3DP3 first learns about the objects in a scene before analyzing an image of that scene. After being shown only five pictures of an object, each taken from a different angle, 3DP3 remembers the object’s shape and estimates the volume it would occupy in space.
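One classical way a few views can bound the volume an object occupies is silhouette carving: a voxel survives only if it projects inside the object's silhouette in every view. The sketch below is a deliberately tiny illustration of that intuition under my own assumptions, not 3DP3's actual shape-learning procedure.

```python
# Toy "space carving" sketch (my illustration, not the paper's method):
# a voxel is kept only if every orthographic silhouette allows it, so a
# handful of views already bounds the object's occupied volume.

def carve(grid_size, silhouettes):
    """silhouettes maps an axis name to the set of coordinate pairs
    (along the other two axes) covered by that view's silhouette."""
    kept = set()
    for x in range(grid_size):
        for y in range(grid_size):
            for z in range(grid_size):
                if ((y, z) in silhouettes["x"] and
                        (x, z) in silhouettes["y"] and
                        (x, y) in silhouettes["z"]):
                    kept.add((x, y, z))
    return kept

# A 2x2 square silhouette from each of three axes carves a 2x2x2 cube.
sq = {(a, b) for a in (0, 1) for b in (0, 1)}
voxels = carve(4, {"x": sq, "y": sq, "z": sq})
print(len(voxels))  # prints 8
```

Each additional view can only remove voxels, which is why even five views pin down a fairly accurate estimate of the occupied volume.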
“If I show you an object from five different perspectives, you can construct a fairly accurate representation of that object. You’d recognize that object in a variety of scenes because you’d understand its colour and shape,” Gothoskar says.
“This is far less data than deep-learning approaches need,” Mansinghka adds. “The DenseFusion neural object detection system, for example, requires thousands of training examples for each object type. 3DP3, in contrast, requires only a few images per object and reports uncertainty about the parts of each object’s shape that it does not know.”
To represent the scene, the 3DP3 system generates a graph in which each object is a node and the lines connecting the nodes indicate which objects are in contact with one another. As a result, 3DP3 can generate a more accurate estimate of how the objects are arranged. (Deep-learning approaches use depth images to estimate object poses, but their estimates are less accurate because these methods do not generate a graph structure of contact relationships.)
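A minimal sketch of such a contact graph, using hypothetical class and field names of my own choosing rather than 3DP3's internals: nodes hold estimated poses, edges record which object supports which, and a refinement pass snaps each supported object onto its supporter's top surface, the kind of correction the article describes for a "floating" bowl.

```python
# Illustrative scene graph (names are my assumptions, not 3DP3's API):
# nodes are objects with estimated poses; edges are contact relations.

class SceneGraph:
    def __init__(self):
        self.poses = {}      # name -> (x, y, z) of the object's base
        self.heights = {}    # name -> object height in metres
        self.contacts = []   # (supporter, supported) edges

    def add(self, name, pose, height):
        self.poses[name] = pose
        self.heights[name] = height

    def contact(self, below, above):
        self.contacts.append((below, above))

    def refine(self):
        """Snap each supported object onto its supporter's top surface,
        correcting physically implausible 'floating' pose estimates."""
        for below, above in self.contacts:
            x, y, _ = self.poses[above]
            top = self.poses[below][2] + self.heights[below]
            self.poses[above] = (x, y, top)

g = SceneGraph()
g.add("table", (0, 0, 0), height=0.75)
g.add("bowl", (0.2, 0.1, 0.83), height=0.10)  # noisy: floats 8 cm up
g.contact("table", "bowl")
g.refine()
print(g.poses["bowl"])  # prints (0.2, 0.1, 0.75)
```

Depth-only estimators lack exactly this edge structure, which is why their poses cannot be corrected against contact constraints.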
Outperforming baseline models
The researchers compared 3DP3 to several deep-learning systems, all of which were tasked with estimating the poses of 3D objects in a scene.
3DP3 generated more accurate poses than other models in almost every case and performed significantly better when some objects were partially obstructing others. And 3DP3 only required five images of each object to train, whereas the baseline models it outperformed required thousands of photos.
3DP3 also improved accuracy when used in conjunction with another model. For example, a deep-learning model might predict that a bowl is floating slightly above a table; because 3DP3 understands contact relationships and recognizes that this is an unlikely configuration, it can correct the estimate by aligning the bowl with the table.
“I was surprised by how large the errors from deep learning could sometimes be, producing scene representations where objects didn’t match what people would perceive. I was also surprised that only a tiny amount of model-based inference in our causal probabilistic program was sufficient to detect and correct these errors. Of course, there’s still a long way to go before it’s fast and robust enough for challenging real-time vision systems, but for the first time we’re seeing probabilistic programming and structured causal models outperforming deep learning on challenging 3D vision benchmarks,” Mansinghka says.
In the future, the researchers hope to push the system further, so it can learn about an object from a single image or a single frame of video and then detect that object robustly in different scenes. They’d also like to use 3DP3 to gather training data for a neural network: because labelling images with 3D geometry is often difficult for humans, 3DP3 could generate richer image labels automatically.
“The 3DP3 system combines low-fidelity graphics modelling with common-sense reasoning to correct large scene interpretation errors made by deep learning neural nets. Because it addresses important failure modes of deep learning, this approach may have broad applicability. The MIT researchers’ achievement also demonstrates how probabilistic programming technology previously developed under DARPA’s Probabilistic Programming for Advancing Machine Learning (PPAML) program can be applied to solve central problems of common-sense AI under DARPA’s current Machine Common Sense (MCS) program,” says Matt Turek, DARPA program manager for the Machine Common Sense program, who was not involved in the research but partially funded it.