May 2, 2017

This Giant Library of 3D Images will Teach Machines How to Recognize Objects

Computer vision applications have become shockingly adept at recognizing the content of images. Google’s “Quick, Draw” game, for instance, uses a neural net so sophisticated that it can guess what I’m trying to depict in borderline illegible doodles that I scratch out with my cursor. As good as they are, applications like this are still best at working with 2D images—recognizing objects in 3D is another thing entirely.

According to MIT Technology Review, academics from Stanford, Princeton, and the Technical University of Munich are looking to solve this problem by building the largest available set of annotated 3D data, which in turn will be used to train neural nets using deep learning. They are calling the library ScanNet, and are positioning it as a kind of follow-up to the ImageNet data set that sparked a flurry of development in 2D computer vision about five years ago.

The data set includes millions of annotated objects “like coffee tables, couches, lamps, and TVs,” situated in thousands of 3D scenes. As detailed in an academic paper, the researchers built it by scanning scenes using an RGB camera and an infrared depth sensor, and then giving iPads to volunteers to annotate those scenes.
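For readers curious how a depth sensor's output becomes 3D data in the first place, here is a minimal sketch of the standard pinhole back-projection that turns a depth image into a cloud of 3D points. It is not taken from the ScanNet paper. The function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, and `cy` (focal lengths and principal point, in pixels) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_points(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth image into 3D points in the camera frame.

    depth[v][u] is the distance along the camera's z axis at pixel
    column u, row v. Values <= 0 are treated as "no reading" and skipped.
    All names here are hypothetical, chosen for illustration only.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # the sensor returned no valid depth for this pixel
            # Pinhole model: pixel offset from the principal point, scaled by depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    # A tiny 2x2 "depth image" with one missing reading
    tiny_depth = [[1.0, 0.0], [2.0, 2.0]]
    for point in depth_to_points(tiny_depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0):
        print(point)
```

A real scanning pipeline would also need to track the camera's position from frame to frame and fuse many such point clouds into a single scene, which this sketch leaves out.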

The good news is that the team has already applied deep learning to the data set with promising results. The neural net they trained can recognize many objects reliably using only depth information.

Though it’s too early to celebrate just yet, the technology has clear uses. For one, it could help robots recognize the difference between an object to avoid, like a kitchen table, and one that you want them to manipulate, like the bowl on that table. The neural nets could also be used in architectural and facilities management applications, where they would process a 3D data set and export detailed models made of discrete geometric objects.

Given the speed at which current computer vision applications work, it’s possible we could one day see an application that reads live 3D sensor data and recognizes what’s in it. Imagine what you could do with that.