By: John Markoff
NY Times, November 19, 2012
STANFORD, Calif. — You may think you can find almost anything on the Internet.
But even as images and video rapidly come to dominate the Web, search engines can ordinarily find a given image only if the text entered by a searcher matches the text with which it was labeled. And the labels can be unreliable, unhelpful (“fuzzy” instead of “rabbit”) or simply nonexistent.
To eliminate those limits, scientists will need to create a new generation of visual search technologies — or else, as the Stanford computer scientist Fei-Fei Li recently put it, the Web will be in danger of “going dark.”
Now, along with computer scientists from Princeton, Dr. Li, 36, has built the world’s largest visual database in an effort to mimic the human vision system. With more than 14 million labeled objects, from obsidian to orangutans to ocelots, the database has become a vital resource for computer vision researchers.
The labels were created by humans. But now machines can learn from the vast database to recognize similar, unlabeled objects, making possible a striking increase in recognition accuracy.
This summer, for example, two Google computer scientists, Andrew Y. Ng and Jeff Dean, tested the new system, known as ImageNet, on a huge collection of labeled photos.
The system performed almost twice as well as previous “neural network” algorithms — software models that seek to replicate human brain functions.
Nor are the Google researchers the only ones who have used the ImageNet database to test their algorithms; since 2009, more than 300 scientific publications have used or cited it.
Computer vision is one of the thorniest problems facing designers of artificial intelligence and robots. A huge portion of the human brain is devoted to vision, and scientists are still struggling to unlock the biological mechanisms by which humans learn to recognize objects. “My dream has long been to build a vision system that recognizes the world the way that humans do,” said Dr. Li, whose Princeton colleague is the computer scientist Kai Li (they are not related).
When she began to assemble her system in 2007, Fei-Fei Li said, the only alternatives were “toy” research databases that recognized only a handful of types of objects. The challenge was how to increase the scale of the system to bring it closer to human capabilities, especially amid the rising torrent of online images.
“In the age of the Internet, we are suddenly faced with an explosion in terms of imagery data,” she said. “Facebook has 200 billion images, and people are now uploading 72 hours of new video every minute on YouTube.”
Dr. Li did a quick back-of-the-envelope calculation and determined that if she gave the labeling task to one of her graduate students, it could take decades.
Luckily, while the Internet has given rise to a mountainous digital haystack of imagery, it also offers a path to clarity.
Dr. Li realized that Mechanical Turk, the Amazon.com system for organizing thousands of humans to do small tasks like describing the contents of a picture, was the perfect way to assemble her database.
Funded with available university research money, the ImageNet visual database project has now become the world’s largest academic user of Mechanical Turk workers, who are known as “turkers.” Each year, ImageNet employs 20,000 to 30,000 people who are automatically presented with images to label, receiving a tiny payment for each one.
The average turker can identify about 250 images in five minutes. The ImageNet database now has 14,197,122 images, indexed into 21,841 categories.
“Its size is by far much greater than anything else available in the computer vision community, and thus helped some researchers develop algorithms they could never have produced otherwise,” said Samy Bengio, a Google research scientist.
He added that ImageNet was not perfect. To organize the vast collection of images, Dr. Li uses WordNet, a database of English words designed by the Princeton psychologist George A. Miller, who died in July at 92. For Dr. Bengio, its categories are a little too elevated.
“I would have preferred if the categories chosen in ImageNet were more reflecting the distribution of interests of the population,” he said. “Most people are more interested in Lady Gaga or the iPod Mini than in this rare kind of diplodocus.”
Still, the project goes on. Jia Deng, one of Dr. Li’s graduate students, has developed an image classifier he jokingly calls infallible. Because WordNet is organized as a hierarchy of categories, the software can simply choose a level of abstraction where it has a very high probability of being correct: if it is not sure a given picture shows a rabbit, for instance, it goes to the next level (mammals) or the one above that (animals).
At one of those levels, it will almost certainly not be wrong. And Dr. Li says she expects further advances that allow ever more accuracy.
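The backoff idea described above can be sketched in a few lines of code. This is a hypothetical illustration, not Mr. Deng's actual classifier: the toy hierarchy, the `backoff_label` function, and the confidence threshold are all assumptions made for the example. The key point matches the article: because a parent category (mammal) always contains its children (rabbit), moving up the hierarchy can only make a prediction safer, so at some level the answer is almost certainly right.

```python
# Toy WordNet-style hierarchy: each category maps to its parent
# (None marks the root). In the real WordNet the hierarchy is far
# larger; this fragment exists only to illustrate the backoff step.
PARENT = {
    "rabbit": "mammal",
    "mammal": "animal",
    "animal": None,
}

def backoff_label(prediction, scores, threshold=0.9):
    """Return the most specific label the model is confident about.

    prediction -- the model's most specific guess (e.g. "rabbit")
    scores     -- estimated probability for each category; a parent's
                  probability is at least its child's, since the parent
                  category contains the child
    threshold  -- minimum confidence required to report a label
    """
    label = prediction
    # If the model is not sure enough at this level, back off to the
    # next, broader level of the hierarchy, as the article describes.
    while label is not None and scores.get(label, 0.0) < threshold:
        label = PARENT.get(label)
    # If even the top of the toy hierarchy is uncertain, fall back to
    # the most generic possible answer.
    return label if label is not None else "entity"

# Only 60% sure the picture shows a rabbit, but 95% sure it shows
# a mammal, so the classifier reports "mammal" rather than risk
# a wrong specific answer.
print(backoff_label("rabbit", {"rabbit": 0.6, "mammal": 0.95, "animal": 0.99}))
```

Under these assumptions the trade-off is explicit: the classifier sacrifices specificity ("mammal" instead of "rabbit") in exchange for being almost never wrong, which is exactly why the tool can be jokingly called infallible.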