A group of young people playing a game of frisbee.
A pizza sitting on top of a pan on top of a stove.
A person riding a motorcycle on a dirt road.
These are automatically generated captions from a computer model that starts with just the raw pixels of an image.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan in our research group at Google have been working on automatically generating these captions by combining an accurate convolutional neural network (similar to the one that won the 2014 ImageNet object recognition challenge) with a powerful recurrent neural network language model (an LSTM, a particular kind of recurrent network that is good at capturing long-range dependencies in sequence data, similar to the model our group recently used for machine translation). The system initializes the state of the language model with the features from the top of the convolutional neural network, and is then trained on a modest amount of human-labeled (image, caption) pairs. The resulting system does a good job of generating captions for previously unseen images.
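To make that architecture concrete, here is a minimal sketch in PyTorch (not the framework used for this work; all layer sizes and names are illustrative, and a tiny convolutional stack stands in for the large pretrained image classifier). The image features are fed to the LSTM as the first step of the sequence, and the LSTM is trained to predict each caption word in turn:

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Illustrative CNN encoder + LSTM decoder, in the spirit of the model
    described above (not the actual system)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stand-in encoder; a real system would use a large pretrained
        # image-classification network and take its top-layer features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.img_proj = nn.Linear(64, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)                      # (B, 64)
        img_token = self.img_proj(feats).unsqueeze(1)     # image features as the first input step
        word_tokens = self.embed(captions)                # (B, T, embed_dim)
        seq = torch.cat([img_token, word_tokens], dim=1)  # (B, T+1, embed_dim)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                           # per-step logits over the vocabulary

# Training would minimize cross-entropy between these logits and the
# next-word targets taken from the human-written (image, caption) pairs.
```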
Since two of these folks sit within 15 feet of me, I’ve enjoyed watching their progress on this project and chatting with them over the past few weeks as it has developed. The examples in the New York Times article give a good sense of what the system can do: it doesn’t always get it right, but in general, the captions it generates are very fluent, mostly relevant to the image, and sometimes show a surprising level of sophistication. Furthermore, because it is a generative model and we’re sampling from the distribution of possible captions, you can run the model multiple times and it will generate different captions. For one image, it might generate the two different captions “_A close up of a child holding a stuffed animal_” and “_A baby is asleep next to a teddy bear._”
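That variety comes directly from sampling: at each step the language model produces a probability distribution over the next word, and drawing from that distribution rather than always taking the most likely word yields a different caption each time the model is run. A minimal sampling sketch, continuing the illustrative model above (`start_id`, `end_id`, and `max_len` are hypothetical names):

```python
def sample_caption(model, image, start_id, end_id, max_len=20):
    """Draw one caption by sampling a word at each step from the model's
    predicted distribution; repeated calls give different captions."""
    model.eval()
    with torch.no_grad():
        # Prime the LSTM state with the image features, as in training.
        feats = model.img_proj(model.encoder(image.unsqueeze(0))).unsqueeze(1)
        _, state = model.lstm(feats, None)
        words = [start_id]
        for _ in range(max_len):
            inp = model.embed(torch.tensor([[words[-1]]]))
            out, state = model.lstm(inp, state)
            probs = torch.softmax(model.out(out[0, -1]), dim=-1)
            next_id = torch.multinomial(probs, 1).item()  # sample, don't argmax
            if next_id == end_id:
                break
            words.append(next_id)
    return words[1:]
```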
John Markoff of the New York Times has written a nice article about this work (along with some similar research out of Stanford that has been happening concurrently).
A Google Research blog post about the work has also just been put up here:
http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
An arXiv paper about the work has been submitted, but is waiting for the arXiv submission process to complete (the paper should be available in the next day or so).
You can see a few more examples at the end of the set of slides from a talk I gave recently in China (pages 75 to 79 of this PDF):
http://research.google.com/people/jeff/CIKM-keynote-Nov2014.pdf