Visual Similarity with Neural Networks
This project is based on some of the same concepts as Semantic Image Clustering with Neural Networks, but instead of trying to present the entire universe of photographs at one time, it uses a viewer’s interest in one particular photograph to retrieve the images that most resemble it.
The problem of visual similarity in large digitized image collections has been tackled by a number of digital humanists and art historians. John Resig, author of the jQuery JavaScript library, has a terrific video describing his work on Aggregating and Analyzing Digitized Japanese Woodblock Prints, and has additionally detailed his Kress Foundation-supported efforts on the Frick Photoarchive in Using Computer Vision to Increase the Research Potential of Photo Archives. Both of these projects use TinEye’s commercial MatchEngine software, which offers a turnkey (if somewhat black box) service for computing visual similarity in large collections of images.
During the 2016 Digital Humanities conference held in Krakow, I was excited to see a presentation by Benoit Seguin et al. on Visual Patterns Discovery in Large Databases of Paintings. This talk, by a team working on the Cini photography archive, suggested ways that pre-trained convolutional neural networks designed for image captioning could be appropriated and made useful for older visual culture collections. Here’s why and how: the final result of a captioning neural network is a set of descriptive words or phrases, together with a confidence level. A picture of a cat, when run through a captioning network, would (hopefully) generate cat 98% and perhaps tiger 5% or dog 4%. These labels work great for contemporary pictures we take with our smartphones: coffee, pets, sunsets, closeups of flowers. But the farther you get from 2010 — and arguably from middle-class users in the global West and North — the less applicable these labels are. A set of medieval church paintings is unlikely to show latte art, and abstract sketches by an interwar refugee are hard to describe as cat or dog.
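If you want to see what these labels and confidence levels look like in practice, here is a minimal sketch using a pretrained ImageNet classifier from Keras. The model choice and the cat.jpg filename are illustrative assumptions, not the software behind the projects described above.

```python
# A sketch of a labeling network: a pretrained ImageNet classifier that
# returns descriptive labels with confidence scores.
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = ResNet50(weights='imagenet')  # full network, including the final label layer

img = image.load_img('cat.jpg', target_size=(224, 224))  # hypothetical file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
# decode_predictions returns e.g. [('n02123045', 'tabby', 0.71), ...]
for _, label, confidence in decode_predictions(preds, top=3)[0]:
    print(label, round(float(confidence) * 100, 1), '%')
```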
But these final labels are only the end result of a very complicated flow of virtual neurons firing at ever more specific levels. Convolutional neural networks work by first sensing broad patterns — such as horizontal lines, small circles, or vaguely-triangular green things. Each successive layer in the network gets more and more specific, till you end up with cat or dog (or whatever you’ve trained the network to see). The penultimate layer in a pre-trained labeling network — that is, the one right before the final cat or dog decisions — contains a robust, abstract representation of the image. You can think of this semi-final layer as having neatly solved the featurization problem: what parts make up a picture? A network pre-trained to distinguish cat from dog from cappuccino from car has learned to see in some pretty powerful ways. In many of the networks we use, this semi-final layer encapsulates that knowledge in 2,048 dimensions, or ways of seeing. And we can take this semi-final layer, with all of its accumulated knowledge about how vision works, and use it for the purpose of image similarity.
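As a rough sketch of how that semi-final layer can be lifted out and reused, the snippet below assumes a pretrained ResNet50 in Keras, whose pooled penultimate output happens to be exactly 2,048 numbers per image; the actual projects may well use a different network or framework.

```python
# Extract the 2,048-dimensional "semi-final" layer from a pretrained network.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False drops the final cat/dog/cappuccino decision layer;
# pooling='avg' leaves a single 2,048-dimensional vector per image.
featurizer = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def featurize(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return featurizer.predict(x)[0]  # shape: (2048,)

vector = featurize('boxer.jpg')  # hypothetical filename
```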
In middle school we learn about X and Y — and then Z, often represented by a dotted line going “in” to the chalkboard at some angle calculated by your math teacher to represent a 3-D perspective on an all-too-two-dimensional surface. Each of the network’s 2,048 ways of seeing is its own axis in an imaginary space of this kind, and 2,048 dimensions is way too many for a human to keep in their head. But computers can analyze these imaginary spaces very efficiently. We can use a technique known as approximate nearest-neighbors, helpfully implemented by Spotify in the open-source Annoy library, to find the images that are closest to a start image in all of these various dimensions.
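Here is a sketch of what that nearest-neighbor lookup might look like with Annoy. The tree count, file name, and placeholder vectors are illustrative assumptions; in practice the vectors would come from the featurization step sketched above.

```python
# Approximate nearest-neighbor search over 2,048-dimensional image vectors.
import random
from annoy import AnnoyIndex

DIMENSIONS = 2048

# Placeholder vectors; in a real pipeline these would be the CNN features
# computed for each image in the collection.
vectors = [[random.random() for _ in range(DIMENSIONS)] for _ in range(100)]

index = AnnoyIndex(DIMENSIONS, 'angular')  # angular distance, roughly cosine
for image_id, vector in enumerate(vectors):
    index.add_item(image_id, vector)

index.build(25)  # number of trees: more trees, better accuracy, bigger index
index.save('similarity.ann')

# The ten items closest to image 0 in the 2,048-dimensional space
# (the query image itself typically comes back first).
neighbors = index.get_nns_by_item(0, 10)
print(neighbors)
```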
Enough math talk! How does this work in real life? Surprisingly well…
In this image we start with a photo of a 19th-century pugilist. Hovering our mouse over him shows the images up top, which are listed in descending order of visual similarity. Although it’s dangerous to speculate on how neural networks “see” without examining the underlying layers, we might intuit that the network is responding to a particular pattern of strong diagonals expressed in these boxers’ arms. It’s presumably not responding to skin tone, as it has appropriately placed an African-American boxer alongside boxers of a lighter complexion.
You can try this tool out on two different datasets:
About 27,000 images from the Civil War to the Gilded Age, drawn from the Meserve-Kunhardt Collection at Yale’s Beinecke Rare Book & Manuscript Library.
About 10,000 images taken by Per Bagge, a Swedish photographer active in early 20th-century Lund.