Alessandro Acquisti’s Black Hat USA report on face recognition continues with a detailed description of an online-to-online re-identification experiment conducted with the PittPatt application, which compares images from an identified database against images from an unidentified one.
In the first experiment, we mined data from publicly available images from Facebook, and we compared it to images from one of the most popular dating sites in the U.S. The recognizer that we used for this is an application I mentioned earlier – it’s called PittPatt. It was developed at Carnegie Mellon University and has been acquired by Google recently. It does two things: first face detection, and then face recognition. Detection is finding a face in a picture. Recognition is matching it to other faces according to some matching scores.
Facebook profiles were identified. Interestingly, we didn’t even log on to Facebook to download these images. We really wanted to show that this could be done without even getting into the network. We used the API of a popular search engine to look for Facebook profiles that are likely to be in a certain geographical area. So the only thing we could access on a Facebook profile was what is publicly available directly on the search engine, which is – for people who make themselves searchable on the search engine – your primary profile photo and your name. By the way, most users do make themselves searchable on search engines, since that is also the default setting. And as we know from behavioral decision research, people tend to stick to default settings.
We had a ‘noisy’ profile search pattern. So we had to write code to try to infer who the people in these geographical areas were. A few years ago this task would have been much easier, because Facebook at that time was actively using regional networks. This is no longer the case. However, we have something such as Current Location, which we used, plus we used a combination of other searches, such as whether they were members of, for instance, colleges in the same area, whether they were fans of companies or teams in the same area, and so forth. Obviously, this is a ‘noisy’ search.
110,984 unique faces
The second database was the dating site, which was not identified: members of dating sites of course use photos, but they also use pseudonyms. You have to use a photo if you want to go out on a date, because research shows that if you don’t, the chances that someone replies to your invitation are pretty slim. However, people don’t like to put their real names, as there is still an element of social stigma, and therefore people use pseudonyms on online dating sites. However, because they use frontal photos of their faces, they can be identified by friends who happen to be on the same dating site, or perhaps by strangers.
If you had to do this comparison manually it would be unrealistic, impossible. You would have hundreds of millions of face comparisons to do. So no human being can really take the time of having one browser open on Facebook and the other browser open on the dating site, and hope to find matches. But of course this is not a problem for a computer, especially when you start using in parallel cloud computing, because it allows millions of face comparisons in seconds.
4,959 unique faces
So this is in a nutshell what we did. These are fake images by the way; I am not exposing the private data of real dating site users. So we started from a dating site image and we compared it to an image we found on Facebook – and we ended up finding the Facebook profile, and then we could give the name to the person on the dating site.
Now, the overlap between the two sets is obviously noisy. Why? Because our search for the Facebook profile was based on keywords, not really on geographical search. So it was a little bit noisy. We cannot know exactly the overlap between the two, and we also don’t know what the ground truth is, in a sense that we don’t know exactly how many Facebook users are also members of the dating site, and how many members of the dating site are also members of Facebook.
So before we actually ran the experiment I was describing, and I am stepping back for a second, we ran two surveys online: one with subjects, users, participants from the same geographical area I am talking about; and one with a somewhat nationally representative sample. And what we asked was questions such as: “Are you on Facebook? Are you on this very dating site? How long have you been here?”, and so on and so forth.
So what we found is that for the people in the city we were studying, all the people who were on the dating site were also on Facebook, although the sample size there was pretty small. Across all our subjects, only 3% were on the dating site.
Nationwide we got similar numbers in the sense that 90% of the people who admitted being on the dating site were also on Facebook. And about 4% of all our subjects were on this particular dating site currently. If we included those who mentioned that they had been in the past on that dating site, the percentage goes up to 16%.
We didn’t ask for actual names because this was all anonymous, but we did ask if they used their real first and last names on Facebook. So take it as you may, there is always an element of self-selection in this kind of survey, but the result we got is that 90% of our subjects said ‘Yes’, they were using their real names on Facebook.
About 90% of Facebook members claimed to use their real first and last names on FB
And this data matches pretty well the study that Ralph Gross and I did a few years ago. That one was limited to Carnegie Mellon University students, and we got pretty much the very same percentage – 89% of users. In that case, we were able to verify the numbers because we could compare the answers to the survey with the actual profiles existing on the Facebook CMU network. So it seems to be a relatively robust percentage.
Notice that PittPatt had to do a pretty large number of comparisons because there were more than 500 million potential pairs, and we were comparing each photo coming from Facebook to each photo coming from the dating site. We focused only on the best matching pair found for each dating profile. That is, for an image from a dating profile you can have a long list of, say, one thousand potentially matching Facebook profiles, and we considered only the very top one: the Facebook profile image with the highest matching score found by PittPatt.
The matching scores that PittPatt produces are between -1.5, which is totally not a match, like one black image and one white image; to 20, meaning pretty much the same JPEG file.
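The pair count and the top-1 selection described above can be sketched in a few lines of Python. This is only an illustration: `toy_score` is a made-up stand-in that maps numbers onto the -1.5 to 20 range mentioned in the talk, not PittPatt's interface, and the names `total_pairs` and `best_match` are hypothetical.

```python
# Sketch under stated assumptions: faces are represented by plain floats and
# `toy_score` is a hypothetical scorer, NOT the PittPatt API.

FACEBOOK_FACES = 110_984  # unique faces in the identified set (from the talk)
DATING_FACES = 4_959      # unique faces in the unidentified set (from the talk)


def total_pairs(n_identified, n_unidentified):
    """Number of comparisons when every face is compared with every other."""
    return n_identified * n_unidentified


def toy_score(a, b):
    """Hypothetical score: 20.0 for identical inputs, -1.5 when they differ by 1."""
    return 20.0 - 21.5 * abs(a - b)


def best_match(dating_face, facebook_faces, score):
    """Return (name, score) of the single highest-scoring identified face."""
    best_name, best_score = None, float("-inf")
    for name, face in facebook_faces.items():
        s = score(dating_face, face)
        if s > best_score:
            best_name, best_score = name, s
    return best_name, best_score


if __name__ == "__main__":
    print(total_pairs(FACEBOOK_FACES, DATING_FACES))
    profiles = {"alice": 0.10, "bob": 0.32, "carol": 0.90}
    print(best_match(0.30, profiles, toy_score))
```

With the two set sizes quoted in the talk, the product is 550,369,656 pairs, which is consistent with the "more than 500 million" figure.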
We crowdsourced the validation of PittPatt’s scores to Amazon MTurk [1], because we wanted some external validation. So we basically created a script where we had MTurkers who could not have known where this data came from; there were of course no names attached to the images. They had to rate each pair on a scale of 1 through 5, meaning that the match found by PittPatt was: definitely a match (1), likely a match (2), unsure (3), likely not a match (4), definitely not a match (5).
As you know, there is quite a bit of research on using MTurk properly, because some people on MTurk are very good, diligent workers, and some people are just unbelievable cheaters. So there are a number of strategies that we have developed over time to cut out the cheaters. In addition to the traditional strategies that are used, such as trick questions before the survey even starts, what we also did was insert test pairs – definitely good matches and definitely bad matches. And if an MTurker got any of them wrong, we would kick them out, in the sense that we wouldn’t consider their results in our evaluation. We also had at least five graders grade each pair.
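The test-pair screening can be sketched as follows. The talk does not say which ratings counted as "wrong", so this sketch assumes that a known match must be rated 1 or 2 and a known non-match 4 or 5; the data layout and the function names are hypothetical too.

```python
# Sketch under stated assumptions: the pass rule below (1-2 for known matches,
# 4-5 for known non-matches) is a guess, not the rule used in the study.

def passed(rating, truth):
    """True if a 1-5 rating agrees with the known answer for a test pair."""
    if truth == "match":
        return rating in (1, 2)
    return rating in (4, 5)


def trusted_graders(answers, test_pairs):
    """Keep graders who got every test pair they rated right.

    answers:    {grader_id: {pair_id: rating}}
    test_pairs: {pair_id: "match" or "non_match"}
    """
    trusted = set()
    for grader, ratings in answers.items():
        rated_tests = [(ratings[pid], truth)
                       for pid, truth in test_pairs.items() if pid in ratings]
        if all(passed(rating, truth) for rating, truth in rated_tests):
            trusted.add(grader)
    return trusted


if __name__ == "__main__":
    tests = {"t1": "match", "t2": "non_match"}
    answers = {
        "g1": {"t1": 1, "t2": 5, "p1": 2},  # both test pairs right
        "g2": {"t1": 4, "t2": 5, "p1": 1},  # called a known match a non-match
        "g3": {"t2": 3, "p1": 5},           # "unsure" on a known non-match
    }
    print(trusted_graders(answers, tests))
```

Only the ratings of graders returned by `trusted_graders` would then be carried into the evaluation.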
At the end of the day, this is what we found: sure matches 6.3%, sure matches + highly likely matches 10.5%. What does it mean – sure or highly likely? Sure is if 2/3 of our MTurk graders gave a ranking of ‘This is a sure match’. Highly likely is if the majority of our graders gave a rating of ‘Sure match’. So about one out of ten users of the dating site were potentially identifiable.
About 1 out of 10 dating site’s pseudonymous members is identifiable
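The two thresholds can be written down as a small classifier. This sketch reads the definitions literally as stated in the talk (both are based on the share of graders who chose rating 1, "definitely a match"); the function name and the "other" label are assumptions added for illustration.

```python
# Sketch under stated assumptions: thresholds are taken literally from the
# talk; `classify_pair` and the "other" label are hypothetical.
from fractions import Fraction


def classify_pair(ratings):
    """Label one image pair from its graders' 1-5 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    sure_votes = sum(1 for r in ratings if r == 1)
    share = Fraction(sure_votes, len(ratings))  # exact, avoids float rounding
    if share >= Fraction(2, 3):
        return "sure"
    if share > Fraction(1, 2):
        return "highly likely"
    return "other"


if __name__ == "__main__":
    print(classify_pair([1, 1, 1, 1, 2]))  # 4 of 5 graders chose rating 1
    print(classify_pair([1, 1, 1, 2, 3]))  # 3 of 5 graders chose rating 1
    print(classify_pair([1, 1, 3, 4, 5]))  # 2 of 5 graders chose rating 1
```

Using `Fraction` keeps the 2/3 boundary exact, so a pair with exactly two of three graders choosing rating 1 is counted as a sure match.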
Now, consider that we used only one single Facebook photo, as I mentioned earlier: the primary profile image coming from a search engine search. And we considered only the first result found by the recognizer, rather than for instance the top ten. This is very important because these days recognizers can often be fooled by false positives. Sometimes the difference between two images is so small that maybe your best, your real match is the second or the third, or the fourth. Here we focused on the first only. But next, in experiment two, we considered larger and more varied models of attack. And of course recognizers’ accuracy will keep increasing.
1 – Amazon MTurk (Amazon Mechanical Turk) is a crowdsourcing Internet marketplace that enables computer programmers (known as Requesters) to co-ordinate the use of human intelligence to perform tasks that computers are currently unable to do.