Alessandro Acquisti now describes an offline-to-online re-identification experiment where someone’s anonymous photo helps find the Facebook profile, predict the SSN and figure out the subject’s interests.
3 photos per participant profiles
114,745 unique faces
And the process was like this: the subject would sit at the desk, and as the photo was taken, the subject would start filling out the survey about Facebook usage. Meanwhile, the image was sent up to a cloud service, a cloud computing cluster where the shots taken were matched against Facebook images.
By the time the subject would have reached the last page of the survey, that page had been dynamically populated with the top best images found by the recognizer. And the participant had to click on “Yes, I recognize myself on this photo”, or “No, I do not recognize myself”.
So this is the schematics (see image): we had two laptops – one laptop was for the subject who was filling out the survey, the other – for the experimenters because from here we were sending the image to the cloud and getting back the results.
93 subjects participated in this experiment. The survey that they filled out told us that they were all students, and they were all Facebook members, and for 31% of them we did find indeed a match.
One out of three were identified in this way, including one subject who told us right before starting: “You will never find me”. The reason was that he didn’t have a photo of himself on Facebook, or so he believed. And this is an interesting story. We found him through an image made publicly available by one of his friends. And the biggest story here is that it’s not just how much you are revealing about yourself, but it is also how much your friends or other people are revealing about you.
The average computation time was less than three seconds. It took much of this time for the recognizer to find the match than for the student to fill out the survey and arrive to the last page with the matches.
And then we wanted to push the envelope, because okay, face re-identification is interesting, but the story we want to tell is bigger than this. What we found here so far is that we can take an anonymous face online on a dating site or offline in the street, and we can find the Facebook profile which arguably for 90% of people should give us their name.
However, two years ago when we were at Black Hat, we presented a study where we showed that we could start from a Facebook profile and end up predicting people’s Social Security Number.
So as a visual aid, what we have done so far is we started from an anonymous photo, we used PittPatt and found a Facebook profile. But two years ago, we started with a Facebook profile, we took the Death Master File, which is publicly available information, and we ended up with someone’s Social Security Number. So you are seeing where I am going with this.
As a quick parenthesis, what we did two years ago was we showed that Social Security Number had been misinterpreted by public for many years, meaning that people believed that there was much more randomness than there really was.
If you tried to predict someone’s SSN with the knowledge that existed before we published the paper, it would be really hard – in the sense that the probability of getting the first five digits right with one guess for someone born in Alaska in 1998 would be 0.0014% if you don’t know where they were born; and 1% if you do know. The probability of getting the full nine digits with less than one thousand guesses would be also extremely low – like 0.00014% (see image).
What we did was we took the Death Master File data, the data from the SSN of people who are dead, and we organized it chronologically. The X axis here for Oregon and Pennsylvania for people born in 1996 – represents date of birth, and the Y axis represents the area number (the first digit), the group number (the mid two digits of your SSN), and the serial number (the last four digits). So if you organize it graphically, you start seeing very obvious patterns such as group number remains constant over a period of time for people born in the same state over a certain period of dates; area number cycles in stepwise sequence; serial number increases linearly (see graph above).
And with the help of all this we found out that accuracy of our predictions was much higher not just for the first five digits but also the last four than what was believed before (see image)– so much higher that brute-force attacks on SSN were indeed possible.
So this is what we wanted to do in our study. We took data from the Death Master File – say, John Smith born in July 1987 in New Jersey, and John Doe also born in July 1987 in New Jersey. And we compared it with John OSN (Online Social Network), who is not dead and is not in the Death Master File but happens to be also born in New Jersey, and also born in July. And by the process of interpolation we were able to end up predicting the Social Security Number of John OSN using the data form the other Johns who were instead dead (see image to the left).
Going back to our current study, we take someone’s face, anonymous, then we find their Facebook profile, and we end up predicting the SSN. So the story is predicting SSN from faces. But once again, this is just an example. I want to stress the reason we are doing this is really to think in terms of this blending of online and offline data, which I believe is inevitable five to ten years out.
What we did was we tried to predict from faces not only the SSN but also the interests of the subjects, we also wanted to add something easier to predict and less sensitive. So we asked participants in experiment two to participate in a new study. We used their Facebook profiles to predict their interests and other personal information.
And this is really a picture which tells a story, sometimes they say: “An image is worth a thousand words”. So this is a little script that we wrote to in a way visualize Facebook tags (see image). This is an image where the tag linking to a certain Facebook ID (the numbers you see) is corresponding to a certain face. This is pretty much what happens when you are tagging people on Facebook: you are effectively creating these boxes with numbers on them.
We are visualizing it because the story we want to tell is that these red boxes are what faces were put, actually what you put through tagging. The white box is what our identifier found, for instance it found a match for this face. And because we have the tag information, we can find her profile, and from there we can find the personal information.
So we called back the participants of experiment two. Not all of them wanted to participate because some were understandably slightly freaked out by the study, although we put all possible safeguards in place. All studies were IRB1 approved, and no face, no SSN was harmed in the process of doing these experiments.
It was important for us that we would be able to get aggregate statistics of our prediction accuracy. We failed being able to know whether a certain given subject’s SSN was accurate or not, because we didn’t want to be in the position of inferring the actual SSN of a subject. So we had to run experiments in the way that we only would get aggregate data, which we did, and we also explained this to the subjects, but some of them didn’t feel sufficiently comfortable, so they started and they stopped after reading what it was really about.
For those who ended it, which was close to 50% of those who started, we got 75% of all their interests right. If we used just two attempts for the first five digits, we got 16.67% of the subjects right. If we used four attempts, we got 27.78% right. Now, this number may seem either large or low to you depending on where you come from, but I can tell you they are big, in the sense that if you use the random chance to predict the first five digits of someone’s SSN with a single attempt, the likelihood of getting it right, as I showed in previous slides, is 0.00014%.
1 – IRB (institutional review board), also known as an independent ethics committee (IEC) or ethical review board (ERB), is a committee that has been formally designated to approve, monitor, and review biomedical and behavioral research involving humans.