# “How safe is your browser?” – Peter Eckersley on personally identifiable information basics

Technology Projects Director for EFF Peter Eckersley speaks at Defcon 18 on methods for identifying a person based on different sets of information, including data like one’s ZIP code, birthdate, age and gender, gradually shifting the focus onto some computer-based criteria such as cookies, IP address, supercookies and browser fingerprints.

Hello everyone, so today I am going to talk about browser fingerprinting, in particular about an experiment that we did at EFF1 to measure how fingerprintable web browsers work, called ‘Panopticlick’.

So before we get to browsers, let’s talk about identifying information. Now, when we ask what kind of information identifies a person, we have some standard sorts of answers, like if I know their name and address, I probably know who they are.

But there are some more surprising examples of identifying information. There’s a paper by Latanya Sweeney2 written in the 90s, stating that if you know someone’s ZIP code and their date of birth and their gender, then you have about an 80% probability of being able to identify them uniquely. Well, that’s a bit surprising, so let’s see how it happens.

Calculation for idenifying a person

If you start with 7 billion people on the face of the planet Earth, you rapidly narrow yourself down if you know the ZIP code to a group of about 20,000 or maybe 50,000 people, maybe in some cases 100,000, but not very many. And then you can divide that number by 365 because you know their birthdate. You can divide by a more complicated number that’s about 70 because you know how old the person is. And then you can divide in half again because you know whether they are male or female (see image).

And it turns out that if the ZIP code you started with had fewer than about 50,000 people in it, you now probably have a unique person at the end of this process.

So there is a mathematical measure you can use to say how identifying a set of facts about a person is, or how much information is required to identify someone. And if you need more bits to identify them, then each bit doubles the number of possibilities. And if you are learning more facts about them, then each bit you learn halves the number of possibilities, so you can think of these as trading off against one another.

So for instance on the face of the planet Earth with 7 billion of us people, you need about 33 bits to learn the identity of one of us. And if you learn someone’s birthdate – what day of the year they were born on – you learn about 8.51 bits.

To identify a human, we need: log2 7 billion = 33 bits
Learning someone’s birthdate: log2 365.25 = 8.51 bits

So if we talk about a random variable you might measure, like someone’s birthdate, you can talk about the amount of information you learn when you learn a particular value of that random variable. So if we are talking about birthdate, you learn that my birthdate is the 1st of March, then you’ve learned 8.51 bits about my identity. If, however, you learn that my birthdate is the 29th of February on a leap year, then you get a bit more information because the likelihood of that being true is only a quarter of what it would be for any other birthdate, so you get more information – 10.51 bits typically.

So we call that first measurement the surprisal, or self-information of the fact you’ve learned. So surprisal that I was born on the 29th of February is 10.51 bits. And then we can talk about the entropy of this type of measurement, which is the expectation value of the surprisal. So if you have a probability distribution, you measure the expectation across all of that.

Birthdate = 1st of March: 8.51 bits
Birthdate = 29th of February: 10.51 bits

A point to know is that you can’t add surprisals together. If you learn someone’s birthdate and then you learn what city they were born in, those two things are probably independent variables – not entirely but close to. So you could add the number of bits together. But if you already know someone’s birthdate, and then someone tells you their star sign, you are actually not going to learn any more information. So if you want to know how two of these measurements add together, you need to do some fancy stuff with conditional probabilities.

So what use is all of this theory? Well, we can apply it to tracking web browsers. Now, what do I mean by tracking web browsers? Two things: one is, if you go to a website on day A, and then you come back 3 weeks later, and you do something else, can the website link those two acts of yours together? And also, if on a given day you go to two different websites, and you maybe look up one fact on one, and then another fact on another, then the question is: could these two sites get together and combine the facts and get a deeper picture about you? Either of those would be tracking.

Now, there are at least three ways that are widely discussed to track web browsers. You can use cookies, which are these little bits of information your computer will store and send back, and sites can put serial numbers and tracking numbers in them. But of course we all know about cookies. You all have browser settings to delete them or limit them. So we can survive cookie tracking.

Ways to track browsers:

You can think of these things as hoops that you have to jump through in order to get privacy on the web. If you wanna not be tracked, you have to avoid tracking by cookies, avoid tracking by IP addresses, avoid tracking by supercookies. And then if you do that, it’s time to talk about whether you can be tracked by a fingerprint.

So, a fingerprint is like the example of tracking someone with some facts like the ZIP code and a date of birth. It turns out that the characteristics that the web browsers have, like the version of the browser, what operating system it’s on, etc., combine together in the same way as those other facts and perhaps they make your browser unique.

Now, there are different degrees of uniqueness that you might get out of the version information of your browser, you might have complete global uniqueness. So you sitting there in the 3rd row, your browser is completely unique in the whole wide world, and we know it. When we see it, we can track your browser. But perhaps that’s not true. Even so, browser fingerprints may be a problem because they mean that your IP address combined with your fingerprint is uniquely identifying.

Fingerprint Privacy Threats

And the last thing that makes fingerprints really nasty is that unlike cookies, they are automatically the same when different sites collect them. So if a website over here sees your fingerprint and a website over there sees your fingerprint – they see the same thing. If they both tracked you with cookies they’d get different cookies, and they’d have to have some sneaky process to link them together, whereas fingerprints are automatically the same.

We heard a lot of rumors at EFF about fingerprints. We heard that some web tracking and analytics companies have started using them. We heard rumors that web based DRM3 systems were using fingerprints to track people. We heard that they were being used as a backup authentication mechanism for financial systems, which is maybe less of a problem than first two.

And we were really curious about how affective this method was. We also worried about a more mundane question, which is every single time you go to a web site, almost all websites are configured to log your browser’s User-Agent string. And we were wondering how much of a problem that was, so we decided to get some numbers to find out.

1EFF (Electronic Frontier Foundation) is an international non-profit digital rights advocacy and legal organization based in the United States.

2Latanya Sweeney is a Distinguished Career Professor of Computer Science, Technology and Policy at Carnegie Mellon University, founder and director of the Data Privacy Lab, and an elected fellow of the American College of Medical Informatics, with almost 100 academic publications.

3DRM (Digital rights management) is a class of access control technologies that are used by hardware manufacturers, publishers, copyright holders and individuals with the intent to limit the use of digital content and devices after sale.