“How safe is your browser?” – 2: Trackable browser fingerprints

Read previous: “How safe is your browser?” – Peter Eckersley on personally identifiable information basics

The second part of Peter Eckersley’s Defcon talk called “How unique is your browser?” is dedicated to describing the ‘Panopticlick’ experiment set up by the Electronic Frontier Foundation. The speaker describes different sets of browser fingerprint data, outlines the methods applied for measuring those values and analyzes some of the results obtained.

We put up this little website – www.panopticlick.eff.org. And if you went to that website – you can still go there – you’d see a page that tells you what’s going on and then gives you a little button you can click if you want to be part of the experiment. And if you click on that, you get a page that says: “Oh, your browser appears to be unique; it conveys up to 20 bits, possibly more, of identifying information in its fingerprint”. And then there is a little table showing what all the component measurements were, and how identifying each of them was.
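To give a rough sense of what “bits of identifying information” means, here is a small illustrative calculation (my own sketch, not part of the talk): a fact shared by a fraction p of browsers conveys -log2(p) bits, so 20 bits corresponds to roughly one browser in a million.

```typescript
// Illustrative only: the "surprisal" of a fact, in bits, given how common it is.
// A fact shared by a fraction p of browsers conveys -log2(p) bits of information.
function surprisalBits(p: number): number {
  return -Math.log2(p);
}

console.log(2 ** 20);                    // 1048576 – about one in a million
console.log(surprisalBits(1 / 2 ** 20)); // 20 bits
```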

Fingerprint information components measured within 'Panopticlick' project

So these are the 8 measurements we had (see image). The top 3 are just data strings that your browser sends to a web server when you ask for a page. The next 4 come from JavaScript: there is a little bit of JavaScript that runs on the page, and if your browser supports JavaScript and doesn’t have it disabled, that JavaScript will collect this information and send it back to us with an HTTP POST request. And lastly, if you have Flash or Java installed, we’ll go into Flash or Java, or both, and ask those plugins for your list of system fonts. So we have these different measurements.
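As a rough sketch of what the JavaScript-collected part of such a fingerprint could look like (my own illustration of the general approach, not the actual Panopticlick script; the “/submit” endpoint is a hypothetical placeholder):

```typescript
// Hedged sketch of browser-side fingerprint collection, in the spirit of the
// measurements described above: time zone, screen size and color depth,
// whether cookies are enabled, and the list of browser plugins.
function collectJsFingerprint() {
  return {
    timezoneOffset: new Date().getTimezoneOffset(),           // minutes from UTC
    screen: `${screen.width}x${screen.height}x${screen.colorDepth}`,
    cookiesEnabled: navigator.cookieEnabled,
    plugins: Array.from(navigator.plugins, p => p.name),      // installed plugins
  };
}

// Send the collected values back to the server, as the talk describes.
fetch("/submit", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(collectJsFingerprint()),
});
```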

It will turn out that the two most problematic ones – they are all somewhat problematic, but the two most problematic – are the plugins and those fonts (collected via Flash/Java) at the bottom.

So there are a lot of things we didn’t collect but which you could use to make these fingerprints even nastier. And in fact, it turns out there are some companies in the private sector that will sell you a fingerprinting system that doesn’t just do the 8 things we did but also does a lot of other stuff.

One particularly nasty thing is that you can measure the clock skew of the quartz crystal – that is, how much faster or slower your computer’s clock runs than a reference clock. That’s very hard to hide, and it’s unique to your hardware rather than just to your software. You can measure the characteristics of the operating system’s TCP/IP implementation. You can measure the order in which the headers show up.
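One way to picture the clock-skew idea (a simplified sketch of the general published technique, not anything from this experiment): collect pairs of remote timestamps and local receive times over a while, fit a line, and read the skew off the slope’s deviation from 1.

```typescript
// Simplified sketch of clock-skew estimation: given samples of a remote clock
// (e.g. TCP timestamp values converted to seconds) against our own clock, a
// least-squares slope tells us how fast the remote clock runs relative to ours.
// The sample values would come from captured packets; none are shown here.
function estimateSkewPpm(samples: Array<{ local: number; remote: number }>): number {
  const n = samples.length;
  const meanLocal = samples.reduce((s, p) => s + p.local, 0) / n;
  const meanRemote = samples.reduce((s, p) => s + p.remote, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.local - meanLocal) * (p.remote - meanRemote);
    den += (p.local - meanLocal) ** 2;
  }
  const slope = num / den;  // remote seconds elapsed per local second
  return (slope - 1) * 1e6; // deviation from 1, in parts per million
}
```

In the published work on this idea, a machine’s skew is a small, fairly constant number of parts per million, which is what makes it usable as a hardware-level identifier.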

There is a lot of stuff in ActiveX, Silverlight and other Adobe libraries that we didn’t have time to dig through and use, but you can do that. There are quirks in the way each browser has implemented JavaScript that could be identified with the right code. And there is a really nasty bug, which has only recently started being fixed in browsers, where you can measure a browser’s history using CSS detection.
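The CSS history trick he refers to worked roughly like this (a minimal sketch of the well-known :visited technique, since patched in major browsers; not code from the talk):

```typescript
// Minimal sketch of the old CSS :visited history-sniffing bug, now fixed in
// modern browsers: style visited links distinctively, then read the computed
// style back from JavaScript to learn whether a URL is in the user's history.
// Assumes the page's stylesheet contains:  a:visited { color: rgb(255, 0, 0); }
function probablyVisited(url: string): boolean {
  const link = document.createElement("a");
  link.href = url;
  document.body.appendChild(link);
  const visited = getComputedStyle(link).color === "rgb(255, 0, 0)";
  document.body.removeChild(link);
  return visited;
}
```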

And some of these things we didn’t collect because we just didn’t have time to implement them. Some we didn’t collect because we didn’t know about them. Once we put up the site, we got a lot of emails saying: “Hey, you could collect these things as well”. And lastly, some things like the CSS history, we were not sure would be stable enough to include in a fingerprint without some kind of fuzzy matching code that we didn’t have.

So, the point here is that all of our results should be taken as a kind of optimistic story about how much privacy you have: with a really good fingerprinting system, the fingerprint would be even more powerful and more revealing.

So the way we handled the data on our site is that we set a 3-month persistent cookie, and we stored an encrypted IP address with a key that we threw away. We used those primarily to avoid counting you twice. If you came back, we wanted to know whether you were the same person with the same fingerprint or a different person who happened to have the same fingerprint, because those two are very different things.
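A minimal sketch of that kind of record-keeping (my own illustration of one way to do it, not EFF’s actual code): derive an opaque token from the IP address using a random key held only in memory – the talk says “encrypted”; a keyed hash with a discarded key gives the same de-duplication effect – so that once the key is gone, the stored tokens can no longer be mapped back to addresses.

```typescript
import { createHmac, randomBytes } from "crypto";

// One way to keep a de-duplication token without keeping the IP itself:
// HMAC the address with a random key that lives only in memory. Throw the
// key away and the stored tokens can no longer be linked back to addresses.
const ephemeralKey = randomBytes(32); // never written to disk

function ipToken(ip: string): string {
  return createHmac("sha256", ephemeralKey).update(ip).digest("hex");
}
```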

And we had an exception to that, which is: if at a particular IP address we saw cookie A, then cookie B, and then cookie A again, we took that as probable, or at least potential, evidence that there was more than one computer behind that IP address – behind a NAT firewall – with the same fingerprint. And that could be important, because it could be a sign that there is a corporate network there with some sysadmin who clones the same software out to all the machines every day. So there genuinely are multiple machines with identical fingerprints there, and that gives the people in that office some protection. So we decided not to treat those as repeat visits from the same browser if they were interleaved.
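The interleaving rule can be sketched like this (an illustrative reconstruction of the heuristic as described, not the project’s code):

```typescript
// Illustrative sketch of the interleaving heuristic: given the sequence of
// cookies seen at one IP address for one fingerprint, an A, B, A pattern
// suggests multiple machines (e.g. cloned desktops behind a NAT), so those
// visits should not be collapsed into a single returning browser.
function looksLikeMultipleMachines(cookieSequence: string[]): boolean {
  const seen = new Set<string>();
  let previous: string | undefined;
  for (const cookie of cookieSequence) {
    if (cookie !== previous && seen.has(cookie)) {
      return true; // a cookie reappeared after a different one: interleaving
    }
    seen.add(cookie);
    previous = cookie;
  }
  return false;
}

// looksLikeMultipleMachines(["A", "A", "B", "A"]) === true
// looksLikeMultipleMachines(["A", "A", "B", "B"]) === false
```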

One thing to note is that a lot of people are confused by the numbers on our site. Those use the cookies but not the other methods of avoiding double counting, because we had to compute that stuff on the fly for millions of visitors. The data set I am presenting in this talk has fancier controls for that.

So we got a pretty big data set. We had 2 million hits – about a million distinct browser instances. The people we were measuring were not representative of the entire web user base; people who come to the EFF site are mostly people who know about and care about privacy. However, as I said before, you have to jump through three hoops to not be trackable on the web. So we think this is a relevant data set to look at: the people who block cookies and know about IP addresses and how to hide them, and so forth, end up being trackable by their fingerprints instead. Another point is that the data in this talk is all based on the first 470,000 instances – roughly the first half of the data set.

83.6% had completely unique fingerprints
(entropy: 18.1 bits, or more)
94.2% of “typical desktop browsers” were unique
(entropy: 18.8 bits, or more)

It turned out people were really unique. 84% of the browsers that came to our test site were completely unique in the data set. If you split the data set up and just look at the browsers that have either Flash or Java installed – and that’s the most relevant set if you are talking about desktop browsers – the uniqueness rate goes up to 94%. And only 1% had a fingerprint that we saw more than twice.

Fingerprints uniqueness distribution

So, this is the same thing on a graph – note that this graph has a log axis (see graph). If you drew this graph of how common the different fingerprints were without log axes, you’d end up with a line that runs exactly along one axis all the way and then exactly along the other, so in order to see any structure you have to put it on log scales. But the important thing to know is that the 84% of the data on the straight-line tail in the bottom right section is unique. And there is another group of people (on the horizontal straight line slightly higher), about 20,000, who have an anonymity set size of 2 – I mean there were 2 browsers that had that fingerprint. Then a small group with 3, 4, 5. And right at the other end of the range you have a small number of browser fingerprints that were not very unique at all. The one right at the top is a Firefox instance that’s not running JavaScript. So with a recent version of Firefox with no JavaScript, you have a decent amount of anonymity.

There is an interesting statistical question you might ask, which is: okay, sure, you saw 94% or 84% uniqueness in your data set, but that was only 500,000 people. Would people be less unique if you could get data for the whole 1 to 2 billion people who use the web? This is an interesting statistical question. I have a theory about how to solve it involving Monte Carlo simulations [1]: you try a hypothesized probability distribution, you run it through a simulation, and you see if it produces a graph that looks like the one I showed.
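As a sketch of what such a simulation might look like (my own illustration of the idea, with a made-up candidate distribution, not anything from the talk):

```typescript
// Monte Carlo sketch: sample a population's fingerprints from a hypothesized
// probability distribution and measure what fraction come out unique.
// The distribution used here (weights ~ 1/rank) is purely illustrative.
function simulateUniqueness(populationSize: number, numValues: number): number {
  // Cumulative distribution over `numValues` possible fingerprint values.
  const weights = Array.from({ length: numValues }, (_, i) => 1 / (i + 1));
  const total = weights.reduce((a, b) => a + b, 0);
  const cumulative: number[] = [];
  let acc = 0;
  for (const w of weights) cumulative.push((acc += w / total));

  // Binary search for the first cumulative weight >= r.
  const sample = (r: number): number => {
    let lo = 0, hi = cumulative.length - 1;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (cumulative[mid] < r) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  };

  // Draw the population and count how often each fingerprint value occurs.
  const counts = new Map<number, number>();
  for (let i = 0; i < populationSize; i++) {
    const v = sample(Math.random());
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }

  // Fraction of the population whose fingerprint value occurred exactly once.
  let unique = 0;
  for (const c of counts.values()) if (c === 1) unique++;
  return unique / populationSize;
}

// Idea: tune the hypothesized distribution until simulateUniqueness(470_000, ...)
// reproduces the observed ~84% uniqueness, then rerun with a web-scale population.
```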

But we didn’t try to do this, because our data set – which, as I said, comes from privacy-conscious visitors – is not meaningfully representative of the 1 to 2 billion browsers in existence. If someone else has a less biased data set, you could tackle this statistical question with it. We didn’t try.

Uniqueness trackability by browsers

Now, any graph that tries to show you everything that’s going on in this data set is gonna be really complicated, because it’s half a million data points, but I’m gonna try with a couple of them. This one shows, for each category of browsers (see graph) – Firefox, MSIE, Opera, Chrome, Android, iPhone, Konqueror, BlackBerry, Safari, and then a lumped-together collection of Lynx and other text-mode browsers – how good or bad it was from a uniqueness trackability point of view.

And so, if you look at this graph, anything that’s on the right-hand axis was completely unique in our data set. At the other end, we have the least revealing fingerprints. So let’s take an example: Firefox is the black line. It follows the curve at the bottom left and has a little bit of a tail in the non-unique area. That’s because some people had JavaScript turned off in Firefox, or they were running Torbutton [2] and it shows up as Firefox. And then, at the top right, there is a very large number of unique Firefoxes.

All of the desktop browsers aside from Firefox are like that, but without the little tail of non-unique people. So generally desktop browsers are bad. One of the browsers that did well is the iPhone. The iPhone does very well: it’s not very fingerprintable – it’s this purple line – and there are quite a lot of iPhones that are not unique. That’s perhaps not surprising, because there isn’t yet plugin and font variation on iPhones. Really, all you are talking about is what time zone you’re in, what language you have and maybe which version of the iPhone OS you have. There’s not very much to fingerprint an iPhone with.

Android does almost as well, not quite as well because there are more iPhones than Androids, but those phone browsers look pretty good. In practice of course, they have really bad cookie settings, so people who use them probably get tracked by the cookies. But this was a good result for the phones.

Variables measured by entropy size

Now, if we look at the variables that we measured, we had these 8 measurements (see image). So which ones were the problematic ones? This table ranks them by how identifying they were. User Agent is pretty bad – it’s about 10 bits of information: every time a web server logs your User Agent, you can expect on average that you are being narrowed down to about one thousandth of the population, which matters if you want to browse anonymously. The things that were worse were plugins – 15.4 bits – and fonts – around 14 bits.
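The “bits of information” for each measurement comes from the entropy of its observed value distribution; a small sketch of the computation (illustrative, with made-up counts):

```typescript
// Shannon entropy of a measurement, computed from how often each observed
// value occurred. A measurement with about 10 bits of entropy narrows an
// average visitor down to roughly 1/1024 of the population.
function entropyBits(valueCounts: Map<string, number>): number {
  const total = [...valueCounts.values()].reduce((a, b) => a + b, 0);
  let h = 0;
  for (const count of valueCounts.values()) {
    const p = count / total;
    h -= p * Math.log2(p);
  }
  return h;
}

// Toy example with made-up User-Agent counts:
const toyCounts = new Map([
  ["Firefox 3.6 / Windows", 500],
  ["MSIE 8 / Windows", 300],
  ["Safari 4 / Mac", 150],
  ["Something rarer", 50],
]);
console.log(entropyBits(toyCounts).toFixed(2)); // entropy of this toy distribution, in bits
```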

Distribution of variable values

Distribution of variable values

So these things that your browser publishes are very revealing. If you wanna ask, okay, what does the distribution of different values for each of these things look like, you end up with this crazy graph (see graph). It’s in our paper as well; I can try to explain it. It shows, for each of the 8 measurements separately, how many people fell into an anonymity set of size k on that measurement.

An anonymity set size of 1 means you are completely unique because of that one measurement – your fonts, say, or your plugins. So you see up here (in the top left section) there are a lot of people who are unique: 200,000 – 250,000 people who are unique just because of the plugins they have installed in their browsers. There were 200,000 (see vertical axis from the top down) who were unique just because of their fonts. 25,000 were unique just because of their User Agent, and so on.

As you go from left to right, these are less identifying values. And then in the top right part of the graph, we see that having cookies enabled wasn’t a very revealing fact.
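The graph is essentially a histogram of anonymity-set sizes per measurement. A sketch of how you might tabulate it from the raw values (illustrative only, not the project’s analysis code):

```typescript
// For one measurement (e.g. the plugin list), group visitors by value and
// tally how many visitors fall into an anonymity set of each size k.
// A visitor in a set of size 1 is uniquely identified by that measurement alone.
function anonymitySetHistogram(values: string[]): Map<number, number> {
  const byValue = new Map<string, number>();
  for (const v of values) byValue.set(v, (byValue.get(v) ?? 0) + 1);

  // Map from set size k to the number of visitors who fall into sets of that size.
  const histogram = new Map<number, number>();
  for (const setSize of byValue.values()) {
    histogram.set(setSize, (histogram.get(setSize) ?? 0) + setSize);
  }
  return histogram;
}
```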

Read next: “How safe is your browser?” – 3: unique browser fingerprints and trackability prevention
 

[1] Monte Carlo simulations (or Monte Carlo methods) are a class of computational algorithms that rely on repeated random sampling to compute their results.

[2] Torbutton is a Firefox browser extension that enables anonymizing one’s web surfing.
