Former Google engineer Brian Kennish delivers a speech at Defcon about the scope of user tracking being conducted by large media companies.
My name is Brian Kennish. I am gonna be talking about how our web browser history is leaking into the cloud. I never actually talk about myself much at events like this, but I think, given the topic, I have to do a little bit here, and it’s really gonna be more of a confession than autobiography.
About 10 years ago I showed up at DoubleClick, and my job was to figure out mobile advertising. And at that time, no one knew about mobile advertising, but I especially had absolutely no clue.
So I used DoubleClick’s money to get hold of every mobile device in the world that I could. I got a big pile of ugly phones, but there were a couple of cool-looking Japanese phones too. I plugged these things into a proxy server to see what data they were sending, and what data we could target ads against. I still clearly remember being kind of shocked when I saw that these things were transmitting location. And I thought to myself: “Why the hell would anyone want DoubleClick to know almost exactly where they are, to see an advertisement?”
But there I was, and of course I figured advertisers would be into this stuff, so I put it in our mobile ad server. It turned out we were like 7 years too early on mobile advertising.
And not being satisfied just working at the biggest data collection company in the advertising space, I went to the biggest data collector in the world – Google. I was an engineer at Google for a long time, worked on a lot of stuff, and mostly ad stuff as well: AdWords and AdSense, and later on I worked on Wave. The last thing I was working on was Chrome.
So about 10 months ago, while I was happily working on the Chrome team, I studied this article in the Wall Street Journal which was about Facebook leaking personally identifiable information to third-party app developers (see image). And it sort of got me thinking about the huge amount of data that Facebook was collecting about us, specifically all the data that they were collecting sort of in an invisible way, when we weren’t on Facebook.com.
So I went home that night and whipped up this quick Chrome extension called ‘Facebook Disconnect’ (see image). I spent about 4 hours doing this thing, and I thought it was really like a throw-away thing. People seem to be impressed when I tell them it took me 4 hours. But to be honest, I spent 2 and a half of those hours making the logo of this extension. And the entire code base of that thing is actually like 20 lines of code – that’s pretty embarrassing.
I had done a couple of personal browser extensions up to that point, I think one of them had something like 36 users. I just released this thing and figured there might be a worldwide audience of like 50 people, the size of a football team. But within 2 weeks, there was an entire stadium full of people using this thing. More than 50,000 people had installed and were running it.
And that got me thinking: “Hey, maybe people actually care about this privacy stuff. I know I do.” So I wanted to do a follow-up extension that did more than just stop your data from going to Facebook. But the problem was, again, I was working at the biggest data collector in the world.
I asked the lawyer what would happen if I did a broadened extension that included, you know, depersonalizing stuff on Google. And he said I would probably get sued. I didn’t like that idea, so I quit Google and spent 3 weeks making this follow-up extension which I just called plain ‘Disconnect’. That stopped your browsing history from going to all the major social networks, and it also depersonalized your searches. So if you did a search on Google or Yahoo, it wouldn’t be tied to your name anymore.
So anyway, this stuff got a little bit of press attention, in particular a reporter from the Wall Street Journal asked me if I thought there were any big privacy stories that hadn’t been told yet. And I said: “Yes, social widgets.”
I explained to him what was going on in the social widgets space, and it went something like this. I am gonna do this very quickly because it’s a little bit of Web 101.
So we go to a web page, and the web page might contain some sensitive information, in this case this page is about depression treatment (see image). Besides the first-party content – the actual article on this page, – there is a bunch of third-party widgets and content on this page, one in particular here is an advertisement.
In order for your browser to render this ad, it sends a request to the ad server. And the request is just a bunch of plain text that looks like this (see image). Obviously, it says where the request is being sent, in this case this is an ad from DoubleClick. The request also contains this thing called the referrer URL (the HTTP “Referer” header), which tells the server where the request came from. And in this case, it tells the server that we were looking at this page about depression treatment, it has the URL for that page. And finally, there can be a bunch of cookies in the request. In this case, one of the cookies has an ID in it. So this ID uniquely identifies me.
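To make the three parts of that request concrete, here is a minimal sketch that assembles the plain text a browser might send for an embedded ad. The path, the page URL on health.example, and the cookie value are made-up placeholders, not values from the talk; only the header names (Host, Referer, Cookie) are standard HTTP.

```python
def build_request(host, path, referer, cookies):
    """Return the plain-text HTTP/1.1 request for an embedded third-party resource."""
    # Cookies travel in a single header as name=value pairs separated by "; ".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",              # where the request is going
        f"Referer: {referer}",        # the page the user was looking at
        f"Cookie: {cookie_header}",   # includes the unique ID
        "",
        "",                           # a blank line ends the header section
    ]
    return "\r\n".join(lines)


# Hypothetical values, for illustration only.
request_text = build_request(
    host="ad.doubleclick.net",
    path="/adj/example",
    referer="http://www.health.example/depression-treatment",
    cookies={"id": "22a3c5e9f0000001"},
)
print(request_text)
```

Note that the header is spelled "Referer" with a single "r" in HTTP, even though the word is "referrer".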
Now, most people are probably okay with this set of data that is being sent to an ad server, because presumably this number isn’t attached to my identity: it’s not my name, it’s just a random set of numbers. Well, I’ll talk later about why that is maybe not such a good assumption anymore. But for the last 15 years or so, we could have assumed that this was anonymous information that was being sent to the ad server.
If we go back to this page and look at what else is on this page, we also have this bunch of social widgets. So we have stuff from Facebook and Twitter, and the new Google+ button. And if we look at the requests that get sent out, in this case it’s gonna look really similar (see image below).
So here we are looking at the request for the Facebook widget, it’s going to Facebook.com. We get that identical referrer URL, and finally, again, we have a cookie with a unique ID in it. Now, this looks almost identical to the request that we just looked at. But there is a huge difference here. The difference is that this ID is no longer just a string of numbers, it actually points to my Facebook profile. So it’s not just my browsing history with a set of numbers that Facebook is getting, it’s like Facebook is actually getting my name. They know that Brian Kennish is actually looking at that page. And not only are they getting that information, they are getting all the other information that I’ve explicitly given them, like my age, and where I live, and who my friends are.
So you’d think with all this browsing history attached to our name, that these companies would at least say what they are doing with the data. And at that time, I looked up what they were doing, and all I found was 404 pages. There was nothing: Facebook didn’t say what they were doing with the data, nor Google, nor Twitter.
So I explained this whole scenario to this Wall Street Journal reporter, and he said: “Well, that’s kind of interesting, but how big of a problem is this really? I mean, can you quantify how much of a browsing history they are really getting?” And I said: “Hmm… That’s a really good question, good luck finding out the answer.” But he was a reporter, so he kept asking me over and over. And finally I figured I would answer the question in a way that Google would answer the question – by writing a web crawler to figure out the prevalence of all these tracking companies.
A reverse-tracking spider
Our goals with this crawler were to get a list of the most popular sites on the web, and then to go to each of those and crawl them to a link depth of one. So the way a search engine crawler normally works is it crawls at least to a link depth of three, which means it goes to a home page, gets all the links on that page – 1; then goes to all those pages, gets all the links on those pages – 2; and then goes to all those pages and gets all the links on those pages – 3.
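The notion of link depth can be sketched as a breadth-first walk with a hop limit. This is only an illustration of the idea, not the crawler used for the study: the toy link graph and the function names below are invented, and a real crawler would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque


def crawl(start, get_links, max_depth):
    """Breadth-first crawl from `start`, following links at most `max_depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links found on pages at the depth limit
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


# Hypothetical site structure: page -> links found on that page.
TOY_SITE = {
    "/": ["/news", "/about"],
    "/news": ["/news/story-1", "/"],
    "/about": ["/team"],
    "/news/story-1": ["/news/story-1/comments"],
}


def toy_links(page):
    return TOY_SITE.get(page, [])


depth_one = crawl("/", toy_links, max_depth=1)  # home page plus the pages it links to
depth_two = crawl("/", toy_links, max_depth=2)  # one more hop out
print(sorted(depth_one))
print(sorted(depth_two))
```

With a depth of one, only the home page and the pages it links to directly are visited, which keeps the page count per site manageable.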
– indexed 1,000 sites
– analyzed 201,358 pages
– identified 6,926 third parties
And so for all those pages that we got, we were going to extract the third-party domain names from all the resources on those pages that sent these HTTP requests with the referrer URLs. I ran this thing over the course of a week. I decided I was gonna run it out of Starbucks, just for fun. And after a week, we ended up indexing the 1,000 most popular sites. We analyzed just over 200,000 pages. And on these 1,000 sites, we identified nearly 7,000 different third parties.
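Extracting third-party domain names from a page can be sketched roughly as follows, using only Python’s standard library. This is an assumption about how such a step might look, not the actual crawler code. It checks only the src attribute of script, img and iframe tags, and it treats the last two labels of a host name as the “site”, which is a deliberate simplification (it would be wrong for suffixes like co.uk). The sample HTML and the health.example domain are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collect the src URLs of tags that make the browser fetch a resource."""

    RESOURCE_TAGS = {"script", "img", "iframe"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in self.RESOURCE_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def site_of(host):
    """Naive 'site' of a host name: its last two labels (a simplification)."""
    return ".".join(host.lower().split(".")[-2:])


def third_party_domains(page_url, html):
    """Return the sites of embedded resources that differ from the page's own site."""
    collector = ResourceCollector()
    collector.feed(html)
    first_party = site_of(urlparse(page_url).hostname)
    found = set()
    for src in collector.urls:
        # Resolve relative and protocol-relative URLs against the page URL.
        host = urlparse(urljoin(page_url, src)).hostname
        if host and site_of(host) != first_party:
            found.add(site_of(host))
    return found


# Hypothetical page, for illustration only.
SAMPLE_HTML = """
<html><body>
<img src="/logo.png">
<script src="http://ad.doubleclick.net/adj/example.js"></script>
<iframe src="//www.facebook.com/plugins/like.php"></iframe>
<script src="http://static.health.example/app.js"></script>
</body></html>
"""

domains = third_party_domains(
    "http://www.health.example/depression-treatment", SAMPLE_HTML
)
print(sorted(domains))
```

Each of the domains found this way would receive a request carrying the page’s URL as the referrer.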
And the output of this crawler was a really big ugly spreadsheet, so I’ve broken the results out into more viewable chunks here.
So the first set of stats that we are going to look at here are the non-social services: things like advertising, analytics and content services (see image). These are the services that we could presume are anonymous. The first thing I want to point out is how prevalent they are. So the top service here appeared on 23% of the top 1,000 websites. Essentially, they are seeing 23% of our browsing history. So if you think about opening your web browser, going to your browsing history, randomly picking 23% of the pages in there, and then sending them in this case to Googleapis.com – that’s basically what we are already doing.
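A prevalence figure like that 23% is just the share of crawled sites on which a service shows up at least once. Here is a minimal sketch of that arithmetic, assuming the crawl output has been reduced to a mapping from each site to the third-party domains seen on it. The site names and the tiny data set are invented and do not reproduce the study’s numbers.

```python
from collections import Counter


def prevalence(site_to_third_parties):
    """Map each third party to the fraction of sites it appears on."""
    counts = Counter()
    for third_parties in site_to_third_parties.values():
        # set() so a service counts once per site, however many pages embed it.
        counts.update(set(third_parties))
    total = len(site_to_third_parties)
    return {service: n / total for service, n in counts.items()}


# Hypothetical crawl results: site -> third parties seen anywhere on the site.
CRAWL = {
    "news.example": ["googleapis.com", "doubleclick.net", "googleapis.com"],
    "shop.example": ["doubleclick.net"],
    "blog.example": ["googleapis.com", "facebook.com"],
    "wiki.example": [],
}

shares = prevalence(CRAWL)
for service, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{service}: {share:.0%}")
```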
The next thing I want to point out is how much Google stuff there is. So the top 5 services are all from Google. And the way we did this analysis is that we broke out each service separately. But other researchers had looked at it as an aggregate. For example, they found that some Google services appear on 97 different sites out of the top 100 sites – it’s pretty amazing just how prevalent Google stuff is.
And the last thing I want to point out gets at the anonymity issue, which is that most of the services on this list are part of big data companies that also have personal information. So Google obviously has personal information when we log into things like Gmail and Docs, and so forth; Adobe has personal information – they have Photoshop online; Amazon obviously. Just under the top 10 on this list is AddThis which was purchased by Microsoft and obviously has personal information.
So at any point, these big data companies can decide to link up their anonymous data sets with their personal data sets. And what that would mean is that not only is your browsing history going forward being tracked, but all of the past 15+ years of your browsing history could instantly be attached to your name.
And this is not some hypothetical scenario. It’s actually something that happened at Google a couple of years ago. The Wall Street Journal published some leaked documents where Google was debating linking up their personal and anonymous data. So it’s something that could actually be happening already or certainly can happen in the future.
This next set of stats is the social services: everything that does have your name (see image). And you can see that Facebook is hugely prevalent. They are on a third of the top 1,000 websites. The really amazing thing about this number is that at the time we did this analysis, the Facebook ‘Like’ button had just turned 1 year old. So in 1 year, they went from 0% to 33%. Likewise, when we did this analysis, Google which was on a quarter of all the top 1,000 websites didn’t have the ‘+1’ button yet.
These stats are really just directional, they are probably going to increase hugely over the next year or so. And the stat I was probably the most surprised about was Twitter – their social widgets were younger than Facebook, and they were already on a fifth of the top 1,000 websites. So these guys are getting a huge chunk of our browsing history with our names. In summary, we identified 350 different services that get at least 1% of our browsing history. We identified 33 that get at least 5%; and 16 that get at least 10%.
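Summary counts of this kind come from applying cut-offs to the per-service shares. As a small sketch, with made-up service names and shares rather than the study’s data:

```python
def count_at_least(shares, threshold):
    """Count the services whose share of sites is at least `threshold`."""
    return sum(1 for share in shares.values() if share >= threshold)


# Hypothetical shares of the top sites on which each service appears.
SHARES = {
    "social-a.example": 0.33,
    "search-b.example": 0.25,
    "micro-c.example": 0.20,
    "ads-d.example": 0.07,
    "stats-e.example": 0.02,
    "tiny-f.example": 0.005,
}

summary = {t: count_at_least(SHARES, t) for t in (0.01, 0.05, 0.10)}
for threshold, count in summary.items():
    print(f"at least {threshold:.0%}: {count} services")
```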
Tracking the trackers
Now, this data ended up getting published in this Wall Street Journal article (see screenshot). Some longer tail data got published in a CNN article. But like I said, this data was directional, we wanted people to be able to see what was happening on an ongoing basis. So we’ve created this tool that we are putting out today, and the URL is db.disconnect.me.
We are trying to accomplish 2 things with this tool. All of those sets of stats that I just went over quickly were thrown into this tool. So we have a set of automated stats about all the top websites, we have a list of them in here (see screenshot). You can drill down and look at specific data on any site. For example, Yahoo has 75 different unique third parties on their site. When you go to an average Yahoo page, there are almost 5 different third parties on the page, which means that not only are you sending your browser history obviously to yahoo.com, which is where you are, but your browsing history is going to 5 other places.
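The two per-site figures mentioned for Yahoo, the number of unique third parties across the whole site and the average number per page, can be computed along these lines. This is a sketch under the assumption that the crawl stores, for each page, the set of third-party domains found on it. The page paths and tracker names are invented.

```python
def site_stats(pages):
    """Return (unique third parties across the site, average third parties per page)."""
    if not pages:
        return 0, 0.0
    unique = set().union(*pages.values())
    average = sum(len(third_parties) for third_parties in pages.values()) / len(pages)
    return len(unique), average


# Hypothetical per-page results for one site: page -> third parties on that page.
PAGES = {
    "/": {"ads.example", "stats.example"},
    "/news": {"stats.example", "social.example", "video.example"},
    "/mail": {"ads.example"},
}

unique_count, per_page = site_stats(PAGES)
print(f"{unique_count} unique third parties, {per_page:.1f} per page on average")
```

The site-wide count is larger than the per-page average because different pages embed different third parties.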
The second thing that we wanted to address with this tool is the problem that I mentioned earlier, which is that, while we can see where our data is going, we can see that it is going to Facebook or Google, they don’t really do a good job telling us what they are doing with our data.
So we’ve teamed up with Mozilla to work on the ‘Icon’ project, where we can turn every website into a set of ‘privacy icons’ that make it easy to identify what they are doing once they actually get our data. If we go back and look at this Yahoo page here (see screenshot), you can see we have these 4 icons representing whether Yahoo is selling our data, how readily they turn it over to authorities, and how long they keep that data.