How Our Browsing History Is Leaking into the Cloud

Former Google engineer Brian Kennish delivers a speech at Defcon about the scope of user tracking being conducted by large media companies.

My name is Brian Kennish. I am gonna be talking about how our web browser history is leaking into the cloud. I never actually talk about myself much at events like this, but I think, given the topic, I have to do a little bit here, and it’s really gonna be more of a confession than autobiography.

About 10 years ago I showed up at DoubleClick, and my job was to figure out mobile advertising. And at that time, no one knew about mobile advertising, but I especially had absolutely no clue.

DoubleClick’s website during Brian’s employment there

So I used DoubleClick’s money to get hold of every mobile device in the world that I could. I got a big pile of ugly phones, but there were a couple of cool-looking Japanese phones too. I plugged these things into a proxy server to see what data they were sending, and what data we could target ads against. I still clearly remember being kind of shocked when I saw that these things were transmitting location. And I thought to myself: “Why the hell would anyone want DoubleClick to know almost exactly where they are, to see an advertisement?”

But there I was, and of course I figured advertisers would be into this stuff, so I put it in our mobile ad server. It turned out we were like 7 years too early on mobile advertising.

And not being satisfied just working at the biggest data collection company in the advertising space, I went to the biggest data collector in the world – Google. I was an engineer at Google for a long time, worked on a lot of stuff, and mostly ad stuff as well: AdWords and AdSense, and later on I worked on Wave. The last thing I was working on was Chrome.

Article on privacy breach by Facebook

So about 10 months ago, while I was happily working on the Chrome team, I read this article in the Wall Street Journal which was about Facebook leaking personally identifiable information to third-party app developers (see image). And it sort of got me thinking about the huge amount of data that Facebook was collecting about us, specifically all the data that they were collecting sort of in an invisible way, when we weren’t on Facebook.com.

‘Facebook Disconnect’ extension on Chrome Web Store

So I went home that night and whipped up this quick Chrome extension called ‘Facebook Disconnect’ (see image). I spent about 4 hours doing this thing, and I thought it was really like a throw-away thing. People seem to be impressed when I tell them it took me 4 hours. But to be honest, I spent 2 and a half of those hours making the logo of this extension. And the entire code base of that thing is actually like 20 lines of code – that’s pretty embarrassing.

I had done a couple of personal browser extensions up to that point, I think one of them had something like 36 users. I just released this thing and figured there might be a worldwide audience of like 50 people, the size of a football team. But within 2 weeks, there was an entire stadium full of people using this thing. More than 50,000 people had installed and were running it.

And that got me thinking: “Hey, maybe people actually care about this privacy stuff. I know I do.” So I wanted to do a follow-up extension that did more than just stop your data from going to Facebook. But the problem was, again, I was working at the biggest data collector in the world.

Website of ‘Disconnect’ project

I asked the lawyer what would happen if I did a broadened extension that included, you know, depersonalizing stuff on Google. And he said I would probably get sued. I didn’t like that idea, so I quit Google and spent 3 weeks making this follow-up extension which I just called plain ‘Disconnect’. That stopped your browsing history from going to all the major social networks, and it also depersonalized your searches. So if you did a search on Google or Yahoo, it wouldn’t be tied to your name anymore.

So anyway, this stuff got a little bit of press attention, in particular a reporter from the Wall Street Journal asked me if I thought there were any big privacy stories that hadn’t been told yet. And I said: “Yes, social widgets.”

Social widgets

I explained to him what was going on in the social widgets space, and it went something like this. I am gonna do this very quickly because it’s a little bit of Web 101.

Web page on depression treatment

So we go to a web page, and the web page might contain some sensitive information, in this case this page is about depression treatment (see image). Besides the first-party content (the actual article on this page), there is a bunch of third-party widgets and content on this page, one in particular here is an advertisement.

Ad server request code

In order for your browser to render this ad, it sends a request to the ad server. And the request is just a bunch of plain text that looks like this (see image). Obviously, it tells your browser where to send this request, in this case this is an ad from DoubleClick. The request also contains this thing called Referrer URL which tells the server where the request came from. And in this case, it tells the server that we were looking at this page about depression treatment, it has the URL for that page. And finally, there can be a bunch of cookies in the request. In this case, one of the cookies has an ID in it. So this ID uniquely identifies me.
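
The three pieces of the request described above (destination, referrer, cookie ID) can be pulled out of the plain text with a few lines of Python. This is a minimal sketch, not the request from the slide: the path, the page URL, and the cookie value are all made up for illustration. Note that the HTTP header itself is spelled `Referer`.

```python
# Hypothetical ad request; the path, page URL, and cookie value are
# illustrative assumptions, not the contents of the talk's slide.
RAW_REQUEST = (
    "GET /adj/example/slot;sz=300x250 HTTP/1.1\r\n"
    "Host: ad.doubleclick.net\r\n"
    "Referer: http://health.example.com/depression-treatment\r\n"
    "Cookie: id=22a3c7f1b0e94d55; lang=en\r\n"
    "\r\n"
)


def parse_request(raw):
    """Split a raw HTTP/1.1 request into its request line and a header dict."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


def parse_cookies(cookie_header):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_header.split(";"):
        key, _, value = pair.strip().partition("=")
        if key:
            cookies[key] = value
    return cookies


def summarize(raw):
    """Return what the server learns: where the request went, which page
    was being viewed, and which cookie ID came along with it."""
    _, headers = parse_request(raw)
    cookies = parse_cookies(headers.get("cookie", ""))
    return {
        "sent_to": headers.get("host"),
        "page_viewed": headers.get("referer"),
        "cookie_id": cookies.get("id"),
    }


print(summarize(RAW_REQUEST))
```

Every third-party resource on a page triggers a request of this shape, which is why the same page URL and the same cookie ID can end up with many different servers.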

Now, most people are probably okay with this set of data that is being sent to an ad server, because presumably this number isn’t uniquely attached to me: it’s not my name, it’s just a random set of numbers. Well, I’ll talk later about why that is maybe not such a good assumption anymore. But for the last 15 years or so, we could have assumed that this was anonymous information that was being sent to the ad server.

Social widgets on the analyzed page

If we go back to this page and look at what else is on this page, we also have this bunch of social widgets. So we have stuff from Facebook and Twitter, and the new Google+ button. And if we look at the requests that get sent out, in this case it’s gonna look really similar (see image below).

Facebook widget request code

So here we are looking at the request for the Facebook widget, it’s going to Facebook.com. We get that identical referrer URL, and finally, again, we have a cookie with a unique ID in it. Now, this looks almost identical to the request that we just looked at. But there is a huge difference here. The difference is that this ID is no longer just a string of numbers, it actually points to my Facebook profile. So it’s not just my browsing history with a set of numbers that Facebook is getting, it’s like Facebook is actually getting my name. They know that Brian Kennish is actually looking at that page. And not only are they getting that information, they are getting all the other information that I’ve explicitly given them, like my age, and where I live, and who my friends are.

So you’d think, with all this browsing history attached to our name, that these companies would at least say what they are doing with the data. And at that time, I looked up what they were doing, and all I found was 404 pages. There was nothing: Facebook didn’t say what they were doing with the data, nor Google, nor Twitter.

So I explained this whole scenario to this Wall Street Journal reporter, and he said: “Well, that’s kind of interesting, but how big of a problem is this really? I mean, can you quantify how much of a browsing history they are really getting?” And I said: “Hmm… That’s a really good question, good luck finding out the answer.” But he was a reporter, so he kept asking me over and over. And finally I figured I would answer the question in a way that Google would answer the question – by writing a web crawler to figure out the prevalence of all these tracking companies.

A reverse-tracking spider

Our goals with this crawler were to get a list of the most popular sites on the web, and then to go to each of those and crawl them to a link depth of one. So the way a search engine crawler normally works is it crawls at least to a link depth of three, which means they go to a home page, get all the links on that page – 1; then go to all those pages, get all the links on those pages – 2; and then go to all those pages and get all the links on those pages – 3.
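
The notion of link depth can be sketched as a breadth-first crawl with a depth cutoff. This is only an illustration of the idea, not the crawler from the talk: the `LINKS` graph below is a made-up stand-in for the web, and `get_links` is a hypothetical helper that a real crawler would replace with an HTTP fetch and an HTML parse.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> links found on it.
LINKS = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def get_links(page):
    """Stand-in for fetching a page and extracting its links."""
    return LINKS.get(page, [])


def crawl(start, max_depth, get_links):
    """Breadth-first crawl; returns {page: link depth at which it was found}.

    The start page is depth 0, pages linked from it are depth 1, and so on.
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in depth_of:  # visit each page only once
                depth_of[link] = depth_of[page] + 1
                queue.append(link)
    return depth_of


print(crawl("home", 1, get_links))  # the home page plus the pages it links to
print(crawl("home", 3, get_links))  # search-engine-style depth of three
```

The number of pages tends to grow quickly with each extra level, which is why limiting the crawl to a depth of one keeps it manageable on a single machine.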

Crawl stats:

– indexed 1,000 sites

– analyzed 201,358 pages

– identified 6,926 third parties

But since I no longer worked at Google and didn’t have access to a million computers anymore or unlimited bandwidth, we figured that, for the sake of this experiment, it was enough to get a small sample and do a link depth of one.

And so for all those pages that we got, we were going to extract the third-party domain names from all the resources on those pages that sent these HTTP requests with the referrer URLs. I ran this thing over the course of a week. I decided I was gonna run it out of Starbucks, just for fun. And after a week, we ended up indexing the 1,000 most popular sites. We analyzed just over 200,000 pages. And on these 1,000 sites, we identified nearly 7,000 different third parties.
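
Extracting third-party domains from a page’s resources might look roughly like the sketch below. It is an assumption about the general approach, not the crawler’s actual code: the URLs are invented, and `site_of` uses a deliberately naive rule (the last two labels of the hostname) that gives wrong answers for suffixes such as `co.uk`. A real crawler would more likely consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Naive 'registrable domain': the last two labels of the hostname.

    Assumption: good enough for .com/.net style names only.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def third_parties(page_url, resource_urls):
    """Return the set of resource domains that differ from the page's own."""
    first_party = site_of(page_url)
    found = set()
    for url in resource_urls:
        domain = site_of(url)
        # Skip relative URLs (no hostname) and the page's own domain.
        if domain and domain != first_party:
            found.add(domain)
    return found


# Made-up page and resource URLs for illustration.
page = "http://www.example.com/article"
resources = [
    "http://static.example.com/style.css",       # first party
    "http://ad.doubleclick.net/adj/slot",         # third party
    "http://www.facebook.com/plugins/like.php",   # third party
    "http://ajax.googleapis.com/ajax/libs/x.js",  # third party
]
print(sorted(third_parties(page, resources)))
```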

And the output of this crawler was a really big ugly spreadsheet, so I’ve broken the results out into more viewable chunks here.

Crawl stats by non-social services

So the first set of stats that we are going to look at here are the non-social services: things like advertising, analytics and content services (see image). These are the services that we could presume are anonymous. The first thing I want to point out is how prevalent they are. So the top service here appeared on 23% of the top 1,000 websites. Essentially, they are seeing 23% of our browsing history. So if you think about opening your web browser, going to your browsing history, randomly picking 23% of the pages in there, and then sending them in this case to Googleapis.com – that’s basically what we are already doing.
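
A prevalence figure like the 23% above is just the share of crawled sites on which a service appears at least once. Here is a minimal sketch of that calculation, using made-up toy data with four sites rather than the real crawl results:

```python
from collections import Counter


def prevalence(third_parties_by_site):
    """Map each third party to the fraction of sites it appears on."""
    total = len(third_parties_by_site)
    counts = Counter()
    for parties in third_parties_by_site.values():
        counts.update(set(parties))  # count each service once per site
    return {party: n / total for party, n in counts.items()}


# Made-up toy data: four sites and the third parties seen on each.
crawl_results = {
    "site-a.example": {"googleapis.com", "facebook.com"},
    "site-b.example": {"googleapis.com"},
    "site-c.example": {"twitter.com", "facebook.com"},
    "site-d.example": set(),
}

shares = prevalence(crawl_results)
# Most prevalent first; ties broken alphabetically.
for party, share in sorted(shares.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{party}: {share:.0%}")
```

Reading a site-level share as a share of someone’s browsing history is an approximation, since it treats every site as equally likely to be visited.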

The next thing I want to point out is how much Google stuff there is. So the top 5 services are all from Google. And the way we did this analysis is that we broke out each service separately. But other researchers have looked at it as an aggregate. For example, they found that at least one Google service appears on 97 of the top 100 sites – it’s pretty amazing just how prevalent Google stuff is.

And the last thing I want to point out gets at the anonymity issue, which is that most of the services on this list are part of big data companies that also have personal information. So Google obviously has personal information when we log into things like Gmail and Docs, and so forth; Adobe has personal information – they have Photoshop online; Amazon obviously. Just under the top 10 on this list is AddThis, which was purchased by Microsoft and obviously has personal information.

So at any point, these big data companies can decide to link up their anonymous data sets with their personal data sets. And what that would mean is that not only is your browsing history going forward being tracked, but all past 15+ years of your browsing history could instantly be attached to your name.

And this is not some hypothetical scenario. It’s actually something that happened at Google a couple of years ago. The Wall Street Journal published some leaked documents where Google was debating linking up their personal and anonymous data. So it’s something that could actually be happening already or certainly can happen in the future.

Prevalence of social services

This next set of stats is the social services: everything that does have your name (see image). And you can see that Facebook is hugely prevalent. They are on a third of the top 1,000 websites. The really amazing thing about this number is that at the time we did this analysis, the Facebook ‘Like’ button had just turned 1 year old. So in 1 year, they went from 0% to 33%. Likewise, when we did this analysis, Google, which was on a quarter of all the top 1,000 websites, didn’t have the ‘+1’ button yet.

These stats are really just directional, they are probably going to increase hugely over the next year or so. And the stat I was probably the most surprised about was Twitter – their social widgets were younger than Facebook’s, and they were already on a fifth of the top 1,000 websites. So these guys are getting a huge chunk of our browsing history with our names. In summary, we identified 350 different services that get at least 1% of our browsing history. We identified 33 that get at least 5%; and 16 that get at least 10%.

Tracking the trackers

Wall Street Journal article on user tracking

Now, this data ended up getting published in this Wall Street Journal article (see screenshot). Some longer tail data got published in a CNN article. But like I said, this data was directional, we wanted people to be able to see what was happening on an ongoing basis. So we’ve created this tool that we are putting out today, and the URL is db.disconnect.me.

Disconnect DB stats about data harvesting on major sites

We are trying to accomplish 2 things with this tool. All of those sets of stats that I just went over quickly were thrown into this tool. So we have a set of automated stats about all the top websites, we have a list of them in here (see screenshot). You can drill down and look at specific data on any site. For example, Yahoo has 75 different unique third parties on their site. When you go to an average Yahoo page, there are almost 5 different third parties on the page, which means that not only are you sending your browser history obviously to yahoo.com, which is where you are, but your browsing history is going to 5 other places.
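
The two per-site numbers quoted for Yahoo (unique third parties across the site, and the average number per page) can be derived from page-level crawl data along these lines. This is a hedged sketch with invented pages for a hypothetical site, not the tool’s actual implementation:

```python
def site_stats(pages):
    """Summarize one site.

    `pages` maps each crawled page URL to the set of third parties on it.
    """
    unique = set()
    for parties in pages.values():
        unique |= parties  # union across all pages of the site
    if pages:
        average = sum(len(p) for p in pages.values()) / len(pages)
    else:
        average = 0.0
    return {"unique_third_parties": len(unique), "average_per_page": average}


# Made-up pages for a hypothetical site.
pages = {
    "http://news.example/": {"doubleclick.net", "facebook.com"},
    "http://news.example/sports": {"doubleclick.net", "twitter.com", "googleapis.com"},
    "http://news.example/weather": {"doubleclick.net"},
}
print(site_stats(pages))
```

The unique count is usually much larger than the per-page average because different pages on the same site can embed different services.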

The second thing that we wanted to address with this tool is the problem that I mentioned earlier, which is that, while we can see where our data is going, we can see that it is going to Facebook or Google, they don’t really do a good job telling us what they are doing with our data.

Report on Yahoo.com at Disconnect DB

So we’ve teamed up with Mozilla to work on the ‘Icon’ project, where we can turn every website into a set of ‘privacy icons’ that make it easy to identify what they are doing once they actually get our data. If we go back and look at this Yahoo page here (see screenshot), you can see we have these 4 icons representing whether Yahoo is selling our data, how readily they turn it over to authorities, and how long they keep that data.

And this is actually a crowdsourced project. We have a wiki-based platform here. You can go read the privacy policy of any of the sites that we have in here, and then set the icons according to what they are doing. We already have a JSON API, so we are hoping to make this widely available to other tools beyond our own.