Former Google engineer Brian Kennish delivers a speech at Defcon about the scope of user tracking being conducted by large media companies.
My name is Brian Kennish. I am gonna be talking about how our web browser history is leaking into the cloud. I never actually talk about myself much at events like this, but I think, given the topic, I have to do a little bit here, and it’s really gonna be more of a confession than autobiography.
About 10 years ago I showed up at DoubleClick, and my job was to figure out mobile advertising. And at that time, no one knew about mobile advertising, but I especially had absolutely no clue.
So I used DoubleClick’s money to get hold of every mobile device in the world that I could. I got a big pile of ugly phones, but there were a couple of cool-looking Japanese phones too. I plugged these things into a proxy server to see what data they were sending, and what data we could target ads against. I still clearly remember being kind of shocked when I saw that these things were transmitting location. And I thought to myself: “Why the hell would anyone want DoubleClick to know almost exactly where they are, to see an advertisement?”
But there I was, and of course I figured advertisers would be into this stuff, so I put it in our mobile ad server. It turned out we were like 7 years too early on mobile advertising.
And not being satisfied just working at the biggest data collection company in the advertising space, I went to the biggest data collector in the world – Google. I was an engineer at Google for a long time, worked on a lot of stuff, and mostly ad stuff as well: AdWords and AdSense, and later on I worked on Wave. The last thing I was working on was Chrome.
So about 10 months ago, while I was happily working on the Chrome team, I read this article in the Wall Street Journal about Facebook leaking personally identifiable information to third-party app developers (see image). And it sort of got me thinking about the huge amount of data that Facebook was collecting about us, specifically all the data that they were collecting sort of in an invisible way, when we weren’t on Facebook.com.
So I went home that night and whipped up this quick Chrome extension called ‘Facebook Disconnect’ (see image). I spent about 4 hours doing this thing, and I thought it was really like a throw-away thing. People seem to be impressed when I tell them it took me 4 hours. But to be honest, I spent 2 and a half of those hours making the logo of this extension. And the entire code base of that thing is actually like 20 lines of code – that’s pretty embarrassing.
I had done a couple of personal browser extensions up to that point; I think one of them had something like 36 users. I just released this thing and figured there might be a worldwide audience of like 50 people, the size of a football team. But within 2 weeks, there was an entire stadium full of people using this thing. More than 50,000 people had installed and were running it.
And that got me thinking: “Hey, maybe people actually care about this privacy stuff. I know I do.” So I wanted to do a follow-up extension that did more than just stop your data from going to Facebook. But the problem was, again, I was working at the biggest data collector in the world.
I asked the lawyer what would happen if I did a broadened extension that included, you know, depersonalizing stuff on Google. And he said I would probably get sued. I didn’t like that idea, so I quit Google and spent 3 weeks making this follow-up extension which I just called plain ‘Disconnect’. That stopped your browsing history from going to all the major social networks, and it also depersonalized your searches. So if you did a search on Google or Yahoo, it wouldn’t be tied to your name anymore.
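The blocking half of that idea can be sketched in a few lines. This is only an illustration of the decision described in the talk (stop requests to a social network unless you are actually on that network's own site), written in Python rather than as extension code. The domain list and the names `BLOCKED_DOMAINS`, `belongs_to`, and `should_block` are assumptions made for this sketch, not the extension's actual source.

```python
from urllib.parse import urlparse

# Assumed, illustrative list of social-network domains to cut off.
BLOCKED_DOMAINS = {"facebook.com", "twitter.com"}


def belongs_to(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def should_block(request_url: str, page_url: str) -> bool:
    """Block a request to a listed network unless the page itself is on that network."""
    request_host = urlparse(request_url).hostname or ""
    page_host = urlparse(page_url).hostname or ""
    for domain in BLOCKED_DOMAINS:
        if belongs_to(request_host, domain) and not belongs_to(page_host, domain):
            return True
    return False


# A widget request made from an unrelated page is blocked...
widget_blocked = should_block("http://www.facebook.com/widget", "http://health.example/article")
# ...but browsing the network's own site still works.
own_site_blocked = should_block("http://www.facebook.com/home", "http://www.facebook.com/")
```

The sketch covers only the request-blocking side; depersonalizing searches is a separate problem that it does not attempt.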
So anyway, this stuff got a little bit of press attention. In particular, a reporter from the Wall Street Journal asked me if I thought there were any big privacy stories that hadn’t been told yet. And I said: “Yes, social widgets.”
I explained to him what was going on in the social widgets space, and it went something like this. I am gonna do this very quickly because it’s a little bit of Web 101.
So we go to a web page, and the web page might contain some sensitive information; in this case this page is about depression treatment (see image). Besides the first-party content (the actual article on this page), there is a bunch of third-party widgets and content on this page, one in particular here is an advertisement.
In order for your browser to render this ad, it sends a request to the ad server. And the request is just a bunch of plain text that looks like this (see image). Obviously, it says where the request is being sent; in this case this is an ad from DoubleClick. The request also contains this thing called the referrer URL, which tells the server where the request came from. In this case, it tells the server that we were looking at this page about depression treatment: it has the URL for that page. And finally, there can be a bunch of cookies in the request. In this case, one of the cookies has an ID in it. So this ID uniquely identifies me.
Now, most people are probably okay with this set of data being sent to an ad server, because presumably this number isn’t attached to me personally: it’s not my name, it’s just a random set of numbers. Well, I’ll talk later about why that is maybe not such a good assumption anymore. But for the last 15 years or so, we could have assumed that this was anonymous information that was being sent to the ad server.
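For readers without the slide, here is a rough sketch of what such a request looks like as plain text. The host, path, page URL, cookie name, and ID value below are all placeholders invented for illustration, not the values from the talk. The one detail that is not an assumption is the header name: HTTP spells it `Referer`, with a single "r".

```python
def build_ad_request(ad_host: str, ad_path: str, page_url: str, cookie_id: str) -> str:
    """Return the plain-text HTTP/1.1 request a browser might send for an ad."""
    lines = [
        f"GET {ad_path} HTTP/1.1",   # what is being asked for
        f"Host: {ad_host}",          # where the request is going
        f"Referer: {page_url}",      # the page the user was looking at
        f"Cookie: id={cookie_id}",   # the unique ID stored in the browser
        "",                          # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


# Placeholder values only; none of these come from the original slide.
request_text = build_ad_request(
    ad_host="ads.example",
    ad_path="/ad.js",
    page_url="http://health.example/depression-treatment",
    cookie_id="1234567890",
)
print(request_text)
```

Everything the ad server learns in this exchange is visible in those four lines: which ad was requested, which page it was embedded in, and which browser asked.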
If we go back to this page and look at what else is on this page, we also have this bunch of social widgets. So we have stuff from Facebook and Twitter, and the new Google+ button. And if we look at the requests that get sent out, in this case it’s gonna look really similar (see image below). So here we are looking at the request for the Facebook widget, it’s going to Facebook.com. We get that identical referrer URL, and finally, again, we have a cookie with a unique ID in it.
Now, this looks almost identical to the request that we just looked at. But there is a huge difference here. The difference is that this ID is no longer just a string of numbers, it actually points to my Facebook profile. So it’s not just my browsing history with a set of numbers that Facebook is getting, it’s like Facebook is actually getting my name. They know that Brian Kennish is actually looking at that page. And not only are they getting that information, they are getting all the other information that I’ve explicitly given them, like my age, and where I live, and who my friends are.
So you’d think, with all this browsing history attached to our names, that these companies would at least say what they are doing with the data. And at that time, I looked up what they were doing, and all I found was 404 pages¹. There was nothing: Facebook didn’t say what they were doing with the data, nor did Google, nor did Twitter.
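The difference between the two requests is entirely on the server side, which a small sketch can make concrete. Both servers receive the same kind of pair (cookie ID, referrer URL), but only one of them holds a table that maps the ID to a person. The profile data, IDs, URLs, and the function name `describe_visit` are made up for this example.

```python
from typing import Dict


def describe_visit(cookie_id: str, referer: str, profiles: Dict[str, dict]) -> str:
    """Describe what a server can record, given the accounts it knows about."""
    profile = profiles.get(cookie_id)
    if profile is None:
        # No account behind the ID: the visit is tied to a number only.
        return f"visitor {cookie_id} viewed {referer}"
    # An account exists: the visit is tied to a named person and their details.
    return f"{profile['name']} ({profile['city']}) viewed {referer}"


page = "http://health.example/depression-treatment"

# An ad server, in this simplified picture, has no account table at all.
ad_server_view = describe_visit("1234567890", page, profiles={})

# A social network's cookie ID belongs to a logged-in account (placeholder data).
social_profiles = {"1234567890": {"name": "Example User", "city": "Example City"}}
social_network_view = describe_visit("1234567890", page, profiles=social_profiles)

print(ad_server_view)
print(social_network_view)
```

The request on the wire is the same in both cases; what changes is how much the receiver already knows about whoever owns the ID.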
So I explained this whole scenario to this Wall Street Journal reporter, and he said: “Well, that’s kind of interesting, but how big of a problem is this really? I mean, can you quantify how much of a browsing history they are really getting?” And I said: “Hmm… That’s a really good question, good luck finding out the answer.” But he was a reporter, so he kept asking me over and over. And finally I figured I would answer the question in a way that Google would answer the question – by writing a web crawler to figure out the prevalence of all these tracking companies.
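The core of such a measurement can be sketched without any network access: given the HTML of a set of pages, list the third-party hosts each page embeds and count on how many pages each host appears. This is a simplified illustration, not the crawler from the talk. The class and function names, the choice of tags, and the sample pages are assumptions, and hosts are compared by exact match, so a site's own subdomains would be counted as third parties.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, Set
from urllib.parse import urlparse


class ResourceCollector(HTMLParser):
    """Collect the src URLs of embedded scripts, iframes, and images."""

    def __init__(self) -> None:
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "iframe", "img"):
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def third_party_hosts(page_url: str, html: str) -> Set[str]:
    """Return the hosts embedded in a page that differ from the page's own host."""
    first_party = urlparse(page_url).hostname
    collector = ResourceCollector()
    collector.feed(html)
    hosts = set()
    for url in collector.urls:
        host = urlparse(url).hostname  # None for relative URLs, which are first-party
        if host and host != first_party:
            hosts.add(host)
    return hosts


def prevalence(pages: Dict[str, str]) -> Counter:
    """Count, for each third-party host, the number of pages that embed it."""
    counts = Counter()
    for page_url, html in pages.items():
        counts.update(third_party_hosts(page_url, html))
    return counts


# Two made-up pages standing in for crawled content.
pages = {
    "http://news.example/story":
        '<script src="http://social.example/widget.js"></script><img src="/logo.png">',
    "http://health.example/depression":
        '<iframe src="http://social.example/like"></iframe>'
        '<script src="http://ads.example/ad.js"></script>',
}
counts = prevalence(pages)
```

A real crawler would also have to fetch the pages and account for resources that scripts load at run time, which static HTML parsing like this does not see.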
1 – 404, or Not Found error page, is an HTTP standard response code indicating that the client was able to communicate with the server, but the server could not find what was requested.