In this section, Brian Kennish provides stats obtained by crawling popular sites to show the scope of ongoing personal data collection by big data aggregators.
Our goals with this crawler were to get a list of the most popular sites on the web, and then to go to each of those and crawl them to a link depth of one. So the way a search engine crawler normally works is it crawls at least to a link depth of three, which means they go to a home page, get all the links on that page – 1; then go to all those pages, get all the links on those pages – 2; and then go to all those pages and get all the links on those pages – 3.
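To make the idea of link depth concrete, here is a minimal sketch of a depth-limited crawl. It is not the crawler used for this study: the `crawl` function, the `toy_links` helper and the in-memory `TOY_SITE` link graph are all hypothetical and stand in for real page fetches.

```python
from collections import deque

def crawl(start, get_links, max_depth):
    """Breadth-first crawl: return {page: depth} for pages within max_depth links of start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in depths:  # visit each page only once
                depths[link] = depths[page] + 1
                queue.append(link)
    return depths

# Toy link graph (an assumption for illustration, not real crawl data).
TOY_SITE = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "home"],
    "c": ["d"],
    "d": [],
}

def toy_links(page):
    """Stand-in for fetching a page and extracting its links."""
    return TOY_SITE.get(page, [])

depth_one = crawl("home", toy_links, 1)    # home page plus the pages it links to
depth_three = crawl("home", toy_links, 3)  # what a deeper, search-engine-style crawl reaches
```

With a depth of one, only the home page and the pages it links to directly are visited; raising `max_depth` to three pulls in pages two and three links away as well.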
– indexed 1,000 sites
– analyzed 201,358 pages
– identified 6,926 third parties
And so for all those pages that we got, we were going to extract the third-party domain names from all the resources on those pages that sent these HTTP requests with the referrer URLs. I ran this thing over the course of a week. I decided I was gonna run it out of Starbucks, just for fun. And after a week, we ended up indexing the 1,000 most popular sites. We analyzed just over 200,000 pages. And on these 1,000 sites, we identified nearly 7,000 different third parties.
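As a rough illustration of the extraction step, the sketch below picks out the third-party domains among the resources a page loads. It rests on stated assumptions: `site_of` treats the last two labels of a host name as the site, which is a crude shortcut (a real tool would need something like the Public Suffix List to handle domains such as example.co.uk), and the function names and example URLs are made up for illustration.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Crude site key: the last two labels of the host name (an assumption, see above)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def third_parties(page_url, resource_urls):
    """Return the set of sites among resource_urls that differ from the page's own site."""
    first_party = site_of(page_url)
    found = set()
    for url in resource_urls:
        site = site_of(url)
        if site and site != first_party:  # skip relative URLs and first-party hosts
            found.add(site)
    return found

# Hypothetical page and the resources it loads.
page = "http://www.example.com/news"
resources = [
    "http://static.example.com/style.css",       # same site, different subdomain
    "http://ajax.googleapis.com/lib.js",         # third party
    "http://www.facebook.com/plugins/like.php",  # third party
    "/local.js",                                 # relative URL, first party
]
parties = third_parties(page, resources)
```

Each request for one of those third-party resources would normally carry the URL of the page being viewed in its referrer header, which is what lets the third party see that page in our browsing history.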
And the output of this crawler was a really big ugly spreadsheet, so I’ve broken the results out into more viewable chunks here.
So the first set of stats that we are going to look at here are the non-social services: things like advertising, analytics and content services (see image). These are the services that we could presume are anonymous. The first thing I want to point out is how prevalent they are. So the top service here appeared on 23% of the top 1,000 websites. Essentially, they are seeing 23% of our browsing history. So if you think about opening your web browser, going to your browsing history, randomly picking 23% of the pages in there, and then sending them in this case to Googleapis.com – that’s basically what we are already doing.
The next thing I want to point out is how much Google stuff there is. So the top 5 services are all from Google. And the way we did this analysis is that we broke out each service separately. But other researchers have looked at it as an aggregate. For example, they found that some Google service appears on 97 of the top 100 sites – it’s pretty amazing just how prevalent Google stuff is.
And the last thing I want to point out gets at the anonymity issue, which is that most of the services on this list are part of big data companies that also have personal information. So Google obviously has personal information when we log into things like Gmail and Docs, and so forth; Adobe has personal information – they have Photoshop online; Amazon obviously does too. Just under the top 10 on this list is AddThis¹, which was purchased by Microsoft and obviously has personal information.
So at any point, these big data companies can decide to link up their anonymous data sets with their personal data sets. And what that would mean is that not only would your browsing history be tracked going forward, but the past 15+ years of your browsing history could instantly be attached to your name.
And this is not some hypothetical scenario. It’s actually something that happened at Google a couple of years ago. The Wall Street Journal published some leaked documents where Google was debating linking up their personal and anonymous data. So it’s something that could actually be happening already or certainly can happen in the future.
This next set of stats is the social services: everything that does have your name (see image). And you can see that Facebook is hugely prevalent. They are on a third of the top 1,000 websites. The really amazing thing about this number is that at the time we did this analysis, the Facebook ‘Like’ button had just turned 1 year old. So in 1 year, they went from 0% to 33%. Likewise, when we did this analysis, Google, which was on a quarter of the top 1,000 websites, didn’t have the ‘+1’ button yet.
These stats are really just directional; they are probably going to increase hugely over the next year or so. And the stat I was probably the most surprised about was Twitter: their social widgets were younger than Facebook’s, and they were already on a fifth of the top 1,000 websites. So these guys are getting a huge chunk of our browsing history with our names.
In summary, we identified 350 different services that get at least 1% of our browsing history. We identified 33 that get at least 5%, and 16 that get at least 10%.
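For readers who want to see how percentages like these can be derived, here is a minimal sketch that turns per-site crawl results into prevalence figures and threshold counts. It is an illustration under stated assumptions, not the analysis behind the numbers above: the `TOY_CRAWL` data, the service names and both function names are hypothetical.

```python
from collections import Counter

def prevalence(site_to_services):
    """Return {service: share of sites on which the service appears}."""
    total = len(site_to_services)
    counts = Counter(
        service
        for services in site_to_services.values()
        for service in set(services)  # count each service once per site
    )
    return {service: n / total for service, n in counts.items()}

def count_at_least(shares, threshold):
    """Count the services whose share of sites is at least threshold."""
    return sum(1 for share in shares.values() if share >= threshold)

# Made-up crawl output: each site mapped to the third parties seen on it.
TOY_CRAWL = {
    "site-a.example": {"svc-x", "svc-y"},
    "site-b.example": {"svc-x"},
    "site-c.example": {"svc-x", "svc-z"},
    "site-d.example": {"svc-y"},
}

shares = prevalence(TOY_CRAWL)
at_least_half = count_at_least(shares, 0.5)
```

In this toy data, `svc-x` appears on three of the four sites, so its share is 0.75. The same counting applied to the 1,000 crawled sites is the kind of calculation that yields a statement like "appeared on 23% of the top 1,000 websites".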
The second thing that we wanted to address with this tool is the problem that I mentioned earlier, which is that, while we can see where our data is going (to Facebook or Google, say), they don’t really do a good job of telling us what they are doing with our data.
So we’ve teamed up with Mozilla to work on the ‘Icon’ project, where we can turn every website into a set of ‘privacy icons’ that make it easy to identify what they are doing once they actually get our data. If we go back and look at this Yahoo page here (see screenshot), you can see we have these 4 icons representing whether Yahoo is selling our data, how readily they turn it over to authorities, and how long they keep that data.
1 – AddThis is a widely used social bookmarking service that can be integrated into a website with the use of a web widget. Once added, visitors to the website can bookmark an item using a variety of services, such as Facebook, MySpace, Google Bookmarks, Pinterest, and Twitter.