How Our Browsing History Is Leaking into the Cloud 2


In this section, Brian Kennish presents stats obtained by crawling popular sites, showing the scope of ongoing personal data collection by big data aggregators.

A reverse-tracking spider

Our goal with this crawler was to get a list of the most popular sites on the web, and then to crawl each of them to a link depth of one. A search engine crawler normally crawls to a link depth of at least three: it goes to a home page and gets all the links on that page (depth 1); then it goes to all those pages and gets all the links on them (depth 2); and then it goes to all of those pages and gets all the links on them (depth 3).

Crawl stats:

– indexed 1,000 sites

– analyzed 201,358 pages

– identified 6,926 third parties

But since I no longer worked at Google and didn’t have access to a million computers or unlimited bandwidth anymore, we figured that, for the sake of this experiment, it would be enough to get a small sample and crawl to a link depth of one.
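To make the idea of link depth concrete, here is a minimal sketch of a depth-limited crawl. It is not the crawler used in the talk: the `crawl` function, the `toy_links` helper, and the `TOY_WEB` link graph are all hypothetical, and the sketch walks an in-memory graph instead of fetching real pages.

```python
from collections import deque


def crawl(start, get_links, max_depth):
    """Breadth-first crawl from `start`, following links at most `max_depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # this page is at the depth limit, so do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


# Hypothetical toy link graph standing in for real pages.
TOY_WEB = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
}


def toy_links(url):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_WEB.get(url, [])


depth_one = crawl("home", toy_links, 1)    # home page plus the pages it links to
depth_three = crawl("home", toy_links, 3)  # what a deeper crawl would reach
```

With a real fetcher in place of `toy_links`, the number of pages visited grows quickly with each extra level, which is why a depth of one keeps the experiment small.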

And so, for all the pages that we got, we were going to extract the third-party domain names from all the resources on those pages that triggered HTTP requests carrying the referrer URL. I ran this thing over the course of a week. I decided I was gonna run it out of Starbucks, just for fun. And after a week, we ended up indexing the 1,000 most popular sites. We analyzed just over 200,000 pages. And on these 1,000 sites, we identified nearly 7,000 different third parties.
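The extraction step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the talk: `base_domain` and `third_parties` are hypothetical helpers, and the "last two labels" rule is a naive stand-in for a proper registrable-domain lookup (it would mishandle suffixes such as `co.uk`).

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Naive base domain: the last two labels of the hostname (an assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def third_parties(page_url, resource_urls):
    """Return the base domains of resources not served by the page's own domain."""
    first_party = base_domain(page_url)
    found = set()
    for resource in resource_urls:
        domain = base_domain(resource)
        # Relative URLs have no hostname and belong to the first party.
        if domain and domain != first_party:
            found.add(domain)
    return found


# Hypothetical page and the resources it loads.
page = "http://www.example.com/news"
resources = [
    "http://static.example.com/app.js",
    "http://ajax.googleapis.com/ajax/libs/jquery.js",
    "http://www.facebook.com/plugins/like.php",
    "/local/style.css",
]
parties = third_parties(page, resources)
```

Each domain in `parties` receives a request from the visitor's browser, typically with the page address in the referrer header, which is how it learns that the page was visited.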

And the output of this crawler was a really big ugly spreadsheet, so I’ve broken the results out into more viewable chunks here.

Crawl stats by non-social services


So the first set of stats that we are going to look at here are the non-social services: things like advertising, analytics and content services (see image). These are the services that we could presume are anonymous. The first thing I want to point out is how prevalent they are. So the top service here appeared on 23% of the top 1,000 websites. Essentially, they are seeing 23% of our browsing history. So if you think about opening your web browser, going to your browsing history, randomly picking 23% of the pages in there, and then sending them, in this case, to googleapis.com, that’s basically what we are already doing.
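A prevalence figure like the 23% above is just the share of crawled sites on which a given third party shows up. A minimal sketch of that arithmetic, with a hypothetical `prevalence` function and made-up sample data:

```python
from collections import Counter


def prevalence(site_to_parties):
    """Map each third party to the fraction of sites on which it appears."""
    counts = Counter()
    for parties in site_to_parties.values():
        counts.update(set(parties))  # count each third party once per site
    total = len(site_to_parties)
    return {party: n / total for party, n in counts.items()}


# Hypothetical crawl result: site -> third-party domains seen on it.
SITES = {
    "news.example": {"googleapis.com", "facebook.com"},
    "shop.example": {"googleapis.com"},
    "blog.example": {"googleapis.com", "twitter.com"},
    "wiki.example": set(),
}
shares = prevalence(SITES)  # googleapis.com -> 0.75 in this toy data
```

Here a service counts once per site no matter how many of that site's pages include it; counting per page instead would be an equally reasonable choice and would give different numbers.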

The next thing I want to point out is how much Google stuff there is. So the top 5 services are all from Google. And the way we did this analysis is that we broke out each service separately. But other researchers have looked at it in aggregate. For example, they found that at least one Google service appears on 97 of the top 100 sites. It’s pretty amazing just how prevalent Google stuff is.

And the last thing I want to point out gets at the anonymity issue, which is that most of the services on this list are part of big data companies that also have personal information. So Google obviously has personal information when we log into things like Gmail and Docs, and so forth; Adobe has personal information, since they have Photoshop online; Amazon obviously. Just under the top 10 on this list is AddThis[1], which was purchased by Microsoft and obviously has personal information.

So at any point, these big data companies can decide to link up their anonymous data sets with their personal data sets. And what that would mean is that not only is your browsing history being tracked going forward, but all of the past 15+ years of your browsing history could instantly be attached to your name.

And this is not some hypothetical scenario. It’s actually something that happened at Google a couple of years ago. The Wall Street Journal published some leaked documents where Google was debating linking up their personal and anonymous data. So it’s something that could actually be happening already or certainly can happen in the future.

Prevalence of social services


This next set of stats is the social services: everything that does have your name (see image). And you can see that Facebook is hugely prevalent. They are on a third of the top 1,000 websites. The really amazing thing about this number is that at the time we did this analysis, the Facebook ‘Like’ button had just turned 1 year old. So in 1 year, they went from 0% to 33%. Likewise, when we did this analysis, Google, which was on a quarter of the top 1,000 websites, didn’t have the ‘+1’ button yet.

These stats are really just directional; they are probably going to increase hugely over the next year or so. And the stat I was probably the most surprised about was Twitter: their social widgets were younger than Facebook’s, and they were already on a fifth of the top 1,000 websites. So these guys are getting a huge chunk of our browsing history with our names. In summary, we identified 350 different services that get at least 1% of our browsing history. We identified 33 that get at least 5%, and 16 that get at least 10%.
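Summary counts of this kind come from bucketing services by their share of sites. A small sketch of the tally, using a hypothetical `count_at_least` function and invented shares (not the crawl's real data):

```python
def count_at_least(shares, thresholds):
    """For each threshold, count the services whose share meets or exceeds it."""
    return {t: sum(1 for s in shares.values() if s >= t) for t in thresholds}


# Hypothetical shares: service -> fraction of top sites it appears on.
SAMPLE_SHARES = {
    "a.example": 0.23,
    "b.example": 0.12,
    "c.example": 0.07,
    "d.example": 0.02,
    "e.example": 0.004,
}
buckets = count_at_least(SAMPLE_SHARES, (0.01, 0.05, 0.10))
```

The buckets are cumulative, so a service that clears the 10% bar is also counted among those at 5% and 1%, just as in the figures quoted in the talk.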

Tracking the trackers

Wall Street Journal article on user tracking


Now, this data ended up getting published in this Wall Street Journal article (see screenshot). Some longer-tail data got published in a CNN article. But like I said, this data was directional, and we wanted people to be able to see what was happening on an ongoing basis. So we’ve created this tool that we are putting out today, and the URL is db.disconnect.me.

Disconnect DB stats about data harvesting on major sites


We are trying to accomplish 2 things with this tool. First, all of those sets of stats that I just went over quickly were thrown into this tool. So we have a set of automated stats about all the top websites, and we have a list of them in here (see screenshot). You can drill down and look at specific data on any site. For example, Yahoo has 75 unique third parties on their site. When you go to an average Yahoo page, there are almost 5 different third parties on the page, which means that not only are you sending your browsing history to yahoo.com, which is where you are, but your browsing history is also going to 5 other places.
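The two per-site figures mentioned for Yahoo, the number of unique third parties across the site and the average number per page, can be derived from per-page data as in this sketch. The `site_report` function and the sample pages are hypothetical and are not taken from the Disconnect tool.

```python
def site_report(pages):
    """Summarize third parties for one site, given page -> set of third parties."""
    unique = set().union(*pages.values())  # every third party seen anywhere on the site
    total = sum(len(parties) for parties in pages.values())
    average = total / len(pages) if pages else 0.0
    return {
        "unique_third_parties": len(unique),
        "avg_third_parties_per_page": average,
    }


# Hypothetical crawl of three pages on one site.
PAGES = {
    "/": {"a.com", "b.com"},
    "/mail": {"a.com"},
    "/news": {"a.com", "c.com", "d.com"},
}
report = site_report(PAGES)
```

The unique count is usually much larger than the per-page average because different pages pull in different third parties, which matches the gap between 75 and roughly 5 in the Yahoo example.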

The second thing that we wanted to address with this tool is the problem that I mentioned earlier: while we can see where our data is going, whether to Facebook or to Google, they don’t really do a good job of telling us what they are doing with our data.

Report on Yahoo.com at Disconnect DB


So we’ve teamed up with Mozilla to work on the ‘Icon’ project, where we can turn every website into a set of ‘privacy icons’ that make it easy to identify what they are doing once they actually get our data. If we go back and look at this Yahoo page here (see screenshot), you can see we have these 4 icons representing whether Yahoo is selling our data, how readily they turn it over to authorities, and how long they keep that data.

And this is actually a crowdsourced project. We have a wiki-based platform here. You can go read the privacy policy of any of the sites that we have in here, and then set the icons according to what they are doing. We already have a JSON[2] API, so we are hoping to make this widely available to other tools beyond our own.


[1] AddThis is a widely used social bookmarking service that can be integrated into a website with the use of a web widget. Once added, visitors to the website can bookmark an item using a variety of services, such as Facebook, MySpace, Google Bookmarks, Pinterest, and Twitter.

[2] JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is based on a subset of the JavaScript programming language.
