Data Mining a Mountain of Zero Day Vulnerabilities

Black Hat Europe 2012 conference guest Chris Wysopal, the CTO and Co-founder of Veracode, presents his research on the prevalent and potentially exploitable web application vulnerabilities found in the large data set processed by his company.

I’m Chris Wysopal, CTO and Co-founder of Veracode, and today I am going to talk about application security and bring a lot of data to bear on what is actually going on with software developers all over the world.

The data set used in the research

The data set that I have comes from the Veracode service. We’ve scanned almost 10,000 applications over the last 18 months, and the data comes from our static analysis of the software that our customers sent to us to test.

The definition of an application is pretty vague. I’ve seen a lot of arguments about what a web application is, because it’s certainly not a web site: there could be lots of web applications running behind one domain. It’s easier when it’s a package, a piece of software you install, but in general it varies hugely.

We’ve looked at applications as small as 100 KB, like little mobile apps or little web apps, and at apps as large as 6 GB. You can imagine that something with that much code definitely has at least one vulnerability in it.

Now, the software we looked at was both software in production and software in pre-production, tested before release (sometimes that happens, and hopefully it happens). It came from lots of different sources:
• Internally developed software, for enterprises building their own software;
• Open source that our customers want to use and whose vulnerabilities they want to understand;
• Vendor code, code that they’re purchasing. They’re looking at the vulnerabilities in that code before they actually pay the vendor for it;
• And, finally, we’ve been looking a little bit at outsourced code, but it’s a pretty small sample size.

In a few places I talk about outsourced code, but in general outsourcers’ contracts prohibit their customers from doing any kind of security acceptance testing, because these contracts were written 5 to 7 years ago. I think that is going to change in the future, but right now a lot of outsourced code just has to be accepted by the customer, and if the customer finds vulnerabilities, the outsourcer says: “Oh, well, you need to pay us to fix them, because that’s a re-work cost”. So, I have a little bit of data on that.

Collected data by types

These are the types of data we collect (see image). Over here we have the application metadata, some information about the application: what industry it is for, who supplied the software, and who built it. We have some data on the type of application, especially for commercial vendor code, because from the name of the application and the company we can just look up what the software does; that’s not so easy to do for internally developed code. And of course we know things like language and platform. We also have our customers specify an assurance level based on their own risk scoring: do they see this as a critical or high-risk app, or as a medium or low-risk app?

Of course we have scan data, the data that we generate when we scan an application: when it was scanned, how many times it has been scanned, and what the findings were.

And then we have metrics, which is really what I want to talk about: counts, percentages, time between scans, and days to remediation. We also track whether the application complies with industry standards like the OWASP Top 10 or the CWE/SANS Top 25. If you put a policy on that and said, “The app has to have no OWASP Top 10 vulnerabilities in it,” did the app comply or not?
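As a rough illustration of how metrics like these could be derived from scan findings, here is a minimal Python sketch. The `Finding` record, the `BANNED_CATEGORIES` set, and both helper functions are hypothetical names invented for this example. They are not Veracode's data model, and the category set is only a stand-in for a real policy list such as the OWASP Top 10. The sketch also assumes that a flaw that has been fixed no longer counts against the policy.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Set

# Hypothetical stand-in for a policy list; not the actual OWASP Top 10.
BANNED_CATEGORIES = {"SQL Injection", "Cross-Site Scripting"}


@dataclass
class Finding:
    """One flaw reported by a scan (illustrative fields only)."""
    category: str
    found_on: date
    fixed_on: Optional[date] = None  # None means the flaw is still open


def days_to_remediation(finding: Finding) -> Optional[int]:
    """Days between detection and fix, or None if not fixed yet."""
    if finding.fixed_on is None:
        return None
    return (finding.fixed_on - finding.found_on).days


def complies(findings: Iterable[Finding], banned: Set[str]) -> bool:
    """Assumed policy: pass only if no open finding is in a banned category."""
    return not any(
        f.category in banned and f.fixed_on is None for f in findings
    )


findings = [
    Finding("SQL Injection", date(2012, 1, 1), date(2012, 1, 31)),
    Finding("Information Leakage", date(2012, 1, 10)),
]
print(complies(findings, BANNED_CATEGORIES))
print(days_to_remediation(findings[0]))
```

With this made-up input the application passes the policy, because the only banned flaw was fixed after 30 days and the open information leakage flaw is not on the banned list.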

Tested applications by supplier type and language family

So, this is the data set (see image). You can see, most of the code we look at is internally developed code, this is enterprises sending us the code that they are building and they are going to operate, to make sure their developers did a good job.

And then, pretty significant is commercial code: this is code that’s being purchased by an enterprise, and during the purchasing process they’re going to have us test it. There’s also some open source, and then, as I said before, the outsourced code, which is a pretty small slice.

This gives you a good idea of what kind of software enterprises are running and what languages it was written in. You can see Java is by far the most popular, followed by .NET, C/C++, PHP, and ColdFusion. Surprisingly, there’s a lot of ColdFusion development still out there: 2% is about 200 applications, which is a significant number.

Some of these platforms are pretty new to us. Android and iOS we just introduced last year, so we don’t have a lot of data on Android and iOS; even less on iOS, I have a little more data on Android apps that I’ll show you.

The Latent Vulnerabilities vs. the Attacks

A lot of organizations do incident response or managed security services, so they have data about the attacks that are happening. The Verizon data breach report and the report Trustwave now publishes talk about what they saw when doing incident response: what the attack vector was, what was attacked, and how.

So, what we are doing here is we are talking about the latent vulnerabilities that are out there in the software, so it’s a very different view of the software landscape. It’s not what’s getting exploited; it’s what could be exploited.

Patterns of prevalent vulnerabilities being used in actual attacks

I did a comparison here of the top 5 most attacked web vulnerabilities (see image). I found that the Web Hacking Incident Database, from the Web Application Security Consortium, had the best data, because it actually breaks the attacks down by type. A lot of the other reports just say it was a hack: it wasn’t a known vulnerability, it was a hack. That category covers cases where, say, someone found a SQL injection, but they don’t break it down any further, so it’s not very useful. It would be great if Verizon or Trustwave could break it down, because then we could better understand what classes of vulnerability are being attacked at the software layer.

The Web Hacking Incident Database had the best data out there, so here are the top 5 attacks for web apps. By far, number 1 was SQL injection: the database shows that 20% of attacks on websites are SQL injections. In orange is the percentage of web applications we saw that are affected: of our set of roughly 10,000 applications, 32% had one or more SQL injection vulnerabilities in them. So it’s obviously a popular attack and it’s in a lot of applications, which makes sense.

But then you look at something like cross-site scripting (XSS). We find cross-site scripting in 68% of the applications that we look at, but only 10% of the web attacks have cross-site scripting involved. What you can gather from that is that even though there are a lot of cross-site scripting vulnerabilities out there, they’re not that useful to attackers. They’re not as impactful as, say, SQL injection. It’s a vulnerability that is sort of under-attacked, and there’s lots of it out there. So, if you put your black hat on you could say: “Developers aren’t so concerned about cross-site scripting, so maybe I should figure out ways to better leverage and get better impact out of the cross-site scripting vulnerability.”

The next category was information leakage: we found it in 66% of apps, and it is involved in only 3% of hacks. So, again, that’s something that’s very common, but I guess attackers don’t find information leakage that valuable to attack with.

Cryptographic issues are another category we find a lot of, in 53% of applications, and only 2% of attacks involve cryptographic issues. I look at that, and if I put my black hat on I think: “Well, that’s probably because people don’t know how to exploit these issues very well, and if there were better techniques developed around finding and exploiting cryptographic issues, it would probably be a bigger problem.” The cryptographic issues category is kind of a bundle of things: SSL not being implemented properly; not doing certificate checks; allowing fallback to weaker ciphers; poor random number generation for security-critical values; things like that.
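To make a few items in that bundle concrete, here is a small Python sketch of the safer counterparts, using only the standard library. It is an illustration, not something shown in the talk: the 16-byte token length and the TLS 1.2 floor are assumptions chosen for the example, and setting a minimum protocol version is only one related guard against weak fallbacks, not a full cipher policy.

```python
import secrets
import ssl

# Security-critical values (session IDs, reset tokens) should come from a
# cryptographically secure source; the `random` module is not designed for that.
session_token = secrets.token_hex(16)  # 16 random bytes as 32 hex characters

# The default client context verifies the certificate chain and the hostname.
context = ssl.create_default_context()

# Refuse to negotiate protocol versions older than TLS 1.2 (an assumed floor).
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(len(session_token))
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

The flaws described in the talk are roughly the opposite choices: switching certificate or hostname checks off, accepting older protocol versions or ciphers, or generating tokens with a general-purpose random number generator.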

And then finally command injection: we found it in only 9% of applications, yet it was used in 1% of the attacks. So again, this is probably a pretty impactful attack, because a fairly high proportion of the latent vulnerabilities out there are being used in attacks, compared with what we’re seeing for information leakage and crypto, where only a tiny fraction is being used.
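The comparison running through this section can be reduced to one number per category: the share of attacks divided by the share of applications affected. The Python sketch below does that arithmetic with the percentages quoted in the talk. The ratio itself is an assumption of this example, not a metric defined by the speaker, and since the two percentages come from different data sets it is only a rough indicator of how heavily a vulnerability class is attacked relative to how common it is.

```python
# (percent of applications affected, percent of attacks), as quoted in the talk
stats = {
    "SQL injection": (32, 20),
    "Cross-site scripting": (68, 10),
    "Information leakage": (66, 3),
    "Cryptographic issues": (53, 2),
    "Command injection": (9, 1),
}


def attack_to_prevalence(data):
    """Ratio of attack share to prevalence for each vulnerability category."""
    return {
        name: attacks / prevalence
        for name, (prevalence, attacks) in data.items()
    }


ratios = attack_to_prevalence(stats)

# Highest ratio first: the categories attacked most relative to how common they are.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
for name, ratio in ranked:
    print(f"{name:<22} {ratio:.3f}")
```

SQL injection comes out far ahead at about 0.63, cross-site scripting at about 0.15, and command injection at about 0.11, while information leakage and cryptographic issues sit near 0.05 and 0.04. That matches the point above: command injection is rare but used in attacks proportionally more often than information leakage or crypto flaws.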
