Focusing entirely on the subject of crawling P2P botnets, Tillmann Werner explicates the motivations for this process as well as applicable strategies.

Let's talk about crawling. Crawling is nothing else but recursively enumerating peers: you start with one peer, you request its peerlist, you look at the response, and you do the same for all the returned addresses, and so on, until you want to go offline. You really want to think about a crawling strategy, and one important consideration is crawling speed. Ideally, we would be able to take a snapshot of the current peer-to-peer graph and then enumerate the peers in that snapshot, but that's not possible. First off, because you have to do it actively: you have to send out requests and process the responses, and that takes time. And while you are doing that, the structure of the graph might be changing: peers might go offline, new peers might come online. So you will never be able to get that snapshot, but to come closest to it you want to be as quick as possible.
And when you do that you have to think about things like unresponsive peers: what if somebody sends you an IP address back that’s offline? How do you deal with that? Do you want to keep it in the list and try again later? I mean, you don’t know why it’s unresponsive, you might lose packets, the network might be overwhelmed with your traffic because you try to be as fast as possible. You don’t know why it’s unresponsive, so you might want to keep it in the list and try again later, but you can see it’s getting a little bit more complex.
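The loop described above – request a peerlist, enqueue new addresses, and retry unresponsive peers a few times before giving up – can be sketched roughly like this. The `request_peerlist` callback is a hypothetical stand-in for the botnet-specific protocol exchange, not a real API:

```python
import collections

def crawl(seed_peers, request_peerlist, max_retries=3):
    """Breadth-first enumeration of a P2P network.

    request_peerlist(addr) is a placeholder for the protocol-specific
    exchange; it returns a list of peer addresses, or None if the
    peer did not respond.
    """
    frontier = collections.deque(seed_peers)
    known = set(seed_peers)          # every address we have learned
    responsive = set()               # peers that actually answered
    retries = collections.Counter()  # failed contact attempts per peer

    while frontier:
        peer = frontier.popleft()
        peers = request_peerlist(peer)
        if peers is None:
            # Unresponsive: could be offline, behind NAT, or we lost
            # packets. Keep it and retry a few times before giving up.
            retries[peer] += 1
            if retries[peer] < max_retries:
                frontier.append(peer)
            continue
        responsive.add(peer)
        for addr in peers:
            if addr not in known:
                known.add(addr)
                frontier.append(addr)
    return known, responsive
```

The split between `known` and `responsive` mirrors the red and green lines discussed below: every address ever learned versus the subset you can actually talk to.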
What you see in the top right corner (see image above) is the result of us crawling P2P Zeus, which is also known as Gameover, by the way. The red line shows the number of IP addresses that we learned; we call them known peers. But most of them are not actually reachable, although the protocol is pretty robust and doesn't include any invalid IP addresses. So, if you count only the peers that you can talk to, you end up with the green line, and you can see it's way lower. The little dips in the red line are there because for P2P Zeus we chose a strategy where we cleaned up the list of known peers from time to time, saying: "Okay, these have been unresponsive for too long now, let's kick them out to keep the list small." Otherwise you'd have an endlessly growing list. What you can also see is that the green line converges very quickly, and that means you have probably reached the number of peers you are able to crawl, which gives you a size estimation.

You might wonder why anybody wants to crawl P2P botnets at all. It's interesting to play with them, to understand the protocol and re-implement it, and then play with the botnet and maybe snoop on what they are doing. But we usually have other goals. Reconnaissance is usually the foremost one. Why would you want to learn something about peer-to-peer botnets and the infected machines? I've already mentioned size estimation. If you talk to the press, they really like high numbers, so if you tell them that ZeroAccess is 10 million infected machines large, they will love that – but next time you'll have to tell them the botnet is 15 million infected machines large, or something.
So, yes, size estimation is one thing, but you have to be aware that you can only crawl a subset of the infected machines. Most of them, obviously, are behind NAT, behind gateways. You can't talk to them directly, you can't reach them from the Internet, but they're still part of the P2P botnet; they are like leaf nodes in the graph. So, it's not trivial. If you do what we did with P2P Zeus and end up with this green line, the number of machines you can talk to, you have to extrapolate from that number to get a more realistic size estimate.
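As a toy illustration of that extrapolation: if you assume that only a certain fraction of infections is directly reachable, the scale-up is a one-liner. The 20% figure below is made up for the example; in practice you would have to estimate that fraction by other means, e.g. from unique IDs seen in returned peerlists that you can never contact:

```python
def estimate_total(reachable, reachable_fraction):
    """Naive extrapolation: if only a fraction of bots is directly
    reachable (the rest sit behind NAT or gateways), scale up the
    crawled count. reachable_fraction is an assumption that must be
    estimated separately; it is not something the crawl gives you."""
    return int(reachable / reachable_fraction)

# Hypothetical numbers: 40,000 reachable peers, and an assumed
# 20% of all infections being directly reachable.
estimate_total(40000, 0.2)  # -> 200000
```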
Infection tracking is something done by the people who want to remediate or kill these botnets. They want to learn about infected machines and can then report the IP addresses to, let's say, ISPs, who pass the information on to their customers, who then hopefully clean up the machines so that the botnet dies off – but I've never really seen that be successful.
Geographic distribution is something you can also get from this. If you have all the IP addresses, you can do geolocation lookups and, if you want, plot them on a map as we did here (see left-hand image above). I want to mention Mark Schlosser and some other guys who created the code we based this on. This is actually a live thing: we send it a live feed of the crawling results and it displays these little red dots.
Okay, what we are usually after is attacking peer-to-peer botnets. If you know all the nodes, you might try to send them commands yourself – sometimes interesting commands like an uninstall command. If you send an uninstall command to all the bots you've identified – and they are the ones you can talk to, the backbone of the whole graph, so to speak – then you can kill the botnet entirely. Or, if you can send requests for more information about the infected machines, you can for example learn the operating system version or other details, which is usually interesting as well.
You can also manipulate the peer-to-peer infrastructure. Think about it: if you can generate your own peerlists and then propagate them into the peer-to-peer network, you can basically tamper with that infrastructure, and we will talk about that in a little bit. Ideally, you might be able to sinkhole the whole thing by replacing all the legitimate entries in the peerlists with your own, so that all peers talk to your machine, which means that nobody else has control over them anymore.

If you think about crawling strategies, you might ask yourself: "Do I want to implement a depth-first search (DFS) or a breadth-first search (BFS)?" It doesn't really matter, at least that's what we think. First off, it's not a tree, it's a graph – you can still distinguish the two strategies, but it doesn't really matter because the graph is dynamic. It's changing all the time anyway, so it doesn't matter which nodes you start with and which nodes you continue with. At some point, if you are quick enough, you will hopefully be able to learn the biggest part of the reachable machines.
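The peerlist-replacement idea mentioned above could be sketched like this; the message format, field layout, and size limit are entirely botnet-specific, so everything here is a hypothetical placeholder:

```python
def build_poisoned_reply(sinkholes, max_entries):
    """Fill a peerlist reply exclusively with sinkhole addresses.

    If a bot accepts and stores these entries, its future peerlist
    exchanges go only to machines we control. max_entries mimics the
    (hypothetical) maximum peerlist size of the protocol.
    """
    # Cycle through the sinkhole addresses until the list is full.
    return (sinkholes * max_entries)[:max_entries]
```

Repeating a handful of sinkhole addresses is enough here because the goal is not diversity but displacement: every legitimate entry a bot drops in favor of one of ours shrinks the botmaster's reach.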
If you track the infected machines, you need to be able to distinguish: "Have I seen that IP address, or that peer, before? Do I want to include it in my lists?" If you rely on IP addresses only, that's a bit of a problem because, as I've already mentioned, there's a lot of IP churn – IP addresses that change after 24 hours, for example. If you crawl a peer and its IP address then changes, you contact it again and count it twice; you want to avoid that, otherwise you get skewed numbers.
Some P2P protocols are nice and implement unique IDs, especially ones that implement overlay networks, because you need them for routing. If you have those, you can get more accurate numbers. When you're done with the crawling and the curve converges because you don't learn about any new peers anymore, any remaining changes are due to churn.
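Deduplication keyed on such an ID, falling back to the IP address when the protocol has none, might look like this; the 'id' and 'ip' field names are hypothetical stand-ins for whatever the crawler parses out of a peerlist entry:

```python
def peer_key(peer):
    """Prefer the protocol-level unique ID for deduplication; fall
    back to the IP address when the protocol doesn't provide one."""
    if peer.get('id'):
        return ('id', peer['id'])
    return ('ip', peer['ip'])

observations = [
    {'id': 'abc', 'ip': '198.51.100.7'},
    {'id': 'abc', 'ip': '203.0.113.9'},   # same bot after an IP change
    {'id': 'def', 'ip': '192.0.2.5'},
]
unique = {peer_key(p) for p in observations}
len(unique)  # 2 unique bots despite 3 observed IP addresses
```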