Marti Motoyama now compares automated CAPTCHA-solving software with third-party human-based solvers in detail.
Let’s delve further into the challenges facing software solvers. First of all, they require skilled programming labor, and hence can be expensive to develop or purchase. Specialized solvers cost on the order of hundreds or thousands of dollars, whereas for a thousand dollars one can purchase a million CAPTCHA solves done by humans. It’s also difficult for software to achieve very high accuracy. We looked at reCAPTCHA software in our paper, and we observed that it had accuracy below 35%. And if you keep failing at a CAPTCHA, that provides evidence to the web service provider that you may be using an automated solver. On top of that, these CAPTCHAs typically have very short life spans. Microsoft, for example, has aggressively changed their CAPTCHA, and there are various examples of the CAPTCHAs they’ve changed just in the past year.
When we talked with Mr. “E”, he also said that he didn’t feel it was worth it to purchase or develop software solvers: the solvers don’t last, and hence are just not economically worthwhile to pursue. In his own words, it’s just a big waste of time to invest in this particular approach. And this is especially true as CAPTCHAs become increasingly complex.
Thus, starting in roughly late 2006, attackers began using aggregated cheap human labor to solve CAPTCHAs (see left-hand image). To illustrate how attackers typically use human solvers, let’s suppose an attack is being mounted against Twitter. After the individual or the bot program extracts the CAPTCHA, they send it to one of these human-based solving services. In this example we’ll assume that the attacker is using Decaptcher, a website that aggregates customer requests for CAPTCHA solves (see right-hand image). Decaptcher has a corresponding worker backend called PixProfit, a website which aggregates workers from low-cost labor markets to actually perform the CAPTCHA solves.
PixProfit is just going to display that CAPTCHA to some worker in the world; that worker is going to type in the text, and then that solution is going to make its way back to the submitter. Using that solution, the bot can then obtain an account (left-hand image).
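The message flow just described (customer-facing site, worker backend, answer routed back) can be sketched as a small in-memory simulation. This is only a toy model of the roles in the talk: the class names, the round-robin dispatch, and the fake "image" format are assumptions made for illustration, and nothing here reflects the real interfaces of Decaptcher or PixProfit.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "worker" is anything that looks at an image and types back text.
Worker = Callable[[bytes], str]


@dataclass
class WorkerBackend:
    """Hypothetical stand-in for the worker side (PixProfit's role in the talk)."""
    workers: List[Worker]
    next_index: int = 0

    def dispatch(self, image: bytes) -> str:
        # Assumption: images are handed out in simple round-robin order.
        worker = self.workers[self.next_index % len(self.workers)]
        self.next_index += 1
        return worker(image)


@dataclass
class SolvingFrontend:
    """Hypothetical stand-in for the customer side (Decaptcher's role in the talk)."""
    backend: WorkerBackend
    price_per_thousand: float  # dollars charged per 1000 solves
    solves_billed: int = 0

    def solve(self, image: bytes) -> str:
        # Forward the challenge to the backend and bill the customer for one solve.
        answer = self.backend.dispatch(image)
        self.solves_billed += 1
        return answer

    def amount_owed(self) -> float:
        return self.solves_billed * self.price_per_thousand / 1000


def toy_worker(image: bytes) -> str:
    # Toy assumption: the "image" is just its own answer encoded as ASCII,
    # so "reading" it is a decode. A real worker is a person looking at pixels.
    return image.decode("ascii")


backend = WorkerBackend(workers=[toy_worker, toy_worker])
frontend = SolvingFrontend(backend=backend, price_per_thousand=1.0)

solution = frontend.solve(b"x7Kq2")  # the customer submits one challenge
print(solution, frontend.amount_owed())
```

The point of the split into two classes is the one made in the talk: the party that sells solves and the party that recruits workers are separate aggregation points, and the submitter only ever sees the typed answer come back.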
Now I’m going to move on to the meat of the talk, the human-based CAPTCHA solvers. People first began selling human CAPTCHA solving services in roughly 2006. In essence, these services sidestep the underlying assumption of a CAPTCHA that a program alone is attempting to abuse the service and that the individual who’s typing in the alphanumeric characters is actually the person who’s trying to access the resource. The fact that CAPTCHAs need to be easily solvable makes it incredibly easy to aggregate and outsource this task. Many enterprising individuals have taken note of this: businesses selling CAPTCHA solving services are springing up all over the web, driving prices down and, correspondingly, the wages paid to solvers. Shown here (see left-hand image) are example wages paid to workers who solve CAPTCHAs, and that’s in units of CAPTCHA solves.
So, the goal of our work is to develop a clear picture of what human-based CAPTCHA solvers are capable of doing. Really, we wanted to know the answer to the question: just how good are these human-based CAPTCHA solvers? We looked at several metrics: price, availability, response time, accuracy, and capacity. We won’t look at CAPTCHA difficulty in this talk; for an in-depth analysis of CAPTCHA difficulty, I’d advise you to see the Stanford paper that was presented at Oakland in 2010. If the services we look at do well on all of the above, then this really suggests that CAPTCHAs cannot necessarily prevent wide-scale abuse.
In our study we looked at 8 services (see left-hand image) that spanned a number of different price points. The cheapest services were Russian-based and charged $1 for every 1000 CAPTCHA solves. The most expensive was ImageToText, which had a bulk price of $20 per 1000 CAPTCHAs solved. And there’s a variety of reasons why these prices all differ, some of which I’ll get to later in this talk. To characterize these services, however, we needed a corpus of CAPTCHAs to submit.
We periodically downloaded over 25 different CAPTCHAs from such sites as Google, reCAPTCHA, Microsoft, and Yahoo. In the end, we had about 7500 images per site in our corpus (see right-hand image).
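The price points quoted in this part of the talk can be put side by side with a little arithmetic. The sketch below uses only figures stated above ($1 and $20 per 1000 solves, a $1000 budget, and the sub-35% software accuracy); the function names are made up for illustration, and any accuracy passed to `cost_per_correct_solve` for a human service is a hypothetical input, since no human accuracy figure is given in this section.

```python
def solves_for_budget(budget_dollars: float, price_per_thousand: float) -> int:
    """Number of solves a budget buys at a given price per 1000 solves."""
    return round(budget_dollars * 1000 / price_per_thousand)


def expected_attempts_per_success(accuracy: float) -> float:
    """Average number of attempts needed for one correct answer.

    Assumes each attempt succeeds independently with probability `accuracy`.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return 1 / accuracy


def cost_per_correct_solve(price_per_thousand: float, accuracy: float) -> float:
    """Dollars paid per correct answer when every attempt is billed."""
    return price_per_thousand / 1000 * expected_attempts_per_success(accuracy)


# Figures from the talk: $1000 at the cheapest and most expensive price points.
print(solves_for_budget(1000, 1))    # cheapest services, $1 per 1000
print(solves_for_budget(1000, 20))   # ImageToText bulk price, $20 per 1000

# Upper bound on software accuracy quoted for reCAPTCHA: 35%.
print(expected_attempts_per_success(0.35))
```

At the cheapest price point the first call reproduces the "million solves for a thousand dollars" figure from earlier in the talk, and the last call shows that a solver at 35% accuracy needs close to three attempts per success on average, each of which is a failure signal visible to the site.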