In this entry Blase Ur walks us through the first two metrics for the study, namely the relation of password length and meter type, and results by guessability.
Before I jump into our results, I’ll tell you a little bit about our participants. We had 2,931 of them, recruited on Amazon’s Mechanical Turk crowdsourcing service. Our participants skewed male (63%) and also skewed technical: 40% of them said they had a degree or job in IT, computer science, or a related field. Participants ranged from 18 to 74 years old and hailed from 96 different countries, with 42% from India and 32% from the US.
For our results section I’ll go through 5 main metrics, and I’ll just give you some highlights for each of them; there is a lot more in the paper. First I’ll talk about the composition of the passwords participants created, then the guessability of these passwords, then the password creation process, the memorability of the passwords, and finally participants’ sentiment. As I go through, I’ll tell you a little bit more about why we chose these metrics.
So, for the first metric, today I just want to highlight password length (see right-hand image). On the Y axis we have password length, and on the X axis are the different conditions. We’ve color-coded the groups I just presented to you: our controls, the conditions with visual differences, the ones with scoring differences, and then the ones with both visual and scoring differences. The leftmost box over there (see left-hand image) is our no-meter condition, so, no feedback. Our other 14 conditions are those that have meters. What you’ll notice is that conditions with meters generally have longer passwords; in fact, in 13 of these 14 conditions with meters, the passwords created were statistically significantly longer than those created without a meter.
In particular you might see some of the green boxes jumping off a little bit, and those are our half-score and one-third-score conditions, the two stringent meters with visual bars. What we saw from this is that, with a meter, participants created longer passwords. So people are creating slightly different passwords here, but the question of importance to us in the security community is whether they are creating more secure passwords.
Our next metric was the guessability of passwords, that is, how many guesses it would take to crack the password they created. Our threat model was an offline attack. It’s happened all too often in recent years: a database is compromised and, even if the passwords are correctly salted and hashed, someone wants to try and guess them. The brute-force way of going about this would be to maybe start with guessing 8 A’s in a row, followed by 7 A’s and a B. Perhaps a smarter way to do this would be to guess passwords in order of their likelihood: so, maybe, the first guess might be ‘12345678’, the next guess might be ‘password’.
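To make the contrast between the two guessing orders concrete, here is a minimal Python sketch. It is only an illustration, not the cracking algorithm used in the study: the function names are hypothetical, the brute-force order assumes a lowercase-only alphabet and a fixed length, and the likelihood-ordered list contains just the two example guesses mentioned in the talk.

```python
from string import ascii_lowercase


def brute_force_guess_number(password, alphabet=ascii_lowercase):
    """Return the 1-based position of `password` when every string of the
    same length is tried in alphabetical order (aaaaaaaa, aaaaaaab, ...).

    Assumption: every character of `password` is in `alphabet`.
    """
    base = len(alphabet)
    index = 0
    for ch in password:
        # Treat the password as a number written in base len(alphabet).
        index = index * base + alphabet.index(ch)
    return index + 1


def likelihood_guess_number(password, ranked_guesses):
    """Return the 1-based position of `password` in a list of guesses sorted
    from most to least likely, or None if it is never guessed."""
    for n, guess in enumerate(ranked_guesses, start=1):
        if guess == password:
            return n
    return None


# Toy likelihood ordering: only the two example guesses from the talk.
RANKED = ["12345678", "password"]

print(brute_force_guess_number("aaaaaaab"))          # second brute-force guess
print(likelihood_guess_number("password", RANKED))   # second likelihood-ordered guess
print(brute_force_guess_number("password"))          # far later under brute force
```

The number each function returns is the guess number: the count of guesses an attacker following that order would need before hitting the password. A common password gets a tiny guess number under the likelihood ordering and a very large one under brute force.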
It’s the smarter, likelihood-ordered approach that we used to evaluate the strength, or the guessability, of passwords. In particular, we used a cracking algorithm proposed at Oakland three years ago, as implemented in a guess-number calculator (essentially, a giant lookup table) that was presented at this year’s Oakland.
For analysis we looked at what we termed three different adversaries: our Weak Adversary, who makes 500 million guesses; our Medium Adversary, who makes 50 billion guesses; and our Strong Adversary, who makes 5 trillion guesses. To give you some sense of how this number of guesses translates to actual resources, and these will be really rough numbers: if we had about 100 CPU cores running at pretty much full utilization for about 2 weeks, we could make about 5 trillion guesses. That, of course, depends heavily on the hardware, the implementation, and the hashing algorithm.
I’ll present our guessability results with the following graph (see left-hand image); each condition will be a line on the graph. On the X axis we have the number of guesses on a logarithmic scale, and on the Y axis we have the percentage of passwords cracked by that guess number in each condition. Right now I just have no meter up, and so, for instance, our medium adversary would have guessed roughly 35% of the passwords created without a meter. As I introduce the rest of the conditions, lower on the graph is better; lower is more resistant to cracking.
So, first, let me bring in the other control condition. Here is our baseline meter, that’s the red line (see right-hand graph), and you’ll see it’s a bit lower, so seemingly a bit more resistant to cracking. However, this difference is not statistically significant. Now let me bring in all the meters that differed visually from the baseline (left-hand image), and I’ll flash this back and forth a few times. What you see is that they’re not much different from our control conditions.
In particular, none of these conditions was statistically significantly different from either of our controls for any of the three adversaries. What we see is that the visual changes don’t significantly increase resistance to guessing. Let me take these away, and now I’ll bring in the rest of our conditions, those differing in scoring, and those differing both visually and in scoring. Again, I’ll flash it a few times compared to our controls. And what we see here is that these are a little bit lower on the graph; that’s better, that’s more resistant to cracking (see right-hand image).
And in particular we have these two conditions, half-score and one-third-score; those are the stringent meters with visual bars. In those conditions the passwords participants created were statistically significantly more resistant to a guessing attack than those created without a meter. Their lines were actually also lower than baseline, but that difference was not statistically significant. So, what we’re seeing here is that the stringent meters with visual bars are increasing resistance to guessing.
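For readers who want to see how a point on the guessability graph could be computed, here is a short Python sketch. The three guess budgets are the ones given in the talk; everything else (the function names and the example guess numbers) is hypothetical and made up purely for illustration, not data from the study. It also works through the rough "100 cores for 2 weeks" figure as a per-core guessing rate.

```python
# Guess budgets for the three adversaries described in the talk.
ADVERSARIES = {
    "weak": 500_000_000,          # 500 million guesses
    "medium": 50_000_000_000,     # 50 billion guesses
    "strong": 5_000_000_000_000,  # 5 trillion guesses
}


def percent_cracked(guess_numbers, budget):
    """Percentage of passwords whose guess number is within `budget`.

    `guess_numbers` holds one entry per password in a condition; None means
    the password was never guessed by the cracking algorithm.
    """
    if not guess_numbers:
        raise ValueError("need at least one password")
    cracked = sum(1 for g in guess_numbers if g is not None and g <= budget)
    return 100.0 * cracked / len(guess_numbers)


def guesses_per_core_per_second(total_guesses, cores, days):
    """Average guessing rate implied by a total guess count and a compute budget."""
    seconds = days * 24 * 60 * 60
    return total_guesses / (cores * seconds)


# Hypothetical guess numbers for five passwords in one condition
# (invented for this example, not taken from the study).
example_condition = [2, 7_000_000, 3_000_000_000, 900_000_000_000, None]

for name, budget in ADVERSARIES.items():
    print(name, percent_cracked(example_condition, budget))

# Rough rate implied by "about 100 cores for about 2 weeks" and 5 trillion guesses.
print(round(guesses_per_core_per_second(ADVERSARIES["strong"], cores=100, days=14)))
```

Evaluating `percent_cracked` at every guess number from 1 up to the strong adversary's budget, and plotting the result on a logarithmic X axis, would give one line of the kind shown on the graph; a lower line means fewer passwords cracked at the same budget. The rate calculation comes out to roughly 41,000 guesses per core per second, which, as the talk notes, depends heavily on the hardware and the hashing algorithm.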