Blase Ur’s presentation at USENIX ends with a Q&A session, covered in this entry, which sheds further light on the details of password meters.
Question: I really liked the study and I applaud your large sample size. I’m wondering if you had any way of measuring user tendency to abandon trying to create a password. So the thing that occurred to me with stringency is if they find it more annoying, are they more likely to just give up? Do you have any way to measure that?
Answer: We talked about this a bit in the paper, about exactly this idea of giving up. The first comment I would make is that, compared to, say, password policies, a meter is actually just a suggestion; it’s not a requirement. The meter could have said your password is awful, and you could still proceed, which is the case with meters.
However, we did look at participants’ tendencies to try to fill the meter, reach milestones, or give up. A really quick way to look at this is password length. You would expect that the baseline meter yields some length, that with the half-score meter people make longer passwords, and that with the one-third-score meter they make even longer passwords. But that’s actually not what we found: condition #10, the leftmost green one, was our half-score meter, and condition #11 was our one-third-score meter (see graph). And it’s actually with the half-score meter that we seem to have the longest passwords.
This was mirrored in our sentiment results: with the one-third-score meter, participants were less likely to say “It’s important that I get a high score”. So, exactly: you push them a lot, and they do indeed give up.
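To make the half-score and one-third-score conditions concrete, here is a minimal sketch of how such stringency scaling could work. The raw scoring rule (5 points per character) and the names `raw_score` and `meter_fill` are hypothetical and chosen only for illustration; they are not the scoring rule used in the study.

```python
def raw_score(password: str) -> float:
    """Hypothetical raw score: 5 points per character (illustration only)."""
    return 5.0 * len(password)


def meter_fill(password: str, scale: float = 1.0) -> float:
    """Return how full the meter is, in percent (0 to 100).

    scale = 1.0 models a baseline meter, 0.5 a half-score meter,
    and 1/3 a one-third-score meter. The scale is applied before
    capping, so a stricter meter can still be filled by a longer
    password.
    """
    return min(100.0, scale * raw_score(password))


# The same 20-character password under three stringency levels.
password = "a" * 20
for name, scale in [("baseline", 1.0), ("half-score", 0.5), ("one-third-score", 1 / 3)]:
    print(f"{name}: {meter_fill(password, scale):.1f}%")
```

Under these assumptions, a password that fills the baseline meter fills only half of the half-score meter and a third of the one-third-score meter, which is the kind of pressure the answer describes.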
Question: It seems odd that your guessability metrics took into account dictionary attacks, implicitly in the way that you’re doing them, but none of your meters complained about dictionary words in the passwords?
Answer: So, we did have a notice in the meters: if the password was in the OpenWall mangled wordlist, the cracking dictionary, then we did tell them: “Your password is in our dictionary of common passwords”. In some sense, there are two separate questions: what feedback do you give to people, and then how do you evaluate the security of passwords. The algorithm that we used takes a training set, and from that training set it computes the likely order of guesses. There are many ways to evaluate it, and actually our choice of a guessability metric was motivated by some of our past work, our group’s paper at Oakland this year, in which we compared a bunch of different guessability metrics and found this one to perform the best.
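The idea of deriving a guess order from a training set can be illustrated with a deliberately simplified sketch. This is a plain frequency ranking, not the algorithm used in the study; the function names and the tiny training list are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def build_guess_order(training_passwords: Iterable[str]) -> Dict[str, int]:
    """Rank passwords from a training set, most frequent first.

    Ties are broken alphabetically so the ranking is deterministic.
    Returns a mapping from password to its 1-based guess number.
    """
    counts = Counter(training_passwords)
    ordered = sorted(counts, key=lambda pw: (-counts[pw], pw))
    return {pw: rank for rank, pw in enumerate(ordered, start=1)}


def guess_number(password: str, guess_order: Dict[str, int]) -> Optional[int]:
    """Return how many guesses it takes to reach the password, or None if it is never guessed."""
    return guess_order.get(password)


# Hypothetical training data, for illustration only.
training = ["123456", "password", "123456", "letmein", "password", "123456"]
order = build_guess_order(training)
print(guess_number("123456", order))
print(guess_number("zebra-horse-42", order))
```

A lower guess number means a more guessable password under this toy model. A password missing from the training set is simply never guessed here, which is one reason a real guessability metric needs a model that generalizes beyond the passwords it has seen.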
Question: Great work! My only question is: between the enforced 8-character minimum and the fact that participants likely knew that passwords were the subject of this study, if you could comment at all on ecological validity?
Answer: Oh yes, absolutely. Thank you for bringing up ecological validity, which is always very important in any user study. I think there are arguments on both sides for ecological validity here. You can say: “OK, participants knew they were taking part in a study about passwords, so perhaps they paid a lot more attention to this”. In that case, this would absolutely be a kind of upper bound on the participants’ attention.
On the other hand, we can take a step back and say: “All they had to do was create an 8-character password and move on, and they still get paid whether they create an 8-character password or a 40-character password, and they’re not actually protecting any high-value account”. In that sense it was kind of surprising to us: why would anyone actually pay attention to the meter?
And just to take your point on ecological validity, which is, of course, very important, and expand on it a little further: the ideal circumstance for this study would have been if I were able to control a major website’s account creation page and run an experimental study there. That would have been really cool. Coming into this, we as a community didn’t really have much sense of what, if anything, these meters do. What we have is a kind of progress, and I think there’s actually a lot of room for further studies in this space, focusing even more on ecological validity, to really add to this work.