The Effect of Password Strength Meters 4: What Makes Meters Matter?

Blase Ur provides herein the results by metrics affecting time of password creation, user sentiment, memorability, and summarizes the study overall.

Let’s move on to the password creation process. In particular, I’ll highlight the time it took the participants to create a password, and also how participants changed their mind during creation. Of course, we captured all their keystrokes to do this analysis.

Results for time of creation

Here (see right-hand image) is the box plot of time of creation: the Y axis is the time of creation in seconds, and on the X axis you see the conditions, again, color coded for the groups. And what we see here is we see a couple of conditions that are really sticking out.

Stringent meters lead to longer time of password creation

In particular, compared to our control conditions, we see 3 of our 4 different stringent meters are much higher – that is, participants in those conditions spent more time creating the password (see left-hand image). So of course, our next question might be: what are they doing during that time? Are they just sitting there wondering why the meter is telling them their password is not very good, or are they doing something else?

Peculiarity of how participants changed their mind

And so, we looked at how participants changed their mind during the creation process. The way we defined a participant to have changed his or her mind was the following: first they had to type a password of 8 or more characters; they created a password that met the minimum requirements, the state of requirements to proceed. And then they completely erased it and went back to step 1. And then they typed in the password and eventually saved something that was different from what they had originally typed. And so, participants who did this, we said they changed their mind.

Detailed breakdown of the ‘Changed Their Mind’ scenario

Here is the graph of the percentage of participants in each condition who changed their mind (see left-hand image). What you see here is, if we look at our two control conditions, between 10% and 20% of participants in those conditions changed their mind during the password creation process.

Meters cause people to change their mind when creating passwords

However, participants in all four stringent conditions, as well as who saw the dancing bunny or the condition in which we pushed them only towards longer passwords, changed their mind at a higher rate. So, for instance, in 3 out of the 4 stringent conditions – half-score, one-third-score, and our bold text-only half-score condition – over half of participants changed their mind by our definition during the password creation process (see graph to the right). So, what we’re seeing here is that meters lead people to change their mind and really do something substantially different during the creation process.

So, I’ve presented a bunch of results about what passwords look like, and we found that with the stringent meters, with visual bars, participants created passwords that were harder to guess. Of course, if those passwords are also harder for participants to remember, what have we actually achieved? So we looked at a number of memorability metrics. For instance, our participants were able to successfully log in with their password about 5 minutes after they created it, and also, 2 or more days later, when they return to the second part of the study.

Results on password memorability

Participants returned for the second part of the study; we hypothesized that if a participant created a ridiculous password that they had no chance of remembering, maybe they wouldn’t even bother coming back for part 2. Also we looked at the proportion of participants who answered in our survey that they wrote their password down or stored it electronically, or whom we observed pasting in their password. And so, what we found here is not much: we found no significant differences across conditions for any of these memorability metrics.

This was both surprising and also good. We expected, while participants are making longer passwords, participants with the stringent meters and visual bars are making harder-to-guess passwords, surely, they must not be able to remember these, but that’s not what we found. We didn’t find significant differences across conditions, which is a good thing.

Evaluating user sentiment

So, finally, when we talk about participants’ sentiment: in addition to an open-ended response, participants rated their level of agreement with 14 different statements about both password creation – such as is it fun, difficult, annoying to create a password in this scenario – and also about the password meter, such as yes/no, agree/disagree that the meter gives me an incorrect score, or that’s important to me that my password gets a high score from the meter. And responses were given on a 5-point scale, from Strongly Disagree to Strongly Agree (see image above).

Where we found the biggest differences in our sentiment results was with the stringent meters. We found that participants found stringent meters a bit more annoying than the non-stringent meters, and I’ll go into more detail in a moment with this result. And we also found that participants believed the stringent meters to have violated their expectations. They basically didn’t think these meters were correct; they found it actually less important that the meter give them a high score.

Meter annoyance scale

Just to give you a little bit more of the concrete sense of this, let’s look at participants’ responses to “Password strength meter was annoying” (see images). Each condition will be represented by a horizontal bar. On the left-hand side in the red you see the participants who agreed with the statement: “Yes, the meter was annoying”,

Results on whether participants found the meters annoying

which in this case was about 13%; and in the right-hand side in the two shades of blue, see those who disagreed or strongly disagreed, which in this case was the majority of participants. In the middle in grey you see those who were neutral.

Again, I’ll bring our conditions in our different groupings (see left-hand image). So, if you look at the visual groupings, you really see not much difference from the baseline meter; none of these conditions was significantly different than the baseline.

Stringent meters appear to be much more annoying than baseline meter

Where we do see differences is with the stringent meters. So, for instance, while 13% of participants who saw the baseline meter agreed: “Yes, the meter is annoying”, between 27% and 40% of participants who saw the stringent meters agreed: “Yes, the meter is annoying” (see image).

So, this tells us: first of all, yes, people seem to be paying attention to the meters, and while you might say it’s not necessarily a great thing for them to be a little bit more annoyed, it’s still a minority of people, and by our other metrics they still seem to remember their passwords.

I’d like to conclude with some of our main takeaways. We came into this study with kind of the overriding question of “Do meters do anything?” And what we did find is yes, meters do matter. We found meters lead to longer passwords; in the case of our stringer meters – even longer passwords.

Of course, passwords are different. Are they actually more secure? In particular, our stringent meters led participants to create passwords that were harder to guess. However, we didn’t observe significant differences in memorability. So, participants are creating more secure passwords, but they still seem to be able to remember them.

Main takeaways

You might be saying: ok, well, now we should just make all of our meters super stringent, and no one will have a bad password. What we found though is that overly stringent meters don’t seem to add benefits. So, if we take our one-third-score meter, which was our most stringent meter, and compare to, say, our half-score meter, we didn’t find any extra security benefits, but we did find participants to be more annoyed and actually to trust the meter less, to say “It’s not as important that it gives me a high score”. So, that may even backfire.

So, what makes meters matter? The two important features that we found were:

1. The stringency of scoring. It was our stringent meters that performed best, and this is particularly interesting, since the stringent meters aren’t what we observed in the wild. It was our baseline meter that most closely represented what was in the wild, and to really get these extra benefits we could have more stringent meters.

Features by importance

2. Having a visual component seemed to be important. Our text-only conditions, both with normal and half-scoring, didn’t perform as well.

However, what we found to be our less important features were the visual elements: the color, the segmentation of the meter, the size of the meter. This was surprising: we had this gigantic meter, and that didn’t seem to be any better than having a really tiny meter. And it also didn’t seem to be really important whether we had a bar or we had a meter with lots of bunnyness. So, thank you very much and I’ll be happy to take questions.