Blase Ur proceeds with describing the workflow of the password meters study, highlighting here the impact of visual and scoring elements upon password strength.So, I just showed you a number of different features, and we, of course, wanted to know what each of these features is contributing. All of our other conditions were based on and varied from our baseline meter. In particular, we had 7 different conditions that varied from the baseline meter only in visual appearance; the scoring was the same as in the baseline meter. We had 4 conditions in which we kept the visual appearance the same, but varied the scoring from our baseline meter (see right-hand image). And then 2 conditions in which both visual and scoring elements were different than in the baseline. Now we’ll go through all these in turn.
With our visual differences – I’ll keep our baseline meter on top just for reference, and I’ll show our other visual conditions below it. Just as a reminder, no participants saw it like this. I’m also showing what I’m typing in plain text, and again, that’s not how our participants saw it in the study.So, what are the things that we wanted to look for with visual differences? (See left-hand image) Does it make a difference whether the meter is continuous or only has a few distinct segments? So we had one with three distinct segments that were either filled or unfilled. Does having the color change make a difference? So we had a meter that was always green rather than changing from red to green. Well, does the size make a difference? So we had a tiny meter – I guess, the gymnast-sized meter, and a huge meter, a sumo wrestler-sized meter. That’s a really big meter. The suggestions we were giving them – did they make a difference? So on one condition we took the suggestions away. And finally, does the bar make a difference? So we had a condition where we took the bar away and had a text-only meter. Notice: as we type in, they’re all pretty much going in lockstep, and they all fill up at the same time.
These were our conditions with visual differences, with one exception – there’s one more I’ve left out. We said: “Well, a lot of these meters we see have bars, and that’s what we observed in the wild. Does having a bar actually matter? Could we have some different visual metaphor?”And so we wanted to come up with something a little ridiculous, and we decided: “What’s more ridiculous than a dancing bunny?” So we have a dancing bunny meter. The stronger your password, the faster Bugs Bunny dances. And so we start typing, he is starting to pick up speed, and then as you keep going, he’s just kind of going crazy. So that’s out bunny meter (see left-hand image).
Our next category deals with scoring differences – and again, the baseline meter I’ll keep on top here for reference. There’re two main kinds of ideas we wanted to check. First, what if we just showed them always a lower score than in the baseline? So we had two conditions: half-score, in which we always showed half the score as in the baseline meter, and then one-third-score, in which, as you might guess, we always showed one-third the score as in the baseline meter.Next we wanted to test: what if we just always push participants towards a particular policy? Either, in the first case, having always longer passwords, pushing them towards 16 characters or more, or, in the second case, pushing them towards more complex passwords with multiple character classes. And you’ll notice, as we start typing in, at this point the baseline meter is already reading Excellent, half-score meter’s saying Poor, the one-third-score meter is saying Bad (see left-hand image). We keep going, keep going; roughly, about 30-character-long passwords would fill the half-score meter, and to fill the one-third-score meter you’d need about a 40-character password.
Then our final group of conditions were those that differed both visually and in scoring elements from our baseline meter. Particularly, visually they actually lacked a visual component: we took away the bar and had text-only conditions. In the first case, standard text; and in the second case, boldface text.We also changed the scoring; we used the same scoring as in the half-score meter, so we always showed half the score as in the baseline. So, for instance, the text-only meters right now are saying Bad, whereas the baseline meter is saying Fair (right-hand image). Similarly, close to about 30-character-long passwords would finally get a score Excellent from these half-score and bold text-only half-score meters. Four of the conditions I’ve already shown to you we gave a special name for, we called these our Stringent meters (see left-hand image), that is, those with more stringent scoring who always received a lower score than in the baseline meter. And so, from the scoring grouping it was half-score and one-third-score – these were the two that had visual bars, and then this final group of conditions I just showed you are text-only half-score and bold text-only half-score conditions. So, throughout our results I’ll refer collectively to these four meters as our stringent meters.