tl;dr: look up SHA-1 sums via https://mro.name/2022/pwned/passwords/ but beware that it doesn’t use the better Pwned Passwords API.
While the venerable xkcd on password strength discourages alphabet soup, there’s a thing even more important:
Don’t ever use leaked passwords
But how would you know? Troy Hunt maintains a set of leaked passwords that you can test a password candidate against, either online or by downloading the set and testing locally. (Online testing does not involve uploading your password.)
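To illustrate why online testing needs no upload, here is a minimal sketch of a lookup against the Pwned Passwords range API. Only the first 5 hex characters of the SHA-1 go over the wire, and the remaining 35 are compared locally. It assumes `sha1sum` (coreutils) and `curl` are installed. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, sha1sum and curl are assumed.

# Print the upper-case SHA-1 hex digest of the first argument.
sha1_upper () {
  printf '%s' "$1" | sha1sum | cut -c 1-40 | tr 'abcdef' 'ABCDEF'
}

# The 5-character prefix is sent to the server,
# the 35-character suffix never leaves the machine.
hash_prefix () { printf '%s\n' "$1" | cut -c 1-5; }
hash_suffix () { printf '%s\n' "$1" | cut -c 6-40; }

# Read a range response (lines "SUFFIX:COUNT", CRLF terminated) from stdin
# and print the count for the suffix in $1, or 0 if it is absent.
# Appending "" forces a string comparison even for all-digit suffixes.
count_in_range () {
  tr -d '\015' \
  | awk -F: -v s="$1" '($1 "") == (s "") { n = $2 } END { print n + 0 }'
}

# Ask the range API for all suffixes sharing the prefix, then match locally.
pwned_count () {
  h="$(sha1_upper "$1")"
  curl --silent "https://api.pwnedpasswords.com/range/$(hash_prefix "$h")" \
  | count_in_range "$(hash_suffix "$h")"
}
```

Calling `pwned_count 'some candidate'` prints how often the candidate shows up in the set, or 0 if it does not.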
I show how to handle such a large dataset and have fast lookups using low-profile machinery: cheap hardware, djb’s cdb and some shell/awk scripting.
Get the dataset
Be nice and download via torrent from https://haveibeenpwned.com/Passwords. I just use curl over HTTP, however.
Slice it up
The whole set is way too big for a single cdb, so we split it into one file per first hex character of the sha1 password hashes. Expect that to run for some days and produce 16 files around 3.3G each:
#!/bin/sh
# map the first hex char of the sha to a database filename
# curl -LO 'https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v8.7z'
# sudo apt-get install p7zip
# p7zip -d pwned-passwords-sha1-ordered-by-hash-v8.7z
#
# revert:
# $ cdb -d pwned-passwords-v8-sha1-?.cdb | head | cut -d : -f 2 | sed 's/->/:/'
#
readonly raw="pwned-passwords-sha1-ordered-by-hash-v8.txt"
date
echo "segmenting"
cat "${raw}" \
| tr -d '\015' \
| tr ':' ' ' \
| awk '//{f=substr($0,1,1);print >> f;fflush(f)}'
for c in 0 1 2 3 4 5 6 7 8 9 A B C D E F
do
echo "shard ${c}"
cdb -c -m "pwned-passwords-v8-sha1-${c}.cdb" \
< "${c}" \
&& rm "${c}"
done
date
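To see what the segmenting step produces, here is a sketch that runs the same `tr`/`awk` pipeline on a three-line sample with made-up counts. It is a variant, not the script above: the function name `segment` is hypothetical, the output directory is a parameter, and each shard file is closed when the first character changes instead of being flushed after every line. That relies on the input being ordered by hash. Whether it shortens the long run is untested.

```shell
#!/bin/sh
# Sketch only: "segment" is a hypothetical helper, the sample counts are made up.

# $1: file with HASH:COUNT lines, $2: directory for the per-character shard files
segment () {
  tr -d '\015' < "$1" \
  | tr ':' ' ' \
  | awk -v dir="$2" '{
      c = substr($0, 1, 1)
      if (c != prev) {
        # input is ordered by hash, so each shard is written in one run
        if (prev != "") close(dir "/" prev)
        prev = c
      }
      print >> (dir "/" c)
    }'
}

demo="$(mktemp -d)"
printf '%s\r\n' \
  '5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:3' \
  'A9993E364706816ABA3E25717850C26C9CD0D89D:2' \
  'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709:1' > "${demo}/sample.txt"
segment "${demo}/sample.txt" "${demo}"
ls "${demo}"     # one file per leading hex character, plus sample.txt
rm -r "${demo}"
```

Each shard file holds `HASH COUNT` lines, which is the key/value map format that `cdb -c -m` reads.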
Query
A CGI script is enough to look up the counter for a SHA-1 hash:
#!/bin/sh
do_retry () {
cat <<EOF
Status: 303 See Other
Location: .
Content-Type: text/plain

Retry
EOF
exit
}
# qs="&${QUERY_STRING}"
qs="&$(cat)" # POST to not log the sha1.
case "$(printf '%s' "${qs}" | cut -c 1-6)" in
"&sha1=")
sha1="$(echo "${qs}" | cut -c 7-46 | tr 'abcdef' 'ABCDEF')"
;;
*) do_retry ;;
esac
tpl="tpl.omg-yes.html"
shard="$(echo "${sha1}" | cut -c 1)"
# https://stackoverflow.com/a/39360056/349514
count="$(cdb -q "/home/mro/Downloads/pwned-passwords-v8-sha1-${shard}.cdb" "${sha1}" 2>/dev/null)"
[ $? = 0 ] || tpl="tpl.fine-no.html"
cat <<EOF
Status: 200 Ok
Content-Type: text/html; charset=utf-8

EOF
export count
envsubst < "${tpl}"
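The script above takes the 40 characters after `sha1=` as they come. A minimal sketch of stricter parsing, plus a command-line client that sends the same POST body the CGI reads, could look like this. It assumes `sha1sum` and `curl`. The names `sha1_of`, `parse_sha1` and `check_remote` are hypothetical, and the endpoint URL is a parameter you have to supply yourself.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of the original CGI.

# Print the upper-case SHA-1 hex digest of the first argument.
sha1_of () {
  printf '%s' "$1" | sha1sum | cut -c 1-40 | tr 'abcdef' 'ABCDEF'
}

# Extract the hash from a form body such as "sha1=<40 hex chars>".
# Prints the upper-case hash, or returns 1 if the input is malformed.
parse_sha1 () {
  case "$1" in
    sha1=*) ;;
    *) return 1 ;;
  esac
  h="$(printf '%s' "${1#sha1=}" | cut -c 1-40 | tr 'abcdef' 'ABCDEF')"
  [ "${#h}" -eq 40 ] || return 1
  case "$h" in
    *[!0-9A-F]*) return 1 ;;   # anything but hex digits is rejected
  esac
  printf '%s\n' "$h"
}

# POST the hash (never the password) to the CGI.
# $1: URL of the CGI (assumption: wherever you deployed it), $2: password candidate
check_remote () {
  curl --silent --data "sha1=$(sha1_of "$2")" "$1"
}
```

Rejecting anything that is not 40 hex digits before calling `cdb -q` keeps stray input out of the shard filename.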
Conclusion
The older the tools, the better they work on cheap computers, and plain text is a powerful data format.
The tools used are all decades old and work so well because they don’t use fancy formats such as XML or JSON but plain text.