Have I been pwned? – DIY style

Tue, 30. Aug 2022

Categories: en development Tags: password security cdb sh sha awk cgi

tl;dr: look up SHA-1 sums via https://mro.name/2022/pwned/passwords/, but beware that it doesn’t use the better pwned API.

While the venerable xkcd on password strength discourages alphabet soup, there’s a thing even more important:

Don’t ever use leaked passwords

But how would you know? Troy Hunt maintains a set of leaked passwords that you can test your password candidate against, either online or by downloading it and testing locally. (Online testing does not involve uploading your password.)
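The dataset lists uppercase hex SHA-1 hashes, so a candidate has to be hashed the same way before it can be compared. A minimal sketch, assuming `sha1sum` from GNU coreutils is installed; the helper name `sha1_upper` is made up for illustration:

```shell
#!/bin/sh
# sha1_upper STRING: print the uppercase hex SHA-1 of STRING.
# Assumes sha1sum from GNU coreutils; the function name is hypothetical.
sha1_upper() {
  # printf avoids hashing a trailing newline, which echo would add
  printf '%s' "$1" | sha1sum | cut -d ' ' -f 1 | tr 'abcdef' 'ABCDEF'
}

sha1_upper 'password'
# prints 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8
```

Keep in mind that a password typed as a command line argument may end up in your shell history.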

I show how to handle such a large dataset and get fast lookups using low-profile machinery: cheap hardware, djb’s cdb and some shell/awk scripting.

Get the dataset

Be nice and download via torrent from https://haveibeenpwned.com/Passwords. I just used a plain curl download, however.

Slice it up

The whole set is way too big for a single cdb, so we split it into one file per first hex character of the sha1 password hashes. Expect that to run for some days and produce 16 files around 3.3G each:

#!/bin/sh

# map the first hex char of the sha to a database filename

# curl -LO 'https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v8.7z'
# sudo apt-get install p7zip
# p7zip -d pwned-passwords-sha1-ordered-by-hash-v8.7z
#
# revert:
# $ cdb -d pwned-passwords-v8-sha1-?.cdb | head | cut -d : -f 2 | sed 's/->/:/'
#
readonly raw="pwned-passwords-sha1-ordered-by-hash-v8.txt"

date
echo "segmenting"
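# strip the CR of the CRLF line endings, turn HASH:COUNT into "HASH COUNT"
# (the layout cdb -m expects) and append each line to a file named after
# the first hex character. '>>' appends, so remove stale shards before a rerun.
tr -d '\015' < "${raw}" \
  | tr ':' ' ' \
  | awk '{ f = substr($0, 1, 1); print >> f; fflush(f) }'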

for c in 0 1 2 3 4 5 6 7 8 9 A B C D E F
do
  echo "shard ${c}"
  cdb -c -m "pwned-passwords-v8-sha1-${c}.cdb" \
    < "${c}" \
    && rm "${c}"
done
date
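To see what the segmenting step does without waiting for days, it can be tried on a handful of records first. The following is only a sketch: the three records and their counts are made up, and it stops before the `cdb -c` step so it runs without cdb installed.

```shell
#!/bin/sh
# Dry run of the segmenting step on made-up sample records.
tmp="$(mktemp -d)"
cd "${tmp}" || exit 1

# CRLF line endings and the HASH:COUNT layout mimic the raw download;
# the hashes and counts below are invented for illustration
printf '%s\r\n' \
  '0000000000000000000000000000000000000000:1' \
  '5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:42' \
  '5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:7' \
  > sample.txt

# same pipeline as above: drop CR, ':' becomes ' ', one file per first hex char
tr -d '\015' < sample.txt \
  | tr ':' ' ' \
  | awk '{ f = substr($0, 1, 1); print >> f }'

# the shard file "5" now holds the two records starting with 5
cat 5
```

Each resulting file holds one `HASH COUNT` pair per line and could then be fed to `cdb -c -m` as in the loop above.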

Query

A cgi is enough to look up the counter for a sha:

#!/bin/sh

do_retry () {
cat <<EOF
Status: 303 See Other
Location: .
Content-Type: text/plain

Retry
EOF
exit
}

# qs="&${QUERY_STRING}"
qs="&$(cat)" # POST to not log the sha1.

# printf instead of echo: 'echo -n' and backslash handling differ between shells
case "$(printf '%s' "${qs}" | cut -c 1-6)" in
"&sha1=")
  sha1="$(printf '%s' "${qs}" | cut -c 7-46 | tr 'abcdef' 'ABCDEF')"
  ;;
*) do_retry ;;
esac

tpl="tpl.omg-yes.html"

shard="$(echo "${sha1}" | cut -c 1)"
# https://stackoverflow.com/a/39360056/349514
count="$(cdb -q "/home/mro/Downloads/pwned-passwords-v8-sha1-${shard}.cdb" "${sha1}" 2>/dev/null)"
[ $? = 0 ] || tpl="tpl.fine-no.html"

cat <<EOF
Status: 200 Ok
Content-Type: text/html; charset=utf-8

EOF
export count
envsubst < "${tpl}"
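The script above takes whatever follows `sha1=` and uses its first character to pick the shard file. A stricter variant of that parsing step can be tested in isolation; this is a sketch only, with a hypothetical helper `parse_sha1` that is not part of the original CGI and that insists on exactly 40 hex digits:

```shell
#!/bin/sh
# parse_sha1 BODY: print the uppercase SHA-1 taken from a "sha1=<40 hex digits>"
# form body; return non-zero for anything else.
# Hypothetical helper, stricter than the CGI script it illustrates.
parse_sha1() {
  case "$1" in
    sha1=*) ;;
    *) return 1 ;;
  esac
  # "sha1=" is 5 characters long, so the hash occupies columns 6 to 45
  hex="$(printf '%s' "$1" | cut -c 6-45 | tr 'abcdef' 'ABCDEF')"
  # reject a wrong length
  [ "${#hex}" -eq 40 ] || return 1
  # reject anything that is not an uppercase hex digit
  case "${hex}" in
    *[!0-9A-F]*) return 1 ;;
  esac
  printf '%s\n' "${hex}"
}

parse_sha1 'sha1=5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8'
# prints 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8
```

With a check like this, a malformed request never reaches the `cdb -q` call or the shard filename.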

Conclusion

The older the tools, the better they work on cheap computers, and plain text is a powerful data format.

The tools used (cdb, sh, awk, CGI) are all decades old and work so well because they don’t use fancy formats such as XML or JSON but plain text.