64-bit hash for Postgres and BigQuery
Did you ever Google for a solution to something, find one, then wonder who wrote it only to discover it was you? Well, as of today I have. There had to be a reason the code looked so familiar and fit my problem so perfectly; I wasn’t even paying attention to what website I was looking at.
Which brings me to a follow-up post to 64-bit hex number conversion, which describes how to convert a 16-character hex string to the BigQuery INT64 type.
Why 64-bit numbers?
Today I found myself needing a 64-bit hash function to generate something I can use as an identity for some potentially very large string columns. Instead of storing the string I store the hash as a 64-bit integer type, i.e. BIGINT on Postgres or INT64 on BigQuery. Previously I had been using the BigQuery function FARM_FINGERPRINT, which Google promises is both very fast and a high quality hash even though it isn’t designed to be cryptographically strong - it is after all “only” 64 bits. Since I’m dealing with collections of tens to a few hundreds of millions of strings, the chance of a random collision with 64 bits is still vanishingly small (one in tens of billions). If I were really paranoid I could check an invariant before exporting hashed data to somewhere that matters, e.g. assert
count(distinct hash(string)) == count(distinct string).
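As a sanity check on that collision estimate, here is a quick back-of-envelope calculation in Python (my own sketch; the 300 million figure is an assumed workload size, and the hashes are assumed to behave like uniform random 64-bit values):

```python
# Odds that one new string's 64-bit hash collides with any of n
# existing hashes is roughly n / 2**64, i.e. "one in (2**64 / n)".
n = 300_000_000           # assumed workload: a few hundred million strings
one_in = 2 ** 64 / n      # a "one in one_in" chance of a collision
print(f"one in {one_in:.2e}")  # about one in 61 billion
```

That lands squarely in the "one in tens of billions" range.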
The problem with Farm fingerprint
The problem with Farm fingerprint is that although it is based on the FarmHash library, which has been open source for several years, it doesn’t seem to be widely implemented other than in Google products. Initially I was only using the hash values internally to normalize some tables and avoid repeating big long strings. The data was being computed in BigQuery and exported to Postgres, and the latter didn’t really need to know what hash algorithm was being used.
Postgres’ best alternative?
I found myself wanting to also generate these hash values in Postgres, but there are no implementations of FarmHash for Postgres that I could use. Looking at the list of hash functions, we find that MD5 is by far the fastest at 150M hashes/sec on their benchmark system, which is 4x the nearest competing hash, SHA1.
64-bit MD5 in Postgres
MD5 generates a 128-bit value, but for maximum compactness I’d like the id to be just 64 bits, which as mentioned above is, I believe, sufficiently unlikely to cause a collision for my data set that I don’t need to worry about it. Remember again that I don’t have a requirement for a hash secure against forced hash-collision exploits. Furthermore, with judicious schema design I can ensure a hash collision at worst would cause a constraint error when inserting data, or cause a string id generated with the hash to resolve to NULL instead of the “wrong” string.
After all, Google seems happy with 64 bits for their farm fingerprint hash, which gives a total of about 18 billion billion combinations (that’s roughly 18.4 quintillion).
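That count is easy to confirm:

```python
# Number of distinct 64-bit values: 2**64, about 18.4 quintillion.
print(2 ** 64)  # 18446744073709551616
```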
The MD5 function on Postgres outputs a 128-bit value in the form of a hex string.
So if you do:
select md5('hello, world');
you’ll get the string
e4d7f1b4ed2e42d15898f4b27b019da4
Now it is relatively straightforward to just chop off the first 16 characters, which represent the most significant 64 bits of the value, and convert them to a 64-bit int. A StackOverflow post offers up this pure SQL function to do that with the minimum of effort for a string input:
create function h_bigint(text) returns bigint as $$
  select ('x' || substr(md5($1), 1, 16))::bit(64)::bigint;
$$ language sql;
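If you want to sanity-check what this does outside the database, here is a minimal Python equivalent (my own sketch, not from the post): take the first 16 hex digits of the MD5 and reinterpret the unsigned value as a two’s-complement signed 64-bit integer, which is effectively what the ::bit(64)::bigint cast does:

```python
import hashlib

def h_bigint(s: str) -> int:
    # First 16 hex chars = most significant 64 bits of the MD5.
    hex16 = hashlib.md5(s.encode("utf-8")).hexdigest()[:16]
    u = int(hex16, 16)
    # Reinterpret as a signed 64-bit value, mirroring ::bit(64)::bigint.
    return u - 2 ** 64 if u >= 2 ** 63 else u

print(h_bigint("hello, world"))  # -1956829753693551919
```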
But this left me wondering - how do you do this with BigQuery?
MD5 on BigQuery
Like Postgres, BigQuery offers MD5 as a builtin. But the return value of this function is of type
BYTES, length 16. So our
original hello world test now gives us this result:
5NfxtO0uQtFYmPSyewGdpA== which is a Base64 encoding of the
128-bit MD5 value.
Now if you look at the conversion rules
for BigQuery you will find that the only things you can convert
BYTES to are more BYTES or a STRING.
But it turns out that BigQuery has a handy-dandy
TO_HEX() function that takes bytes and outputs the hexadecimal version. So now if we do:
select to_hex(md5('hello, world'));
we get the hex string
e4d7f1b4ed2e42d15898f4b27b019da4 which fortunately is exactly what Postgres gave us before.
A simple application of
substr() will also give us the most-significant 64 bits, which we can then cast to an
INT64 type. Let’s try that…
select cast(substr(to_hex(md5('hello, world')), 1, 16) as int64);
And BOOM! there it is:
Bad int64 value: e4d7f1b4ed2e42d1
Past me helps future me
So this is where I Googled “converting 64 bit hex string to int64 on BigQuery” and, as previously mentioned, found myself brought right back here.
With help from past me I can now define a temporary user defined function (UDF) in SQL that calculates the 16-character hex string of the first 64 bits of the MD5 and then calls my previously defined 64-bit hex to INT64 UDF like so:
CREATE TEMP FUNCTION hex64_to_int64(hex STRING) RETURNS INT64 AS (
  IF(hex < "8000000000000000",
     cast(concat("0x", hex) AS INT64),
     (SELECT (((ms32 & 0x7fffffff) << 32) | ls32) - 0x7fffffffffffffff - 1
      FROM (SELECT cast(concat("0x", substr(hex, 1, 8)) AS INT64) AS ms32,
                   cast(concat("0x", substr(hex, 9, 8)) AS INT64) AS ls32)))
);

CREATE TEMP FUNCTION md5_int64(s STRING) RETURNS INT64 AS (
  hex64_to_int64(substr(to_hex(md5(s)), 1, 16))
);
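To see how the UDF’s arithmetic works, here is the same logic transliterated to Python (a cross-check sketch of my own, not part of the original): values at or above 0x8000000000000000 won’t fit in a signed INT64 directly, so the UDF splits the hex into two 32-bit halves, clears the sign bit, and subtracts it back as 0x7fffffffffffffff + 1:

```python
def hex64_to_int64(h: str) -> int:
    # Lexicographic compare works because h is exactly 16 lowercase hex digits.
    if h < "8000000000000000":
        return int(h, 16)        # fits in a signed 64-bit int as-is
    ms32 = int(h[:8], 16)        # most significant 32 bits
    ls32 = int(h[8:16], 16)      # least significant 32 bits
    # Clear the sign bit, reassemble, then subtract 2**63 (written as
    # 0x7fffffffffffffff + 1 because a 0x8000000000000000 literal
    # would overflow a signed INT64 in BigQuery).
    return (((ms32 & 0x7FFFFFFF) << 32) | ls32) - 0x7FFFFFFFFFFFFFFF - 1

print(hex64_to_int64("e4d7f1b4ed2e42d1"))  # -1956829753693551919
```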
Now I can simply call
select md5_int64("hello, world");
and get back
-1956829753693551919, which is exactly what the Postgres
h_bigint function returns (phew!).