Generating and Using Identifiers
Over the last few months, I've had a number of conversations about unique identifiers. For starters, I get pulled into a periodic meeting to justify decisions we made about six years ago for my day job. Then there was a conversation on fedi (not a link to the conversation, it was too long ago and I auto-delete my posts).
Introduction
So, identifiers. Almost everything in the database needs a way of uniquely identifying a record. Composite keys are nice and precise but a PITA to manage, which suggests a single field is much better.
For a long time, an `int` (32-bit) was “good enough” because it easily handled ~2e9 (2,147,483,647) records (most databases I've worked with don't give you unsigned integers, and negatives are horrible to work with). Slap on your favorite database's way of creating a unique number with every record (sometimes called `IDENTITY`) and then hope everything works until you leave the project.
That doesn't always happen.
At many of my jobs, I've had to deal with millions (and occasionally billions) of rows of data. At my current one, which uses a multi-tenant database, we have customers with 10-20 million records each in the primary tables, easily 3x to 10x that in the secondary tables, and we're adding 100k records every day. That makes hitting the 2e9 limit uncomfortably probable since we're expected to be functional for at least three decades (imagine every gift card that is created and how each one has to be tracked for 3-5 years before it is filed with the state).
To handle that, we took the effort to bump the key fields from `int` to `bigint` (64-bit, `long` for C# folks). This gave us plenty of breathing room since now we can handle more records than we will ever insert into the database.
Horizontal Scaling
Automatically generating identifiers works great with a single database, but what happens when you have two, three, or even a hundred? Some multi-tenant applications get around this by having one database per customer, but that increases the maintenance load to manage a thousand or more databases. That increases update times and requires our DBA to watch everything for failures.
When the identifiers have to be shared, they need to be unique, otherwise they aren't very good identifiers. In our case, we have a number of servers that are all capable of inserting rows, and we don't want to funnel every row creation through a single database server to make sure they are unique (that would be a “single point of failure”). Instead, we want to make sure each one can create a unique identifier with some authority and be confident that another server isn't going to steal the same identifier.
I'm glossing over a lot of the specifics because this blog is focusing on a specific methodology of creating unique identifiers.
There are many techniques for having database servers create non-overlapping identifiers. We have sequences, identities in certain ranges, and lookup tables. The problem I find with most of these is that they require a certain amount of manual configuration and tweaking. Adding a new server causes everything to rebalance the possible generated IDs, or smaller “blocks” of identifiers are assigned and there is a risk of starving out one active server because it generates identifiers at eighteen times the rate of the others (this happened to me; the east coast server was answering 80% of the load).
Ideally, we want a setup that lets us toss together a new server, start it up, and then have it insert itself into the process smoothly without disrupting the others.
Conventions
For the rest of this post, I'm going to use the following conventions:
- `r` means a random number, ideally cryptographically strong random
- `t` for bits related to a timestamp
- `d` for device or worker
- `s` for a serial number, starting at 0 and incrementing by one
- `n` for the negative bit
- `4` for the number 4
Enter UUID v4 (and Related GUIDs)
UUIDs are awesome and horrible at the same time. On one hand, they are fantastic for creating non-overlapping identity keys that are highly unlikely to have a collision. They are also 128-bit numbers, which gives them a “huge” amount of space to work with.
The most common UUID I've dealt with is the v4, which is entirely random except for a “4” jammed into the beginning of the third block.
In hex:
rrrrrrrr-rrrr-4rrr-rrrr-rrrrrrrrrrrr
Look at all those randoms. For a v4, there are 122 bits of random, which means 5.3e36 possible numbers and makes a collision difficult to hit. I have never seen one in the wild, but that doesn't mean it is impossible. Just highly unlikely.
A GUID is the same as a UUID, but with a slightly different byte order.
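For reference, here's what generating one looks like in C# (a minimal sketch using only the built-in `Guid` type; nothing framework-specific):

```csharp
using System;

// Guid.NewGuid() produces a random (v4) value on current .NET; note the
// "4" at the start of the third group when it is printed.
Guid id = Guid.NewGuid();
Console.WriteLine(id); // e.g. 4aaab767-4571-42f3-9b2f-4d1e69c73e23

// ToByteArray() shows the mixed-endian layout that makes a GUID's raw
// bytes differ from the canonical UUID byte order.
Console.WriteLine(BitConverter.ToString(id.ToByteArray()));
```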
Up until this point, most of the code I've dealt with has been creating v4 UUIDs. Microsoft recommended them, there is almost no chance to have a conflict, and they ensure a nice random coverage of identifiers.
...
Databases hate them. The problem is that they are “random”, which means when the database builds up an index of the rows, it wants to put everything in a nice, neat order so it is easier to find things. When the database can find something quickly, the user doesn't complain that it takes two seconds to open up a page.
The reason is that most relational databases (PostgreSQL, MariaDB, SQL Server) arrange data into “pages”, which are blocks of data. There are data pages (where the row data is located) and index pages (where the index is located). When the DB server inserts a row, it will arrange the data pages in primary key order (you can change this with a “clustered” index).
Let's use this as an example, assuming that we can get four rows on a single page.
Page | ProjectId | ProjectTitle |
---|---|---|
1 | 0d5a76af-8a9f-460b-8e00-bcb23d8e4d57 | Flight of the Scions |
1 | 148dc206-d9bd-464a-a5b9-8ba49856fd46 | Sand and Bone |
1 | 4aaab767-4571-42f3-9b2f-4d1e69c73e23 | Sand and Blood |
1 | 69b46634-46dc-452b-a4d8-468403310009 | Sand and Ash |
2 | a2fbaa43-52da-484c-9dc2-3352b4b0aa8f | Second-Hand Dresses |
2 | ce0db656-97a5-4b06-85e1-173074b96950 | Allegro |
2 | e8af805b-bb8c-4adb-98a6-7fa9c4a86bb8 | Nor Curse Be Found |
2 | f991f021-8e83-4f23-8c75-ed4ba0953915 | Pack Daughter |
If we insert a new row, such as “5e187e66-25bb-4e5b-ae4c-07e986cb92fa” for Son of Vo, we have to insert it between Sand and Blood and Sand and Ash, but since we already have two filled pages, we have to split the first page: create a new page, insert it between pages one and two, and link them together (basically turning the old page two into page three; pages are a linked list). Then the new row is added to one of them (in this case, the least full one).
Page | ProjectId | ProjectTitle |
---|---|---|
1 | 0d5a76af-8a9f-460b-8e00-bcb23d8e4d57 | Flight of the Scions |
1 | 148dc206-d9bd-464a-a5b9-8ba49856fd46 | Sand and Bone |
1 | 4aaab767-4571-42f3-9b2f-4d1e69c73e23 | Sand and Blood |
2 | 5e187e66-25bb-4e5b-ae4c-07e986cb92fa | Son of Vo |
2 | 69b46634-46dc-452b-a4d8-468403310009 | Sand and Ash |
3 | a2fbaa43-52da-484c-9dc2-3352b4b0aa8f | Second-Hand Dresses |
3 | ce0db656-97a5-4b06-85e1-173074b96950 | Allegro |
3 | e8af805b-bb8c-4adb-98a6-7fa9c4a86bb8 | Nor Curse Be Found |
3 | f991f021-8e83-4f23-8c75-ed4ba0953915 | Pack Daughter |
Now imagine doing this a million times. Millions of rows jammed into gaps and pages split when there isn't room. By the end, you'll have all these empty spaces because re-indexing (rebuilding the data pages so they are nice and packed) is an expensive operation, so it isn't done automatically. Deleting rows also creates spaces, but if there is a space available while creating a row, the server will use it and fill it back up.
All of these empty spaces are called fragmentation. When it gets bad, the database server spends a lot of time jumping over the empty spaces. Every time there is a page split, the server has to stop and do the split. Rebuilding the index requires it to stop longer to pack everything in.
When setting up the table, one could set the “fill factor”, which is how much space to reserve, but insert enough GUIDs into the table and the page splitting starts all over again.
Also, UUIDs are 128 bits, which means they aren't a “native” data type for most databases and servers (which are 64-bit). This means they are usually stored as an array of bytes or as strings.
Snowflakes
This is where I was about seven or eight years ago. We knew that we needed to move toward horizontal scaling, UUIDs were not an option if we wanted performance, and we needed something that wouldn't require a lot of work.
Enter snowflake IDs. Originally from Twitter, the 64-bit snowflake has four parts:
- 1 bit to always make it positive
- 41 bits of timestamp
- 10 bits device identifier
- 12 bits of serial
In a bit layout (because, for some reason, they didn't make these numbers fit on a hex boundary):
ntttttttttttttttttttttttttttttttttttttttttddddddddddssssssssssss
The biggest part was that the timestamp was in the front. Since time is always increasing, the generated identifiers were “mostly” increasing. That meant significantly fewer random inserts into the middle of the index; everything was appended at the end (or nearly at the end, depending on frequency).
It also had the device ID. This meant that a server could “reserve” a number (0-1023) and there would be no chance of a collision. Of course, that also meant that the devices (which could be a server or process) had to coordinate the device assignment.
The last part was a serial number that started at 0 and counted up to 4,095. Since the timestamp was frequently at millisecond scale, that meant we could create 4,096 rows per millisecond with no chance of collision.
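To make the layout concrete, here's a minimal C# sketch of a snowflake-style generator. The class name, the custom epoch, the lock-based serialization, and the spin-wait when a millisecond's serials run out are my own assumptions, not Twitter's exact implementation:

```csharp
using System;

public sealed class SnowflakeGenerator
{
    // Custom epoch (an assumption); a real deployment picks a fixed date and never changes it.
    private static readonly DateTime Epoch = new(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    private readonly long _deviceId;   // 10 bits: 0-1023, coordinated between servers
    private long _lastTimestamp = -1;
    private long _serial;              // 12 bits: 0-4095 per millisecond
    private readonly object _lock = new();

    public SnowflakeGenerator(long deviceId)
    {
        if (deviceId is < 0 or > 1023)
            throw new ArgumentOutOfRangeException(nameof(deviceId));
        _deviceId = deviceId;
    }

    public long Next()
    {
        lock (_lock)
        {
            long timestamp = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;

            if (timestamp == _lastTimestamp)
            {
                _serial = (_serial + 1) & 0xFFF;   // wrap at 4,096
                if (_serial == 0)
                {
                    // Serials exhausted for this millisecond; wait for the next one.
                    while (timestamp <= _lastTimestamp)
                        timestamp = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
                }
            }
            else
            {
                _serial = 0;
            }

            _lastTimestamp = timestamp;

            // 1 sign bit (always 0) | 41 bits of timestamp | 10 bits of device | 12 bits of serial
            return (timestamp << 22) | (_deviceId << 12) | _serial;
        }
    }
}
```

Calling `new SnowflakeGenerator(deviceId: 3).Next()` hands back an always-positive `long` that sorts roughly by creation time.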
For my day job, this is what we went with since it was the smallest data type that fit our expected key spaces (billions but not quadrillions), and we were able to coordinate the device identifiers to ensure no chance of collisions.
Also, snowflakes are 64-bit, which means most languages can handle them natively (not JavaScript though, which chokes on them). The need to coordinate the device, however, makes them a poor option for truly distributed services (such as Fediverse servers or Matrix rooms).
There are also 128-bit versions of snowflake IDs in some libraries.
UUID v7
Recently, I learned about UUIDv7. Like the v4, this is a long string of hex with a “7” at the start of the third group. The primary difference is that the first part is a millisecond-based timestamp instead of a random number, but the latter half is still random.
In hex:
tttttttt-tttt-7rrr-rrrr-rrrrrrrrrrrr
This solves the page split problem nicely: the leading 48 bits of timestamp mean the values are always increasing, at the cost of a smaller random space (74 bits, or only 1.89e22 possible numbers). This means there is a greater chance of having a collision, but only if the numbers were generated in the same millisecond (that's the big part).
UUIDv7 also doesn't need the coordination of snowflakes, which lowers the complexity of generating a new identifier. At the moment, languages like C# don't have a built-in generator for v7, and databases still struggle with the larger data type (compared to a 64-bit number, which is native to most processors).
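In the absence of a built-in generator, rolling one by hand isn't much code. Here's a sketch based on my reading of the layout above (48-bit millisecond timestamp, then version and variant bits over random data); the class and method names are mine, and it returns the canonical string form rather than a `Guid` to sidestep .NET's mixed-endian byte handling:

```csharp
using System;
using System.Security.Cryptography;

public static class UuidV7
{
    public static string NewId()
    {
        // Start with 128 bits of cryptographically strong random data.
        Span<byte> bytes = stackalloc byte[16];
        RandomNumberGenerator.Fill(bytes);

        // Overwrite the first 48 bits with the Unix timestamp in milliseconds (big-endian).
        long ms = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        for (int i = 0; i < 6; i++)
            bytes[i] = (byte)(ms >> (8 * (5 - i)));

        // Set the version (7) in the high nibble of byte 6 and the variant (10xx) in byte 8.
        bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

        // Format as tttttttt-tttt-7rrr-rrrr-rrrrrrrrrrrr.
        string hex = Convert.ToHexString(bytes).ToLowerInvariant();
        return $"{hex[..8]}-{hex[8..12]}-{hex[12..16]}-{hex[16..20]}-{hex[20..]}";
    }
}
```

Two values generated in different milliseconds will sort in creation order as plain strings, which is the property the index cares about.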
For most of what I do, those negatives are relatively trivial with v7, so any new project would be based on UUIDv7 and I'd feel pretty good about it.
ULIDs
Notably, there are also ULIDs (specification), which also try to solve the problem. These are a lot like UUIDv7 except:
- They don't have the version in the third group
- They have the same 48 bits of timestamp
- They have 80 bits of random instead of 74 for an increase of randomness (1.21e24 instead of 1.89e22)
They are also formatted with uppercase Crockford Base32 encoding instead of hex digits. This creates a considerably more compact format, such as “01HZ613S22K8NR6W6GRV1Y5C52” instead of “022f4166-c274-43b1-a69d-a9c3c2153ca2” (26 characters versus 36).
While the ULID page says it is compatible with UUIDs, they aren't, because of that missing version field. There are systems that validate those entries, so I find the claim doesn't hold up when dealing with very strict people (like my DBA and customers). Still, they are “close” to UUIDs and fit in much the same space.
Formatting
In all of these cases, there is one other issue to address: formatting.
UUIDs and GUIDs are terrible in terms of how much space they use. Hex is great as a case-insensitive format, but the 8-4-4-4-12 grouping makes it difficult to read over a phone or to spot differences between two values. I struggle with more than 3-4 characters at a time, so trying to enter the twelve at the end can be a trial while distracted.
The Crockford encoding of ULIDs is much better because it uses a wider range of characters than hex, which lets the identifier be fewer characters. It also has some ability to handle difficult letters: there is no “I” or “L” in the encoding, so those can be treated as a “1” and still decode to the same result.
But, for ULIDs, the resulting ID should be in all capitals and I consider this to be a problem because I have a degree of dyslexia and it is easy to get lost in a 26-character series of letters and numbers all at the same height. It's basically a “bar” of numbers.
In the end, what I found was easier with customers and end users was to make the identifier lowercase to help distinguish the numbers and then break it into groups of four with dashes (favoring the right side). Using the above example of “01HZ613S22K8NR6W6GRV1Y5C52”, that would result in “01-hz61-3s22-k8nr-6w6g-rv1y-5c52”, which is slightly longer (32 characters) but less overwhelming to read over a phone or type into a computer balanced on a box with a keyboard.
These values can also be elided nicely. Much like you only have to enter the first 6-8 characters of a Git hash to reference it, there is a “reasonable” chance of avoiding overlaps if you only go with the last two blocks. I like to put “rv1y-5c52” in a web or GUI column but then put the full value in a copy button or as a tooltip.
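Here's a small sketch of both ideas; the helper names are hypothetical (nothing here comes from a ULID or UUID library):

```csharp
using System;
using System.Collections.Generic;

public static class HumanIds
{
    // Lowercase the identifier and break it into dash-separated groups of
    // four, counting from the right so any short group lands at the front.
    public static string Format(string id)
    {
        string s = id.Replace("-", "").ToLowerInvariant();
        var groups = new List<string>();
        for (int end = s.Length; end > 0; end -= 4)
            groups.Add(s[Math.Max(0, end - 4)..end]);
        groups.Reverse();
        return string.Join("-", groups);
    }

    // Keep only the last two groups for display in a column or list.
    public static string Elide(string formatted)
    {
        string[] parts = formatted.Split('-');
        return string.Join("-", parts[^2..]);
    }
}

// HumanIds.Format("01HZ613S22K8NR6W6GRV1Y5C52") => "01-hz61-3s22-k8nr-6w6g-rv1y-5c52"
// HumanIds.Elide("01-hz61-3s22-k8nr-6w6g-rv1y-5c52") => "rv1y-5c52"
```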
Conclusion
At this point, any new project I create would be based on UUIDv7. While they have a slightly smaller random space than ULIDs and UUIDv4, they play well with databases and have good support in most languages (C# and others have native UUID/GUID support, though maybe not v7 generation).
Readability is an important aspect for me, so I see myself formalizing on the lowercase-with-dashes format, eliding to shorter forms when the detail isn't needed.
Overall, I like exploring these situations. I'll admit, I wish I knew about ULIDs or UUIDv7 eight years ago, but it is what it is.