Generating and Using Identifiers (Part 2)

Last month, I wrote a post about Snowflake and UUIDv7 identifiers. I was pretty happy with it, but then I was playing around with Clew, a recent, smaller web search engine, and decided to look up my identifiers post just to check it out.

Right underneath mine was a post by Unkey on the same topic. It has some good observations that I didn't think about when I wrote mine, so I decided to expand on my post to include the ideas I like.

“Copying UUIDs is annoying”

Yes, but I didn't think about it. UUIDs (including my current crush on v7) and my preferred Crockford encoding have the same problem with double-clicking to copy, because of the dashes.

  • 01-hz61-3s22-k8nr-6w6g-rv1y-5c52

I still feel that removing the separators turns it into a long series of characters that is difficult to parse and elide in a consistent manner, so just removing the separators isn't an option. However, underscores don't have the same behavior (and the Unkey post uses them later for the prefix discussion).

  • 01_hz61_3s22_k8nr_6w6g_rv1y_5c52

I don't normally think about underscores since they conflict with underlines for links, but in this case, the identifier should be treated as a whole, so favoring the double-click and single word behavior is a major benefit.
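The difference can be approximated in a few lines of Python. This is only a rough sketch: it assumes that double-click selection behaves like the regular expression class `\w` (letters, digits, and underscores), which is close to what many editors and browsers do but is not guaranteed for every application. The helper name `words` is made up for the illustration.

```python
import re


def words(text: str) -> list[str]:
    # \w+ matches runs of letters, digits, and underscores, so a dash
    # ends a "word" while an underscore does not.
    return re.findall(r"\w+", text)


dashed = "01-hz61-3s22-k8nr-6w6g-rv1y-5c52"
underscored = "01_hz61_3s22_k8nr_6w6g_rv1y_5c52"

print(len(words(dashed)))       # 7
print(len(words(underscored)))  # 1
```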

Groups of Four

This isn't from the Unkey post, but with underscores, I think groups of four are excessive. I don't know why, but the lower bar somehow makes it more obvious. Plus, how often do you need to hand transcribe the numbers or read them out loud?

I think splitting the identifiers into groups of eight would be sufficient, as long as they are consistently eight instead of the default UUID formatting of 8-4-4-4-12, which is difficult for me.

  • 01_hz613s22_k8nr6w6g_rv1y5c52

I think groups of eight (counted from the right) are still workable, have a natural break point for eliding (“rv1y5c52” in the above example), and don't need quite as much space.
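Grouping from the right is easy to get wrong by one, so here is a small sketch of how it could work. The function name `group_from_right` and its parameters are my own for this illustration, not part of any library.

```python
def group_from_right(code: str, size: int = 8, sep: str = "_") -> str:
    """Split `code` into groups of `size` characters, counting from the right."""
    groups = []
    end = len(code)
    while end > 0:
        start = max(0, end - size)
        groups.append(code[start:end])
        end = start
    # Groups were collected right to left, so flip them back.
    return sep.join(reversed(groups))


print(group_from_right("01hz613s22k8nr6w6grv1y5c52"))
# 01_hz613s22_k8nr6w6g_rv1y5c52
```

Because the counting starts at the right, any leftover characters (the “01” here) end up in the first group, and the last group is always a full eight characters.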

“Prefixing”

I can't believe I forgot about prefixes. When GitHub and GitLab both started adding prefixes to their API keys, I thought it was a great idea. Then, when I went to Gitea and later to Forgejo, seeing just a bunch of hex characters for an API key felt subtly “wrong” to me.

The Unkey post pointed out the reason: purpose. A bunch of numbers is one thing, but knowing the purpose of the key helps identify it as a secret (e.g., something not to check into code) and possibly gives a way of preventing its use where it doesn't belong.

Naturally, since we want to treat the identifier as a single word, the prefix should be separated from the rest of the key by an underscore. It also solves a problem: when the first character of the encoded number is a digit, it is an invalid code identifier. Not that someone is going to say:

const 01_hz613s22_k8nr6w6g_rv1y5c52 = getIdentifier();

However, it fits better into the “idea” of code to treat it as one. So, starting the prefix with a non-number would make this more useful:

  • player_01_hz613s22_k8nr6w6g_rv1y5c52
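As a quick check of the “valid code identifier” point, Python's built-in `str.isidentifier()` applies Python's own naming rules, which, like most languages, reject a leading digit. This is only an illustration using Python's rules; other languages differ in the details.

```python
bare = "01_hz613s22_k8nr6w6g_rv1y5c52"
prefixed = "player_01_hz613s22_k8nr6w6g_rv1y5c52"

# A leading digit is not allowed in a Python identifier,
# but digits and underscores are fine after the first character.
print(bare.isidentifier())      # False
print(prefixed.isidentifier())  # True
```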

Separation

While thinking about it, I considered making a distinction between the prefix and the code with a double underscore.

  • bedor_player__01_hz613s22_k8nr6w6g_rv1y5c52

Originally, I didn't think it made sense because we don't really want to parse the identifier to pull out the prefix from the code itself. That said, there is one situation where we need to be able to parse: eliding.

  • rv1y5c52

Just breaking off the last group would result in stripping off the contextual prefix. We want to keep that information even when shortened, which means we have to have a mechanical way of identifying the prefix to determine how to break it apart.

  • bedor_player__01_hz613s22_k8nr6w6g_rv1y5c52
  • bedor_player__rv1y5c52

The double underscore also means that it remains a single word for purposes of selection.
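Here is a minimal sketch of that mechanical split, assuming the prefix never contains a double underscore of its own. The `elide` function and its `groups` parameter are names I picked for the example.

```python
def elide(identifier: str, groups: int = 1) -> str:
    """Keep the prefix and only the right-most `groups` groups of the code."""
    if groups < 1:
        raise ValueError("groups must be at least 1")
    # rpartition returns ("", "", identifier) when there is no "__",
    # so identifiers without a prefix pass through the same path.
    prefix, separator, code = identifier.rpartition("__")
    kept = code.split("_")[-groups:]
    return prefix + separator + "_".join(kept)


full = "bedor_player__01_hz613s22_k8nr6w6g_rv1y5c52"
print(elide(full))     # bedor_player__rv1y5c52
print(elide(full, 2))  # bedor_player__k8nr6w6g_rv1y5c52
print(elide("01_hz613s22_k8nr6w6g_rv1y5c52"))  # rv1y5c52
```

With a single underscore as the separator, there would be no reliable way to tell where “bedor_player” stops and the code starts.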

Global Prefixes

Having multiple words does make sense to me, so “bedor_player” seems reasonable to have since it can be arbitrarily long. That said, the identifiers already produce a “globally unique” value, so the prefix also doesn't have to be global. That means we just need enough scope information for the producer of the identifier, but not a full scope like:

  • com_mfgames_bedor_player__01_hz613s22_k8nr6w6g_rv1y5c52

Good examples would be:

  • forgejo__01_hz613s22_k8nr6w6g_rv1y5c52
  • f__01_hz613s22_k8nr6w6g_rv1y5c52
  • api_prod__01_hz613s22_k8nr6w6g_rv1y5c52
  • ap__01_hz613s22_k8nr6w6g_rv1y5c52

In this case, shorter is better.

Suffixes

That led into the question about suffixes. For example:

  • api_prod__01_hz613s22_k8nr6w6g_rv1y5c52
  • api__01_hz613s22_k8nr6w6g_rv1y5c52__prod

The main reason I don't think suffixes make sense is that I believe most people look at the beginning and the end of variables in most cases. And, having a textual prefix and suffix adds more complexity since both have to be looked at to understand the scope. Having only a prefix means the contextual information is always on the same side of the code.

Eliding

Eliding (abbreviating) is important when dealing with large identifiers. Git allows you to abbreviate a hash to an arbitrary number of characters when identifying a commit, but I remember trying to figure out if six or eight was the best for some purpose. With the grouping above, there is a natural break at eight and sixteen.

I realized this really is only a concern for smaller numbers where most of the identifier is zero, such as 00000000-0000-0000-0000-000000000000, which has a Crockford encoding of 0.

  • api__0

However, for most UUIDv7 identifiers, this isn't possible because the first 48 bits are a timestamp and won't ever be zero (again) after that epoch millisecond.
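To make that concrete, here is a sketch of an unpadded Crockford encoder. The function name `crockford` is mine, and the sample timestamp is just an arbitrary millisecond value from late 2023 placed in the top 48 bits the way UUIDv7 does, with every other bit left at zero.

```python
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"  # Crockford base32, lowercase


def crockford(value: int) -> str:
    """Encode a non-negative integer without any leading-zero padding."""
    if value == 0:
        return "0"
    digits = []
    while value:
        value, remainder = divmod(value, 32)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


print(crockford(0))  # 0

# Even with all the other bits zeroed out, the timestamp keeps the value long.
timestamp_ms = 1_700_000_000_000
print(len(crockford(timestamp_ms << 80)))  # 25
```

Padded to the full 26 characters, that second value gains one leading zero, which is where the “01” at the front of the examples comes from.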

But there is a known problem with eliding: a higher chance of collisions.

  • 01_hz613s22_k8nr6w6g_rv1y5c52 is “rv1y5c52”
  • 01_k8nr6w6g_hz613s22_rv1y5c52 is “rv1y5c52”

In these cases, there needs to be a check for duplicates as part of the code. That is the cost of being able to elide: additional complexity when selecting.
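One possible shape for that duplicate check is sketched below. It assumes unprefixed identifiers like the two above, and `elided_collisions` is a hypothetical helper name, not something from an existing library.

```python
from collections import defaultdict


def elided_collisions(identifiers: list[str], groups: int = 1) -> dict[str, list[str]]:
    """Map each elided form shared by more than one identifier to its owners."""
    by_short = defaultdict(list)
    for identifier in identifiers:
        short = "_".join(identifier.split("_")[-groups:])
        by_short[short].append(identifier)
    return {short: ids for short, ids in by_short.items() if len(ids) > 1}


known = [
    "01_hz613s22_k8nr6w6g_rv1y5c52",
    "01_k8nr6w6g_hz613s22_rv1y5c52",
]
print(elided_collisions(known))            # both share "rv1y5c52"
print(elided_collisions(known, groups=2))  # {}
```

When a collision shows up, the simplest fallback is to show one more group for the affected rows, which is what the second call demonstrates.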

I do have to mention that eliding is a “human” thing, not a protocol thing. We wouldn't be sending up an elided identifier as part of an HTTP header or an authorization key. We need eliding for showing a table of all known identifiers so someone can see additional information or delete the key. That said, in a grid, you probably don't need the prefix, but the double underscores make it possible to have it just in case you mix your keys in the same grid.

“Specification”

There isn't a formal specification, but a current summary of my wandering thoughts.

  • Generate a 128-bit UUIDv7
  • Encode the 128-bit number into lowercase Crockford32
  • Put underscores every eight characters starting at the right
  • If eliding, remove the left-most groups until the desired length
  • Add an optional prefix that does not start with a number, followed by a “__” separator
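The generation, encoding, grouping, and prefixing steps can be sketched in Python as follows. Treat it as a rough illustration rather than a reference implementation: the function names are mine, and the UUIDv7 value is assembled by hand from the RFC 9562 layout (48-bit millisecond timestamp, version 7, variant bits, random fill) because not every Python version ships a built-in generator for it.

```python
import secrets
import time

ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"  # Crockford base32, lowercase


def uuid7_int() -> int:
    """Build a 128-bit UUIDv7 value by hand (sketch of the RFC 9562 layout)."""
    timestamp_ms = (time.time_ns() // 1_000_000) & ((1 << 48) - 1)
    rand_a = secrets.randbits(12)
    rand_b = secrets.randbits(62)
    return (
        (timestamp_ms << 80)  # bits 127..80: Unix time in milliseconds
        | (0x7 << 76)         # bits 79..76: version 7
        | (rand_a << 64)      # bits 75..64: random
        | (0b10 << 62)        # bits 63..62: variant
        | rand_b              # bits 61..0: random
    )


def encode(value: int) -> str:
    """Encode a 128-bit integer as 26 zero-padded Crockford characters."""
    chars = []
    for _ in range(26):
        value, remainder = divmod(value, 32)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def group(code: str, size: int = 8) -> str:
    """Put an underscore every `size` characters, starting at the right."""
    head = len(code) % size
    parts = [code[:head]] if head else []
    parts += [code[i:i + size] for i in range(head, len(code), size)]
    return "_".join(parts)


def make_id(prefix: str = "") -> str:
    """Generate a new identifier with an optional prefix."""
    if prefix and prefix[0].isdigit():
        raise ValueError("prefix must not start with a number")
    body = group(encode(uuid7_int()))
    return f"{prefix}__{body}" if prefix else body


print(make_id("api"))  # something shaped like api__01_xxxxxxxx_xxxxxxxx_xxxxxxxx
```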

Conclusion

I'm still happy having a framework for this, though I could see extending one of my favorite C# strong-typing libraries, StronglyTypedId, to support the format and prefixes. I mean, it would be nice to be able to say:

using System;
using StronglyTypedIds;

[StronglyTypedId(Template.Crockford, Prefix="a")]
public partial struct ApiId { }

var id = new ApiId(Guid7.NewGuid());

Console.WriteLine(id);
// a__01_hz613s22_k8nr6w6g_rv1y5c52

Console.WriteLine(id.Elide(1));
// a__rv1y5c52

Hopefully, I've worked out most of the kinks I've found. I might come back, or I might formalize this into libraries. Either way, I think it is a workable pattern for the difficulties I've experienced in the past.
