Unique identifiers in distributed databases


Introduction

In this post, we will explore some properties of unique identifiers in distributed databases and their use cases, how the most popular unique ID, the UUID, is composed, and what CrateDB, a distributed shared-nothing database, uses.

About databases

One challenge of distributed databases (specifically those with a shared-nothing architecture) is data consistency: keeping all data consistent while staying performant is hard, since inserts and updates can happen at different rates on different nodes.

One effect of this is the lack of monotonically increasing IDs, commonly known as AUTO-INCREMENT; that is, a table column whose value gets incremented monotonically (usually by 1) every time a new row is inserted.

One example you might be familiar with is the SERIAL datatype in Postgres:

CREATE TABLE mytable
(
    id   serial PRIMARY KEY,
    data VARCHAR(128) NOT NULL
);

INSERT INTO mytable (data) VALUES ('some data');
INSERT INTO mytable (data) VALUES ('some data x2');

SELECT * FROM mytable;

 id | data
----+--------------
  1 | some data
  2 | some data x2

As you can see, id was incremented by one every time we inserted a row. This is possible because the database keeps track of the count with a counter (called a sequence in Postgres). This is effectively a unique ID within the table, and it is commonly used as a primary key.

But in our distributed world every insert has to be executed on every node or instance of the database. For the nodes to increment the ID correctly, they would need to communicate to keep their counters in sync; in a write-heavy scenario this would mean massive inter-node communication, locking, and a decrease in write performance, defeating one of the nice characteristics of distributed databases.

The need for uniqueness

In databases, we often need to uniquely identify rows. Primary keys are used for that. By definition, primary keys need to be unique and not null, and most distributed databases choose not to implement auto-increment sequences. What should be used then?

Well, there are two different flavors of ID creation: coordinated, the one we said is not usually implemented in distributed databases because coordination is expensive, and uncoordinated. Uncoordinated means that any database instance or node should be able to generate valid IDs without talking to the others and without collisions, and that's the challenge: never creating the same ID twice, otherwise it wouldn't be unique.

We could create a massive blob of pseudo-random data that would assure us it would never be generated again in the lifespan of the universe, avoiding collisions, but that would not be efficient. Imagine generating a 1GB unique ID for every row in your database; it's just not feasible. Like in encryption, there is a careful balance between how much uniqueness we want and how much we pay for it.

That's why there is not a single way of creating an uncoordinated identifier, and depending on its components and how it's created, it will have different uniqueness guarantees and other properties.

Some common components are:

  • just pseudo-random data
  • timestamps
  • metadata (thread number, MAC address, process id)
  • counters

And some examples of uncoordinated unique ids:

  • creation timestamp (simple created_at field in a table)
  • random data (UUID4)
  • creation timestamp + machine id + increment (Twitter's snowflake)
  • creation timestamp + random (ULID)
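
To make the last flavor concrete, here is a minimal Python sketch of a ULID-style ID, assuming a 48-bit millisecond timestamp followed by 80 random bits (a simplified illustration; a real ULID additionally mandates Crockford base32 encoding):

import os
import time

def ulid_like() -> str:
    # 48-bit millisecond timestamp + 80 random bits = 128 bits total
    ts = int(time.time() * 1000).to_bytes(6, 'big')
    rand = os.urandom(10)
    return (ts + rand).hex()

# IDs generated later sort lexicographically after earlier ones
# (at millisecond resolution), without any coordination.
print(ulid_like())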

While many engineers and companies have developed their own ways of creating unique IDs, the Internet Engineering Task Force (IETF), the 'official' body that takes care of promoting and publishing RFCs (standards), has its own take on it: the UUID (Universally Unique Identifier).

Sortable IDs are amazing

Being unique is the bare minimum requirement for a primary key, but there is another property we lose by not being able to use a sequence: sortability.

Having a column that is sortable makes aggregations more efficient and enables features like pagination, incremental queries or last-write detection.

For example, connector-x uses a sortable column to do client-side query partitioning, making loading data from a database into a dataframe very fast. More specifically, it works by issuing SELECT MIN(field), MAX(field) FROM table and computing different 'buckets'. It then issues several queries concurrently in different threads.

This is an example of a partition planner that I created in Rust, similar to connector-x's:

PartitionPlan(
    min_value=1,
    max_value=1200000,
    count=1199969,
    metadata_query='...omitted...',
    query_data=[
        'select * from (select l_orderkey from lineitem) '
        'where l_orderkey >= 1 and l_orderkey < 600001',
        'select * from (select l_orderkey from lineitem) '
        'where l_orderkey >= 600001 and l_orderkey < 1200000'
    ],
    partition_config=PartitionConfig(
        queries=['select l_orderkey from lineitem'],
        partition_on='l_orderkey',
        partition_num=2,
        partition_range=None,
        needed_metadata_from_source='CountAndMinMax',
        query_partition_mode='OnePartitionedQuery'
    )
)

The original query is select l_orderkey from lineitem, and we split it in two:

select l_orderkey from lineitem where l_orderkey >= 1 and l_orderkey < 600001

and

select l_orderkey from lineitem where l_orderkey >= 600001 and l_orderkey < 1200000

But this only works if l_orderkey is of a datatype that can be sorted and therefore range-filtered. We can use the same technique to create a client-side pseudo-paginator for any table, which can be useful when batch-processing large tables.

An example of this in Python, using the crate client library:

from crate import client


class BatchedTable:
    def __init__(self,
                 table_name: str,
                 id_column: str,
                 flat: bool,
                 chunk_size: int = 2048):
        self.table_name = table_name
        self.id_column = id_column
        self.flat = flat
        self.chunk_size = chunk_size
        self.chunk_bucket = 1
        self.connection = client.connect('http://localhost:4200')
        self.cursor = self.connection.cursor()

    def query_for(self):
        # Select the rows whose id falls into the current bucket:
        # (bucket - 1) * chunk_size < id <= bucket * chunk_size
        return (
            f'select * from {self.table_name}'
            f' where'
            f' {self.id_column} >'
            f' {(self.chunk_bucket - 1) * self.chunk_size} and'
            f' {self.id_column} <='
            f' {self.chunk_bucket * self.chunk_size} order by'
            f' {self.id_column}'
        )

    def __iter__(self):
        while True:
            query = self.query_for()
            self.cursor.execute(query)
            self.chunk_bucket += 1
            result = self.cursor.fetchall()
            if not result:
                return
            if self.flat:
                yield from result
            else:
                yield result
            # A short chunk means we reached the end of the table
            if len(result) < self.chunk_size:
                return


def table(table_name: str,
          batch_size: int = 10_000,
          id_column: str = 'row_number',
          flat: bool = True):
    return BatchedTable(table_name=table_name,
                        id_column=id_column,
                        flat=flat,
                        chunk_size=batch_size)


if __name__ == '__main__':
    for rows in table('search3', batch_size=1000, flat=False):
        print(rows)
        # [(...),...] 1k rows
        # [(...),...] 1k rows
        # [(...),...] 1k rows

table will iterate over the whole table in batches, without hitting an out-of-memory error on large tables.

All of this depends on a sortable ID, achieving maximum efficiency when the IDs are monotonically increasing by 1.

About unique IDs

Now we have more context on uniquely identifying rows in distributed databases. Let's try to understand the most popular and widely used ones: UUIDs.

If you understand them at a fundamental level, you will pretty much understand every form of unique ID there is; it's all very similar at the core.

Understanding UUIDs

There are eight versions of UUIDs. In May 2024 the latest stable revision of the standard (RFC 9562) was finally published, adding versions 6, 7, and 8. Every version creates the UUID differently, and each version has different use cases.

Versions 1–4 have historically been used in distributed systems; as distributed systems evolved, so did the requirements for the IDs, hence the new versions.

You’ve probably seen them many times already; they’re those long IDs separated by dashes that look like 51000350-1197-4f2e-bcef-ca8bc5e11b51 (UUID4).

Anatomy of a UUID

A UUID has 128 bits.

The 128 bits are grouped into 16 octets (bytes), numbered from 0 to 15.

This is, at the core, what a UUID is; the different versions just dictate how we generate these octets.

We can represent a UUID with different data 'types'; depending on the system, these types will typically just be the same data represented in different numerical bases:

  • Base 2 (binary): 01111011 11010001 11011101 10110101 … (128 bits in total, remaining octets omitted)
  • Base 10: 164584730332688677464161912706729264512
  • Base 16: 0x7bd1ddb5b15c4b68a507fd4ceb984580
  • Base 16 with dashes: 7bd1ddb5-b15c-4b68-a507-fd4ceb984580
  • Base64: e9HdtbFcS2ilB/1M65hFgA==

The most commonly used, and the default representation in most implementations, is base 16 with dashes.
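
A quick way to see these representations side by side is Python's built-in uuid module, here with the example value from above:

import base64
import uuid

u = uuid.UUID('7bd1ddb5-b15c-4b68-a507-fd4ceb984580')

print(u.int)                               # base 10
print(hex(u.int))                          # base 16
print(str(u))                              # base 16 with dashes
print(base64.b64encode(u.bytes).decode())  # base64: e9HdtbFcS2ilB/1M65hFgA==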

To give you a clearer look at how everything comes together, let's see the base 16 (hex) value of every octet. You can try this yourself in Python with:

>>> hex(0b1111011) # The first octet
'0x7b'
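
And to list all 16 octets at once, again with the standard library:

>>> import uuid
>>> u = uuid.UUID('7bd1ddb5-b15c-4b68-a507-fd4ceb984580')
>>> [hex(b) for b in u.bytes]
['0x7b', '0xd1', '0xdd', '0xb5', '0xb1', '0x5c', '0x4b', '0x68',
 '0xa5', '0x7', '0xfd', '0x4c', '0xeb', '0x98', '0x45', '0x80']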

Now, the difference between UUID versions is how we decide what these groups of bits will be.

The bits are split into groups, and some groups are common between versions: the version field (bits 48 to 51) and the variant field (bits 64 to 65).

UUID4

UUID4 has 5 groups of bits:

  1. random_a [0, 47] is random data.
  2. version [48, 51] is the version.
  3. random_b [52, 63] is random data.
  4. variant [64, 65] is the variant type.
  5. random_c [66, 127] is random data.

Another simple way to visualize it is to mark the first bit (inclusive) of every group.
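
As a sanity check, Python's uuid module exposes these two common fields directly; a freshly generated UUID4 reports version 4 and the RFC 4122 variant:

>>> import uuid
>>> u = uuid.uuid4()
>>> u.version  # bits 48 to 51
4
>>> u.variant  # bits 64 to 65
'specified in RFC 4122'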

What UUIDs is CrateDB using?

CrateDB, a shared-nothing distributed database, uses (as of version 5.10.2) three different kinds [1] of unique IDs:

  • ElasticFlakes
  • UUID4 in base64
  • DirtyUUID

ElasticFlakes

This implementation is inherited from the open source days of Elasticsearch [2]. It is a time-based ID optimized for Apache Lucene, the underlying library on which both CrateDB and Elasticsearch are built.

The elasticflake is used to generate the _id [3] of every row, and for the scalar function gen_random_text_uuid() [4]. Interestingly enough, the documentation for the scalar function says that it returns an 'ID' similar to flake IDs, but there are several differences from Flake IDs [5].

Flake IDs are 128 bits and k-ordered.

An elasticflake has 120 bits, divided into 15 octets or bytes.

It's composed of random data + timestamp + MAC address, divided into six groups:

  1. random_a [0, 15] is random data: the LSB and MSB of a random long.
  2. timestamp_a [16, 47] is a timestamp: the minutes-to-year part of the timestamp.
  3. metadata_a [48, 95] is a randomized MAC address (or better put, random data whose source is the MAC address of the node).
  4. timestamp_b [96, 103] is the seconds part of the timestamp in millis.
  5. random_b [104, 111] is the middle byte of the random long.
  6. timestamp_c [112, 119] is the LSB of the timestamp in millis, the milliseconds part.

The flake is then converted to base64 using a URL-safe alphabet; the default base64 alphabet uses '/' as an encoding character, meaning you could not use the ID as a URL query parameter.
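
To make the layout concrete, here is a rough Python sketch of how such a 15-byte ID could be assembled. This is only an illustration of the six groups described above, not CrateDB's actual implementation; the random long and the MAC bytes are stand-ins:

import base64
import os
import time

def elasticflake_sketch() -> str:
    rand = int.from_bytes(os.urandom(8), 'big')  # stand-in for the random long
    ts = int(time.time() * 1000)                 # timestamp in milliseconds
    mac = os.urandom(6)                          # stand-in for the randomized MAC address

    buf = bytearray(15)
    buf[0] = (rand >> 56) & 0xFF                             # random_a: MSB of the long
    buf[1] = rand & 0xFF                                     # random_a: LSB of the long
    buf[2:6] = ((ts >> 16) & 0xFFFFFFFF).to_bytes(4, 'big')  # timestamp_a: upper bits
    buf[6:12] = mac                                          # metadata_a
    buf[12] = (ts >> 8) & 0xFF                               # timestamp_b
    buf[13] = (rand >> 24) & 0xFF                            # random_b: a middle byte
    buf[14] = ts & 0xFF                                      # timestamp_c: millis LSB
    # URL-safe base64 ('-' and '_' instead of '+' and '/')
    return base64.urlsafe_b64encode(bytes(buf)).decode()

print(elasticflake_sketch())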

Since they're time-based IDs, one would expect to be able to sort and filter on them, but you can't.

One can naively check this:

create table t (
    real_pos integer,
    uuid generated always as gen_random_text_uuid(),
    inserted_at generated always as now()
);

Then insert:

insert into t (real_pos)
values (1), (2), (3)... --up to 10k

Ordering by _id or uuid gives different results than ordering by inserted_at:

select _id, * from t
order by _id
limit 10;

 _id                  | real_pos | inserted_at   | uuid
----------------------+----------+---------------+----------------------
 -1FXkZUBNha00pvZ-xcJ |      641 | 1741900217097 | _VFXkZUBNha00pvZ-xcJ
 -1FXkZUBNha00pvZ9BRZ |      177 | 1741900215385 | _VFXkZUBNha00pvZ9BRZ
 -1FXkZUBNha00pvZ_RgJ |      789 | 1741900217609 | _VFXkZUBNha00pvZ_RgJ
 -1FXkZUBNha00pvZ_xlf |      955 | 1741900218207 | 6xFXkZUBiP2zR8B6_9Bf
 -1FYkZUBNha00pvZ03Ht |    14491 | 1741900272621 | V4xYkZUB5Zpn6aZb0y3u
 -1FYkZUBNha00pvZ13Jm |    14653 | 1741900273511 | -YxYkZUB5Zpn6aZb1y1n
 -1FYkZUBNha00pvZ2nNJ |    14809 | 1741900274250 | g4xYkZUB5Zpn6aZb2i5K
 -1FYkZUBNha00pvZ3HRq |    14939 | 1741900274794 | 44xYkZUB5Zpn6aZb3C5q
 -1FYkZUBNha00pvZ63rJ |    15861 | 1741900278729 | f4xYkZUB5Zpn6aZb6zLJ
 -1FYkZUBNha00pvZ6nkO |    15705 | 1741900278287 | 74xYkZUB5Zpn6aZb6jEP

select _id, * from t
order by uuid
limit 10;

 _id                  | real_pos | inserted_at   | uuid
----------------------+----------+---------------+----------------------
 -VFXkZUBNha00pvZ-RYI |      495 | 1741900216585 | -1FXkZUBNha00pvZ-RYJ
 -VFXkZUBNha00pvZ8RNV |       11 | 1741900214613 | -1FXkZUBNha00pvZ8RNV
 -VFXkZUBNha00pvZ9hXX |      343 | 1741900216023 | -1FXkZUBNha00pvZ9hXX
 -VFYkZUBNha00pvZ-oDA |    16775 | 1741900282560 | -1FYkZUBNha00pvZ-oDA
 -VFYkZUBNha00pvZ0XCd |    14325 | 1741900272029 | -1FYkZUBNha00pvZ0XCd
 -VFYkZUBNha00pvZ33U- |    15093 | 1741900275518 | -1FYkZUBNha00pvZ33U-
 -VFYkZUBNha00pvZ4nYv |    15245 | 1741900276271 | -1FYkZUBNha00pvZ4nYv
 -VFYkZUBNha00pvZ53hl |    15549 | 1741900277605 | -1FYkZUBNha00pvZ53hl
 -VFYkZUBNha00pvZ5Xdp |    15401 | 1741900277097 | -1FYkZUBNha00pvZ5Xdp
 -VFYkZUBNha00pvZ8HyN |    16171 | 1741900279949 | -1FYkZUBNha00pvZ8HyN

select _id, * from t
order by inserted_at
limit 10;

 _id                  | real_pos | inserted_at   | uuid
----------------------+----------+---------------+----------------------
 6FFXkZUBNha00pvZ8RMV |        0 | 1741900214550 | gYtXkZUB5Zpn6aZb8fQW
 6VFXkZUBNha00pvZ8RMc |        1 | 1741900214556 | 61FXkZUBNha00pvZ8RMc
 7FFXkZUBNha00pvZ8RMh |        2 | 1741900214562 | g4tXkZUB5Zpn6aZb8fQi
 7VFXkZUBNha00pvZ8RMm |        3 | 1741900214566 | hYtXkZUB5Zpn6aZb8fQn
 7lFXkZUBNha00pvZ8RMt |        4 | 1741900214573 | 8FFXkZUBNha00pvZ8RMt
 8VFXkZUBNha00pvZ8RM0 |        5 | 1741900214580 | h4tXkZUB5Zpn6aZb8fQ0
 8lFXkZUBNha00pvZ8RM6 |        6 | 1741900214586 | 9FFXkZUBNha00pvZ8RM6
 9VFXkZUBNha00pvZ8RM- |        7 | 1741900214591 | iYtXkZUB5Zpn6aZb8fQ_
 9lFXkZUBNha00pvZ8RNE |        8 | 1741900214596 | i4tXkZUB5Zpn6aZb8fRF
 91FXkZUBNha00pvZ8RNJ |        9 | 1741900214601 | jYtXkZUB5Zpn6aZb8fRK

Ordering by inserted_at returns the correct results; the same behavior is also observable when filtering.

This happens for three reasons:

  1. The standard RFC base64 alphabet doesn't preserve the sort order of the unencoded bytes (demonstrated in the sketch after this list); by pure randomness a small set of elements could still sort correctly, though.
  2. Because of an implementation detail of how the random data is generated: a random integer is generated as a seed, and 1 is added every time a new ID is created. So if we insert a few values, restart the cluster, and insert again, we will have different seeds.
  3. Because there is random data before any sortable components (the timestamp parts).
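
A minimal sketch of reason 1: the standard base64 alphabet (A–Z, a–z, 0–9, +, /) is not in ASCII order, so sorting the encoded strings disagrees with sorting the raw bytes:

import base64
import os

# Sort 1000 random 4-byte values, then check whether their base64
# encodings are still in the same order. They almost never are: for
# example '+' encodes index 62 but has a lower ASCII code than 'A',
# which encodes index 0.
raw = sorted(os.urandom(4) for _ in range(1000))
encoded = [base64.b64encode(v) for v in raw]
print(encoded == sorted(encoded))  # almost always False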

You can observe this by looking at the prefixes of the _id values; they're similar because they're sequential. In setups with more than one node, this random seed gets refreshed more often; that's why you will only see sequentiality in small groups. If you use only one node, it's very visible.

If you want to check some of these things, see:

  • base32hex and a custom base64 are lexicographically sortable on UUID7, but base64 is not. [6]
  • elasticflake is still not lexicographically sortable in base32hex. [7]

UUID4

A random UUID4 as per RFC 4122 (2005), in URL-safe base64 encoding.

DirtyUUID

Just two integers cobbled together, not following the UUID RFC format.

References

1. CrateDB UUIDs class. GitHub, 2025-07-01.

2. Elasticsearch Is Open Source. Again! Elastic blog.

3. CrateDB Id class. GitHub, 2025-07-01.

4. CrateDB GenRandomTextUUIDFunction class. GitHub, 2025-07-01.

5. Flake: A decentralized, k-ordered id generation service in Erlang. GitHub, 2025-07-01.

6. surister sort script. GitHub, 2025-07-01.

7. surister elasticflaketest script. GitHub, 2025-07-01.