Introduction

CrateDB in 5.10.2 (and in most versions) uses three different kinds [1] of unique IDs:

ElasticFlakes
UUID4
DirtyUUID.

This implementation is inherited from the Open Source days of Elasticsearch [2] they are a time-based id optimized for Apache Lucene, the underlining library in which both CrateDB and Elasticsearch base their storage.

The elasticflake, is used to generate a column named _id [3] for every row, and for the scalar function gen_random_text_uuid() [4].

Structure

An elasticflake has 120 bits and is divided in 15 octets or bytes.

It's composed of a sequence + timestamp + mac address, divided in six groups:

#	Field	Range	Description
1	sequence_a	0, 15	A sequence that increments, based on a random long; the LSB and MSF.
2	timestamp_a	16, 47	Timestamp; the minutes-to-year part of the timestamp.
3	metadata_a	48, 95	Randomized MAC address (random data sourced from the node MAC address).
4	timestamp_b	96, 103	Seconds part of the timestamp in milliseconds.
5	sequence_b	104, 111	The middle byte of the random based sequence.
6	timestamp_c	112, 119	LSB byte of the timestamp in milliseconds; the milliseconds part.

Structured as follows:

The id is then converted to base64 using an alphabet that is URL safe, resulting in a string that has 20 characters.

        >>> select gen_random_text_uuid();
b8V2a5sBnHiLIAB_yCXG

Sortability

Since they’re time-based ids, one would expect that you could sort and filter on them, but they aren’t.

One can naively check this:

        create table t
(
    real_pos    INTEGER,
    uuid        TEXT      DEFAULT gen_random_text_uuid(),
    inserted_at TIMESTAMP DEFAULT now()
)

Then insert several values:

        insert into t (real_pos)
values (1) -- up to n

Trying to filter by _id, and uuid have different results than ordering by inserted_at:

        select _id, *
from t
order by _id
limit 10

_id	real_pos	uuid	inserted_at
"gMUab5sBnHiLIAB_IyVC"	1	"2dYab5sBB8EWMoDKI2FD"	1767095542595
"gcUab5sBnHiLIAB_KyUs"	2	"29Yab5sBB8EWMoDKK2Eu"	1767095544621
"gsUab5sBnHiLIAB_NCVi"	3	"hMUab5sBnHiLIAB_NCVi"	1767095546978
"hcUab5sBnHiLIAB_OSW5"	4	"h8Uab5sBnHiLIAB_OSW6"	1767095548346
"iMUab5sBnHiLIAB_RCVs"	5	"3dYab5sBB8EWMoDKRGFt"	1767095551085
"icUab5sBnHiLIAB_SiU-"	6	"i8Uab5sBnHiLIAB_SiU_"	1767095552575
"j8Uab5sBnHiLIAB_YiXv"	10	"4dYab5sBB8EWMoDKYmHx"	1767095558897
"jMUab5sBnHiLIAB_VCU_"	7	"39Yab5sBB8EWMoDKVGFA"	1767095555136
"jcUab5sBnHiLIAB_WSWp"	8	"t84ab5sBP3MoMmUXWZWq"	1767095556522
"jsUab5sBnHiLIAB_XyUC"	9	"uc4ab5sBP3MoMmUXX5UE"	1767095557892

Notice how everything seems sorted but the row with real_pos 10. There will always be some order maintained within small buckets of ids, this is because the first group (sequence_a) is a sequence. However, this sequence resets every time the int overflows or every time you restart the node. The sequence will also be different if another node generates the id. Since they all hold different sequences.

        select _id, * from t
order by inserted_at
limit 10

_id	real_pos	uuid	inserted_at
"gMUab5sBnHiLIAB_IyVC"	1	"2dYab5sBB8EWMoDKI2FD"	1767095542595
"gcUab5sBnHiLIAB_KyUs"	2	"29Yab5sBB8EWMoDKK2Eu"	1767095544621
"gsUab5sBnHiLIAB_NCVi"	3	"hMUab5sBnHiLIAB_NCVi"	1767095546978
"hcUab5sBnHiLIAB_OSW5"	4	"h8Uab5sBnHiLIAB_OSW6"	1767095548346
"iMUab5sBnHiLIAB_RCVs"	5	"3dYab5sBB8EWMoDKRGFt"	1767095551085
"icUab5sBnHiLIAB_SiU-"	6	"i8Uab5sBnHiLIAB_SiU_"	1767095552575
"jMUab5sBnHiLIAB_VCU_"	7	"39Yab5sBB8EWMoDKVGFA"	1767095555136
"jcUab5sBnHiLIAB_WSWp"	8	"t84ab5sBP3MoMmUXWZWq"	1767095556522
"jsUab5sBnHiLIAB_XyUC"	9	"uc4ab5sBP3MoMmUXX5UE"	1767095557892
"j8Uab5sBnHiLIAB_YiXv"	10	"4dYab5sBB8EWMoDKYmHx"	1767095558897

inserted_at returns the correct results, this is also observable in filtering as expected.

Additionally, Base64 doesn’t preserve lexicographical order for encoded strings, because of the alphabet it uses, by chance some buckets of elements can maintain orderability if the range of the alphabet's character is ordered. To dive deeper into this see How to make a sortable BASE64 encoding

Recap

The Elasticflake is a time-based 120 bit wide unique id that is encoded into a url-safe base64 resulting in a 20 character string.
It does not maintain lexicographical order, only in small buckets.

Furthermore:

base32hex and a custom base64 are lexicographically sortable on uuid7 but base64 is not. [5]
Elasticflake is still not lexicographically sortable in base32hex. [6]

3. CrateDB GitHub [Online.] Available: https://github.com/crate/crate/blob/79ff2217dde45d6a748b0f31b36c95a7b252878c/server/src/main/java/io/crate/analyze/Id.java#L48

4. CrateDB GitHub [Online.] Available: https://github.com/crate/crate/blob/master/server/src/main/java/io/crate/expression/scalar/GenRandomTextUUIDFunction.java

5. surister GitHub [Online.] Available: https://github.com/surister/mylab/blob/master/crate_uuid/sort.py

6. surister GitHub [Online.] Available: https://github.com/surister/mylab/blob/master/crate_uuid/elasticflaketest.py

Unique ids in CrateDB

Introduction

ElasticFlakes

Structure

Sortability

Recap

UUID4

DirtyUUID

References