Unique ids in CrateDB

Written by

Introduction

CrateDB in 5.10.2 (and in most versions) uses three different kinds [1] of unique IDs:

  • ElasticFlakes
  • UUID4
  • DirtyUUID.

ElasticFlakes

This implementation is inherited from the Open Source days of Elasticsearch [2] they are a time-based id optimized for Apache Lucene, the underlining library in which both CrateDB and Elasticsearch base their storage.

The elasticflake, is used to generate a column named _id [3] for every row, and for the scalar function gen_random_text_uuid() [4].

Structure

An elasticflake has 120 bits and is divided in 15 octets or bytes.

It's composed of a sequence + timestamp + mac address, divided in six groups:

#FieldRangeDescription
1sequence_a0, 15A sequence that increments, based on a random long; the LSB and MSF.
2timestamp_a16, 47Timestamp; the minutes-to-year part of the timestamp.
3metadata_a48, 95Randomized MAC address (random data sourced from the node MAC address).
4timestamp_b96, 103Seconds part of the timestamp in milliseconds.
5sequence_b104, 111The middle byte of the random based sequence.
6timestamp_c112, 119LSB byte of the timestamp in milliseconds; the milliseconds part.

Structured as follows:

The id is then converted to base64 using an alphabet that is URL safe, resulting in a string that has 20 characters.

        
>>> select gen_random_text_uuid();
b8V2a5sBnHiLIAB_yCXG

Sortability

Since they’re time-based ids, one would expect that you could sort and filter on them, but they aren’t.

One can naively check this:

        
create table t
(
real_pos INTEGER,
uuid TEXT DEFAULT gen_random_text_uuid(),
inserted_at TIMESTAMP DEFAULT now()
)

Then insert several values:

        
insert into t (real_pos)
values (1) -- up to n

Trying to filter by _id, and uuid have different results than ordering by inserted_at:

        
select _id, *
from t
order by _id
limit 10
_idreal_posuuidinserted_at
"gMUab5sBnHiLIAB_IyVC"1"2dYab5sBB8EWMoDKI2FD"1767095542595
"gcUab5sBnHiLIAB_KyUs"2"29Yab5sBB8EWMoDKK2Eu"1767095544621
"gsUab5sBnHiLIAB_NCVi"3"hMUab5sBnHiLIAB_NCVi"1767095546978
"hcUab5sBnHiLIAB_OSW5"4"h8Uab5sBnHiLIAB_OSW6"1767095548346
"iMUab5sBnHiLIAB_RCVs"5"3dYab5sBB8EWMoDKRGFt"1767095551085
"icUab5sBnHiLIAB_SiU-"6"i8Uab5sBnHiLIAB_SiU_"1767095552575
"j8Uab5sBnHiLIAB_YiXv"10"4dYab5sBB8EWMoDKYmHx"1767095558897
"jMUab5sBnHiLIAB_VCU_"7"39Yab5sBB8EWMoDKVGFA"1767095555136
"jcUab5sBnHiLIAB_WSWp"8"t84ab5sBP3MoMmUXWZWq"1767095556522
"jsUab5sBnHiLIAB_XyUC"9"uc4ab5sBP3MoMmUXX5UE"1767095557892

Notice how everything seems sorted but the row with real_pos 10. There will always be some order maintained within small buckets of ids, this is because the first group (sequence_a) is a sequence. However, this sequence resets every time the int overflows or every time you restart the node. The sequence will also be different if another node generates the id. Since they all hold different sequences.

        
select _id, * from t
order by inserted_at
limit 10
_idreal_posuuidinserted_at
"gMUab5sBnHiLIAB_IyVC"1"2dYab5sBB8EWMoDKI2FD"1767095542595
"gcUab5sBnHiLIAB_KyUs"2"29Yab5sBB8EWMoDKK2Eu"1767095544621
"gsUab5sBnHiLIAB_NCVi"3"hMUab5sBnHiLIAB_NCVi"1767095546978
"hcUab5sBnHiLIAB_OSW5"4"h8Uab5sBnHiLIAB_OSW6"1767095548346
"iMUab5sBnHiLIAB_RCVs"5"3dYab5sBB8EWMoDKRGFt"1767095551085
"icUab5sBnHiLIAB_SiU-"6"i8Uab5sBnHiLIAB_SiU_"1767095552575
"jMUab5sBnHiLIAB_VCU_"7"39Yab5sBB8EWMoDKVGFA"1767095555136
"jcUab5sBnHiLIAB_WSWp"8"t84ab5sBP3MoMmUXWZWq"1767095556522
"jsUab5sBnHiLIAB_XyUC"9"uc4ab5sBP3MoMmUXX5UE"1767095557892
"j8Uab5sBnHiLIAB_YiXv"10"4dYab5sBB8EWMoDKYmHx"1767095558897

inserted_at returns the correct results, this is also observable in filtering as expected.

Additionally, Base64 doesn’t preserve lexicographical order for encoded strings, because of the alphabet it uses, by chance some buckets of elements can maintain orderability if the range of the alphabet's character is ordered. To dive deeper into this see How to make a sortable BASE64 encoding

Recap

  • The Elasticflake is a time-based 120 bit wide unique id that is encoded into a url-safe base64 resulting in a 20 character string.
  • It does not maintain lexicographical order, only in small buckets.

Furthermore:

  • base32hex and a custom base64 are lexicographically sortable on uuid7 but base64 is not. [5]
  • Elasticflake is still not lexicographically sortable in base32hex. [6]

UUID4

A random UUID4 as per RFC 4122 (2005), in url safe Base64 encoding.

This unique id is not exposed to the user and is used in internal database operations.

DirtyUUID

It is just two random integers cobbled up together, not following the UUID RFC format.

This is also not exposed to the user, it is used in internal database operations and is more efficient to generate than the UUID4.

References

1. CrateDB GitBub [Online.] Available: https://github.com/crate/crate/blob/master/server/src/main/java/org/elasticsearch/common/UUIDs.java

2. S. Banon "Elasticsearch Is Open Source. Again!," Elastic Blog Accessed: Dec. 30, 2025. [Online.] Available: https://www.elastic.co/blog/elasticsearch-is-open-source-again

3. CrateDB GitHub [Online.] Available: https://github.com/crate/crate/blob/79ff2217dde45d6a748b0f31b36c95a7b252878c/server/src/main/java/io/crate/analyze/Id.java#L48

4. CrateDB GitHub [Online.] Available: https://github.com/crate/crate/blob/master/server/src/main/java/io/crate/expression/scalar/GenRandomTextUUIDFunction.java

5. surister GitHub [Online.] Available: https://github.com/surister/mylab/blob/master/crate_uuid/sort.py

6. surister GitHub [Online.] Available: https://github.com/surister/mylab/blob/master/crate_uuid/elasticflaketest.py