Generating TPC-H data for PostgreSQL
Introduction
TPC-H is a popular performance benchmark for database systems. It's composed of several tables that follow a specification, and it's also very useful for testing data load/write integrations with databases.
Setting up the data can be confusing, so in this guide we'll see how to generate different sizes of TPC-H data and load them into PostgreSQL.
Setting up tpchgen-cli
We will use tpchgen-cli (from the tpchgen-rs project) to generate the dataset. There are several ways to install it; the easiest are via Python and Docker. Pick the option that best fits your working system.
Python
Using Python's package manager pip:
pip install tpchgen-cli
Collecting tpchgen-cli
Downloading tpchgen_cli-2.0.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Downloading tpchgen_cli-2.0.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 5.9 MB/s eta 0:00:00
Installing collected packages: tpchgen-cli
Successfully installed tpchgen-cli-2.0.1
tpchgen-cli --version
tpchgen 2.0.1
Using uv
uv run --with tpchgen-cli tpchgen-cli --version
tpchgen 2.0.1
Docker
docker run -it \
  ghcr.io/astral-sh/uv:python3.14-rc-bookworm \
  uv run --with tpchgen-cli \
  tpchgen-cli --version
tpchgen 2.0.1
Generating data
The following command will generate the data for all tables in CSV format, in the /data directory.
tpchgen-cli -s 1 -f csv -o /data
The -s option specifies the scale factor: 1 is 1x, 10 is 10x. To give you an idea, at a 10x factor the lineitem table has roughly 60 million rows.
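For instance, generating the 10x dataset into the same directory looks like this (a larger run; at this scale expect roughly 10 GB of raw data, so make sure you have the disk space):
tpchgen-cli -s 10 -f csv -o /data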
After running the command you should have several files in the directory, one per table.
ls /data
customer.csv lineitem.csv nation.csv orders.csv part.csv partsupp.csv region.csv supplier.csv
Python
uv run --with tpchgen-cli tpchgen-cli -s 1 -f csv -o /data
Docker
docker run -it --rm -v /data/:/data \
  ghcr.io/astral-sh/uv:python3.14-rc-bookworm \
  uv run --with tpchgen-cli \
  tpchgen-cli -s 1 -f csv -o /data
Loading data
Your PostgreSQL instance needs to have access to the /data directory, because COPY ... FROM reads the file on the server side.
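If you run PostgreSQL in Docker, one way to do this is to mount the directory into the container. A minimal sketch, assuming the official postgres image; the container name, password, and image tag are just examples:
docker run -d --name tpch-pg \
  -e POSTGRES_PASSWORD=postgres \
  -v /data:/data \
  -p 5432:5432 \
  postgres:16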
Create the table(s)
There are 8 tables; one of the most used is lineitem, since it's the biggest one.
CREATE TABLE IF NOT EXISTS "lineitem"(
    "l_orderkey"      INT,
    "l_partkey"       INT,
    "l_suppkey"       INT,
    "l_linenumber"    INT,
    "l_quantity"      DECIMAL(15,2),
    "l_extendedprice" DECIMAL(15,2),
    "l_discount"      DECIMAL(15,2),
    "l_tax"           DECIMAL(15,2),
    "l_returnflag"    CHAR(1),
    "l_linestatus"    CHAR(1),
    "l_shipdate"      DATE,
    "l_commitdate"    DATE,
    "l_receiptdate"   DATE,
    "l_shipinstruct"  CHAR(25),
    "l_shipmode"      CHAR(10),
    "l_comment"       VARCHAR(44),
    "l_dummy"         VARCHAR(10)
);
Load csv
COPY lineitem (
    l_orderkey, l_partkey, l_suppkey, l_linenumber,
    l_quantity, l_extendedprice, l_discount, l_tax,
    l_returnflag, l_linestatus,
    l_shipdate, l_commitdate, l_receiptdate,
    l_shipinstruct, l_shipmode, l_comment
)
FROM '/data/lineitem.csv'
DELIMITER ','
CSV HEADER;
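If the server process can't read /data directly (for example, when it runs on another host), psql's client-side \copy command accepts the same options and reads the file from the client machine instead. Once the load finishes, a quick sanity check; at scale factor 1, lineitem should have roughly 6 million rows:
SELECT count(*) FROM lineitem;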
Wrap up
There are 8 tables; in this guide we only saw how to load lineitem, but you can easily find the DDL for the other tables with a quick Google search.
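As an example, here is a DDL sketch for the smallest table, region, based on the column types in the TPC-H specification (adjust quoting and constraints to taste):
CREATE TABLE IF NOT EXISTS "region"(
    "r_regionkey" INT,
    "r_name"      CHAR(25),
    "r_comment"   VARCHAR(152)
);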