153 changes: 153 additions & 0 deletions docs/getting-started/example-datasets/job.md
@@ -0,0 +1,153 @@
---
description: 'The Join Order Benchmark (JOB) data set and queries.'
sidebar_label: 'JOB'
slug: /getting-started/example-datasets/job
title: 'Join Order Benchmark (JOB)'
doc_type: 'guide'
keywords: ['example dataset', 'job', 'join order benchmark', 'benchmark', 'performance testing', 'query optimizer', 'join ordering']
---

The Join Order Benchmark (JOB) stresses the query optimizer with 113 analytical queries over a real-world, highly correlated dataset (a snapshot of IMDb). Since its introduction, JOB has become the de facto standard for assessing relational database query optimizers, in particular their cardinality estimation and join ordering. Unlike synthetic benchmarks that assume uniform, independent data, JOB uses real data with skew and correlations, which makes both tasks hard.

The dataset holds about 74 million rows across 21 tables and takes around 1.15 GiB compressed in ClickHouse.

The 113 queries are organized into 33 families (`1`–`33`). Queries within a family (`a`, `b`, `c`, ...) share the same join graph but differ in their selection predicates.
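
The naming scheme can be taken apart with plain shell parameter expansion. The following is a minimal sketch, assuming query IDs of the form `<family number><variant letter>` as described above; the helper names `query_family`, `query_variant`, and `sort_queries` are hypothetical and not part of the benchmark.

```bash
# Split a JOB query ID such as "13b" into its family (13) and variant (b).
# Assumes IDs of the form <number><single lowercase letter>.
query_family()  { printf '%s\n' "${1%[a-z]}"; }
query_variant() { printf '%s\n' "${1#"${1%[a-z]}"}"; }

# Order query IDs (one per line on stdin) by family number, then by variant letter.
sort_queries() { sort -n; }

for id in 1a 13b 33c; do
  echo "query ${id}: family $(query_family "$id"), variant $(query_variant "$id")"
done
```

Sorting numerically by family keeps `2a` ahead of `10a`, which a plain lexical sort would not.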

**References**

- [How Good Are Query Optimizers, Really?](https://www.vldb.org/pvldb/vol9/p204-leis.pdf) (Leis et al., VLDB 2015)

- [Join Order Benchmark](https://github.com/gregrahn/join-order-benchmark) repository

## Creating the tables {#creating-tables}

The JOB dataset is a snapshot of IMDb with 21 tables. The table definitions are available in [`init_cloud.sql`](https://github.com/ClickHouse/ClickHouse/blob/master/tests/benchmarks/job/init_cloud.sql) in the ClickHouse repository.

Each table uses the [`MergeTree`](/engines/table-engines/mergetree-family/mergetree) engine sorted by its primary key column `id`, mirroring the original PostgreSQL schema where every table declares `id integer NOT NULL PRIMARY KEY`. Nullable PostgreSQL columns map to `Nullable(...)` types.

Create the tables:

```bash
curl -O https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/tests/benchmarks/job/init_cloud.sql
clickhouse client --query "CREATE DATABASE IF NOT EXISTS job"
clickhouse client --database job --queries-file init_cloud.sql
```
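
To confirm that all 21 tables were created, the list of existing tables can be compared against the expected names. This is a small sketch rather than part of the benchmark tooling: the function name `missing_tables` and the variable `JOB_TABLES` are hypothetical, and the table names are the ones used throughout this page.

```bash
# The 21 JOB tables, as listed on this page.
JOB_TABLES="aka_name aka_title cast_info char_name comp_cast_type company_name
company_type complete_cast info_type keyword kind_type link_type
movie_companies movie_info movie_info_idx movie_keyword movie_link
name person_info role_type title"

# Reads table names (one per line) on stdin and prints every expected
# JOB table that is absent from that list. Prints nothing if all exist.
missing_tables() {
  present="$(cat)"
  for table in $JOB_TABLES; do
    # -x matches whole lines only, so "name" does not match "aka_name".
    printf '%s\n' "$present" | grep -qxF -- "$table" || printf '%s\n' "$table"
  done
}
```

One way to use it is `clickhouse client --query "SHOW TABLES FROM job" | missing_tables`, which should print nothing once `init_cloud.sql` has run successfully.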

## Loading the data {#loading-the-data}

The dataset is available as Parquet files in a public S3 bucket.

Load all 21 tables directly from S3 using the [`s3`](/sql-reference/table-functions/s3) table function:

```bash
for table in aka_name aka_title cast_info char_name comp_cast_type company_name \
company_type complete_cast info_type keyword kind_type link_type \
movie_companies movie_info movie_info_idx movie_keyword movie_link \
name person_info role_type title; do
clickhouse client --database job --query \
"INSERT INTO ${table} SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/${table}.parquet', 'Parquet')"
done
```

Alternatively, load each table with an explicit `INSERT` statement.
Make sure to create the tables first using [`init_cloud.sql`](https://github.com/ClickHouse/ClickHouse/blob/master/tests/benchmarks/job/init_cloud.sql), then run the inserts against the `job` database:

```sql
INSERT INTO aka_name SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/aka_name.parquet', 'Parquet');
INSERT INTO aka_title SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/aka_title.parquet', 'Parquet');
INSERT INTO cast_info SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/cast_info.parquet', 'Parquet');
INSERT INTO char_name SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/char_name.parquet', 'Parquet');
INSERT INTO comp_cast_type SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/comp_cast_type.parquet', 'Parquet');
INSERT INTO company_name SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/company_name.parquet', 'Parquet');
INSERT INTO company_type SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/company_type.parquet', 'Parquet');
INSERT INTO complete_cast SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/complete_cast.parquet', 'Parquet');
INSERT INTO info_type SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/info_type.parquet', 'Parquet');
INSERT INTO keyword SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/keyword.parquet', 'Parquet');
INSERT INTO kind_type SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/kind_type.parquet', 'Parquet');
INSERT INTO link_type SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/link_type.parquet', 'Parquet');
INSERT INTO movie_companies SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/movie_companies.parquet', 'Parquet');
INSERT INTO movie_info SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/movie_info.parquet', 'Parquet');
INSERT INTO movie_info_idx SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/movie_info_idx.parquet', 'Parquet');
INSERT INTO movie_keyword SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/movie_keyword.parquet', 'Parquet');
INSERT INTO movie_link SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/movie_link.parquet', 'Parquet');
INSERT INTO name SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/name.parquet', 'Parquet');
INSERT INTO person_info SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/person_info.parquet', 'Parquet');
INSERT INTO role_type SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/role_type.parquet', 'Parquet');
INSERT INTO title SELECT * FROM s3('https://s3.eu-west-3.amazonaws.com/public-pme/join_bench/job/title.parquet', 'Parquet');
```

Detailed table sizes:

| Table           | Rows           | Size (compressed in ClickHouse) |
| --------------- | -------------- | ------------------------------- |
| aka_name | 901,343 | 31.86 MiB |
| aka_title | 361,472 | 14.32 MiB |
| cast_info | 36,244,344 | 296.25 MiB |
| char_name | 3,140,339 | 107.95 MiB |
| comp_cast_type | 4 | 132.00 B |
| company_name | 234,997 | 8.38 MiB |
| company_type | 4 | 162.00 B |
| complete_cast | 135,086 | 748.80 KiB |
| info_type | 113 | 1.25 KiB |
| keyword | 134,170 | 1.88 MiB |
| kind_type | 7 | 177.00 B |
| link_type | 18 | 284.00 B |
| movie_companies | 2,609,129 | 21.20 MiB |
| movie_info | 14,835,720 | 300.46 MiB |
| movie_info_idx | 1,380,035 | 8.01 MiB |
| movie_keyword | 4,523,930 | 21.06 MiB |
| movie_link | 29,997 | 178.21 KiB |
| name | 4,167,491 | 131.16 MiB |
| person_info | 2,963,664 | 154.12 MiB |
| role_type | 12 | 246.00 B |
| title | 2,528,312 | 78.04 MiB |
| **Total** | **74,190,187** | **1.15 GiB** |

(Compressed sizes in ClickHouse are taken from `system.tables.total_bytes` and based on the above table definitions.)
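
The row counts in the table can double as a load check. The sketch below is an illustration under stated assumptions: the helper names `expected_rows` and `check_rows` are hypothetical, the counts are copied from the table above, and the input is assumed to be `name<TAB>rows` lines.

```bash
# Expected row counts, copied from the table above.
expected_rows() {
  cat <<'EOF'
aka_name 901343
aka_title 361472
cast_info 36244344
char_name 3140339
comp_cast_type 4
company_name 234997
company_type 4
complete_cast 135086
info_type 113
keyword 134170
kind_type 7
link_type 18
movie_companies 2609129
movie_info 14835720
movie_info_idx 1380035
movie_keyword 4523930
movie_link 29997
name 4167491
person_info 2963664
role_type 12
title 2528312
EOF
}

# Reads "table<whitespace>rows" lines on stdin, reports every table whose
# count is missing or differs, and returns non-zero if any mismatch is found.
check_rows() {
  { expected_rows; echo '---'; cat; } | awk '
    $1 == "---" { actual = 1; next }      # separator between expected and actual
    !actual     { want[$1] = $2; next }   # first part: expected counts
                { got[$1] = $2 }          # second part: counts read from stdin
    END {
      for (t in want) {
        if (!(t in got))            { printf "%s: missing\n", t; bad = 1 }
        else if (got[t] != want[t]) { printf "%s: expected %s, got %s\n", t, want[t], got[t]; bad = 1 }
      }
      exit (bad ? 1 : 0)
    }'
}
```

A possible invocation is `clickhouse client --query "SELECT name, total_rows FROM system.tables WHERE database = 'job' FORMAT TSV" | check_rows`, which stays silent when every table matches.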

## Queries {#queries}

The 113 JOB queries are in the [`queries`](https://github.com/ClickHouse/ClickHouse/tree/master/tests/benchmarks/job/queries) directory of the ClickHouse repository.
The settings used to run them are in [`settings.json`](https://github.com/ClickHouse/ClickHouse/blob/master/tests/benchmarks/job/settings.json).
See the [README](https://github.com/ClickHouse/ClickHouse/blob/master/tests/benchmarks/job/README.md) for known issues and notes on specific queries.

The queries reference the tables by name, so run them against the `job` database (for example, with `clickhouse client --database job`).

An example query (`1a`):

```sql
SELECT
MIN(mc.note) AS production_note,
MIN(t.title) AS movie_title,
MIN(t.production_year) AS movie_year
FROM company_type AS ct, info_type AS it, movie_companies AS mc, movie_info_idx AS mi_idx, title AS t
WHERE (ct.kind = 'production companies') AND (it.info = 'top 250 rank') AND (mc.note NOT LIKE '%(as Metro-Goldwyn-Mayer Pictures)%') AND ((mc.note LIKE '%(co-production)%') OR (mc.note LIKE '%(presents)%')) AND (ct.id = mc.company_type_id) AND (t.id = mc.movie_id) AND (t.id = mi_idx.movie_id) AND (mc.movie_id = mi_idx.movie_id) AND (it.id = mi_idx.info_type_id);
```

## Preparing the data from the original CSV files {#preparing-from-csv}

The Parquet files above are derived from the original IMDb snapshot used by JOB, which is distributed as one CSV file per table (`aka_name.csv`, `title.csv`, ...).
These CSVs use PostgreSQL `COPY` semantics with `ESCAPE '\'`: a backslash escapes the quote character only inside a quoted field, while outside quotes a backslash is a literal character.
ClickHouse expects RFC 4180 CSV (doubled quotes, no backslash escaping), so the files must be re-encoded first.

[`convert_csv.py`](https://github.com/ClickHouse/ClickHouse/blob/master/tests/benchmarks/job/convert_csv.py) performs that re-encoding.
It reads the original CSV on stdin and writes standard CSV on stdout, doubling embedded quotes and preserving empty unquoted fields (which ClickHouse maps to `NULL` for `Nullable` columns).
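
To make the difference between the two encodings concrete, here is a rough sketch of the core rewrite rule. It is not the implementation of `convert_csv.py`: the function name `pg_csv_to_rfc4180` is hypothetical, the sketch assumes records without embedded newlines, and it assumes that inside a quoted field `\"` denotes a literal quote and `\\` a literal backslash.

```bash
# Sketch only: re-encode single-line records from PostgreSQL-style CSV
# (ESCAPE '\') to RFC 4180 CSV. Not a replacement for convert_csv.py.
pg_csv_to_rfc4180() {
  awk '{
    out = ""; in_quotes = 0; n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (in_quotes && c == "\\" && i < n) {
        nxt = substr($0, i + 1, 1)
        # Escaped quote inside a quoted field: emit a doubled quote.
        if (nxt == "\"") { out = out "\"\""; i++; continue }
        # Escaped backslash inside a quoted field: emit one backslash.
        if (nxt == "\\") { out = out "\\"; i++; continue }
      }
      # An unescaped quote opens or closes a quoted field.
      if (c == "\"") in_quotes = !in_quotes
      out = out c
    }
    print out
  }'
}

# Backslash-escaped quotes become doubled quotes; the empty field and the
# backslash outside quotes pass through unchanged.
printf '%s\n' '1,"He said \"hi\"",,C:\dir' | pg_csv_to_rfc4180
```

Fields that span several lines need a parser that tracks quoting across line breaks, which is one reason to use the provided script for the actual conversion.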

`init.sql` is the original, unmodified JOB schema, where columns are nullable unless declared `NOT NULL`. The tables must therefore be created with `data_type_default_nullable=1` (passed as `--data_type_default_nullable=1` below); otherwise, the nullable columns are created as non-nullable and loading rows that contain `NULL` values fails.

To build the tables from the original CSVs:

```bash
clickhouse client --query "CREATE DATABASE IF NOT EXISTS job"
clickhouse client --database job --data_type_default_nullable=1 --queries-file init.sql

for table in aka_name aka_title cast_info char_name comp_cast_type company_name \
company_type complete_cast info_type keyword kind_type link_type \
movie_companies movie_info movie_info_idx movie_keyword movie_link \
name person_info role_type title; do
python3 convert_csv.py < "${table}.csv" \
| clickhouse client --database job --query "INSERT INTO ${table} FORMAT CSV"
done
```

Once the tables are populated, they can be exported to Parquet for faster re-import later, e.g.
`clickhouse client --database job --query "SELECT * FROM title ORDER BY id FORMAT Parquet" > title.parquet`.
1 change: 1 addition & 0 deletions styles/ClickHouse/Headings.yml
@@ -92,6 +92,7 @@ exceptions:
- JSON
- JSON format settings
- JSON settings
- Join Order Benchmark
- Kafka
- Kafka Connect
- KMS