A Rust-native PostgreSQL extension for RDF, SPARQL, SHACL and OWL reasoning.
Treat Postgres as the storage + execution engine for your knowledge graph. Load Turtle, query via SPARQL, validate via SHACL, materialize inferences via OWL 2 RL — all addressable from any Postgres client.
pgRDF turns a single PostgreSQL instance into a complete semantic-web engine — dictionary-encoded hexastore storage, a SPARQL 1.1 query and update engine, a W3C-conformant SHACL Core validator, and an OWL 2 RL reasoner — with no sidecar triple store and no second system to operate. It grew, release over release, into the full SPARQL 1.1 surface (CONSTRUCT, DESCRIBE, property paths, aggregates, named graphs, the complete UPDATE algebra), genuine W3C SHACL Core conformance (25/25), and OWL 2 RL and RDFS materialisation — every release CI-built and signed with SLSA Build Provenance v1.
The everyday win is a compact semantic knowledge base that never leaves
PostgreSQL: load RDF, then reason over it, validate it, and query it in
place — composable semantic action chains, each a single function call, with
the same in-database ergonomics you already use for materialize and
validate. No sidecar triple store, no ETL, no second service to operate.
Scale is the ceiling, not the price of entry. The benchmarks push the limits
and teach us where they are — and each release improves on the last: a complete
8.2-billion-triple Wikidata truthy dump ingested into one instance, and the
full load → reason → query pipeline run end to end to a 112-million-quad
materialised LUBM-500 closure. But the typical deployment is a right-sized graph
you reason and validate in a single local container. See Benchmarks.
Everything below runs inside one PostgreSQL instance, addressable from any client — no sidecar store, no ETL.
SELECT / ASK over N-pattern basic graph patterns, lowered to SQL joins on a pinned, cross-product-proof plan.
- Filters: identity, boolean composition, term-type tests, REGEX, numeric & typed comparison
- Modifiers: DISTINCT, LIMIT / OFFSET, type-aware ORDER BY
- Patterns: multi-triple OPTIONAL, UNION, MINUS, VALUES, downstream BIND
- Aggregates: COUNT / SUM / AVG / MIN / MAX / GROUP_CONCAT / SAMPLE with GROUP BY / HAVING, including over UNION
- CONSTRUCT and DESCRIBE (W3C §16.4 Concise Bounded Description)
- Property paths: ^ + * ? |, with a materialised-closure no-CTE fast path and a depth guard
- Named graphs: GRAPH <iri> and GRAPH ?g, composed across OPTIONAL / UNION / MINUS
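To make "lowered to SQL joins" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `dict` and `quads` tables and the `lower_bgp` helper are hypothetical stand-ins for illustration only; they are not pgRDF's actual schema or planner.

```python
import sqlite3

def lower_bgp(patterns):
    """Lower triple patterns to a single SQL join over a toy quads(s, p, o) table.

    Terms starting with "?" are variables; anything else is a constant.
    Returns (sql, params, variable_names).
    """
    first_ref = {}  # variable -> first column that binds it
    tables, conditions, params = [], [], []
    for i, triple in enumerate(patterns):
        alias = f"q{i}"
        tables.append(f"quads AS {alias}")
        for column, term in zip("spo", triple):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                # Constant: compare against its dictionary id.
                conditions.append(f"{ref} = (SELECT id FROM dict WHERE term = ?)")
                params.append(term)
            elif term in first_ref:
                # Repeated variable: this equality is the join condition.
                conditions.append(f"{ref} = {first_ref[term]}")
            else:
                first_ref[term] = ref
    columns = ", ".join(
        f"(SELECT term FROM dict WHERE id = {ref}) AS {var[1:]}"
        for var, ref in first_ref.items()
    )
    sql = f"SELECT {columns} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params, [var[1:] for var in first_ref]

# Toy storage: a term dictionary plus id-encoded triples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
    CREATE TABLE quads (s INTEGER, p INTEGER, o INTEGER);
""")

def term_id(term):
    conn.execute("INSERT OR IGNORE INTO dict (term) VALUES (?)", (term,))
    return conn.execute("SELECT id FROM dict WHERE term = ?", (term,)).fetchone()[0]

FOAF = "http://xmlns.com/foaf/0.1/"
for s, p, o in [
    ("http://example.com/alice", FOAF + "name", "Alice"),
    ("http://example.com/alice", FOAF + "mbox", "mailto:a@x"),
    ("http://example.com/bob", FOAF + "name", "Bob"),
]:
    conn.execute("INSERT INTO quads VALUES (?, ?, ?)", (term_id(s), term_id(p), term_id(o)))

sql, params, names = lower_bgp([("?p", FOAF + "name", "?n"), ("?p", FOAF + "mbox", "?m")])
rows = [dict(zip(names, row)) for row in conn.execute(sql, params)]
print(rows)  # only alice has both a name and an mbox
```

The shared variable ?p turns into the condition `q1.s = q0.s`, so the two patterns are joined rather than cross-multiplied. A real planner also has to choose join order and indexes, which this sketch ignores.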
INSERT / DELETE DATA, INSERT / DELETE WHERE, DELETE+INSERT WHERE, WITH <iri> scoping, and the graph lifecycle algebra (DROP / CLEAR / CREATE GRAPH × DEFAULT / NAMED / ALL).
Dictionary-encoded terms over a LIST-partitioned hexastore (SPO / POS / OSP covering indexes).
- Ingest: Turtle, TriG, N-Quads (parse_turtle / parse_trig / parse_nquads), plus the parallel bulk loader (load_turtle(…, bulk_load => true), 2.3–3.5× on a fresh load, new in v0.6.2)
- Per-graph lifecycle: drop / clear / copy / move_graph, with BIGINT and IRI overloads
- Performance: cross-backend shared-memory dictionary cache, prepared-plan cache, prepared bulk-INSERT
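As a rough mental model of dictionary encoding plus SPO / POS / OSP orderings, the sketch below keeps everything in Python dictionaries. `TinyStore` and its methods are hypothetical names for illustration; the real storage lives in partitioned PostgreSQL tables and indexes.

```python
from collections import defaultdict

class TinyStore:
    """Toy dictionary-encoded triple store with three index orderings."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def encode(self, term):
        """Return the integer id for a term, assigning one on first sight."""
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def add(self, s, p, o):
        """Store one triple as ids under all three orderings."""
        s_id, p_id, o_id = self.encode(s), self.encode(p), self.encode(o)
        self.spo[s_id][p_id].add(o_id)
        self.pos[p_id][o_id].add(s_id)
        self.osp[o_id][s_id].add(p_id)

    def subjects(self, p, o):
        """Answer the pattern (?s, p, o) from the POS ordering."""
        p_id, o_id = self.term_to_id.get(p), self.term_to_id.get(o)
        if p_id is None or o_id is None:
            return []
        ids = self.pos.get(p_id, {}).get(o_id, set())
        return sorted(self.id_to_term[i] for i in ids)

store = TinyStore()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:carol", "foaf:knows", "ex:bob")
store.add("ex:alice", "foaf:name", "Alice")
print(store.subjects("foaf:knows", "ex:bob"))  # ['ex:alice', 'ex:carol']
print(len(store.id_to_term))                   # 6 distinct terms
```

Each term string is stored once and every triple is three small integers, which is the reason for encoding in the first place. The lookup picks the ordering whose leading positions are already bound.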
pgrdf.materialize(graph, profile) forward-chains the closure (owl-rl or rdfs), refreshes planner statistics automatically so queries stay fast on the enlarged graph, and is idempotent across calls.
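Forward chaining to a closure can be sketched in a few lines of Python. The `infer` helper below is hypothetical and applies only two RDFS rules (type propagation along rdfs:subClassOf, and subclass transitivity); it illustrates the fixpoint idea and is not the reasoner pgRDF uses.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def infer(base):
    """Return the inferred triples (closure minus base) for two RDFS rules."""
    closure = set(base)
    while True:
        # (x, p, C) and (C subClassOf D) entail (x, p, D) for p in {type, subClassOf}.
        derived = {
            (s, p, parent)
            for (s, p, o) in closure
            if p in (RDF_TYPE, SUBCLASS_OF)
            for (child, p2, parent) in closure
            if p2 == SUBCLASS_OF and child == o
        }
        new = derived - closure
        if not new:
            # Fixpoint reached: nothing further can be derived.
            return closure - set(base)
        closure |= new

base = {
    ("ex:Engineer", SUBCLASS_OF, "ex:Person"),
    ("ex:Person", SUBCLASS_OF, "ex:Agent"),
    ("ex:alice", RDF_TYPE, "ex:Engineer"),
}
inferred = infer(base)
print(sorted(inferred))
```

Because the result is always recomputed from the base triples, calling `infer` again gives the same set, which mirrors the idempotence described above. A full OWL 2 RL profile has many more rules, so real inferred counts will be larger than this toy's three.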
pgrdf.validate(data, shapes, mode) returns a real sh:ValidationReport as JSONB — genuine W3C SHACL Core conformance (25 / 25).
Honest scope. A few surfaces are gated on upstream crates rather than being defects: RDF 1.2 triple terms + crates.io publish (E-011 · gtfierro/reasonable#50) and SHACL-SPARQL constraint execution (E-012 · rudof); the mode => 'sparql' surface ships and is honest about this gate. Forward backlog: SPEC.pgRDF.LLD.v0.6-FUTURE.
| PostgreSQL | 14 · 15 · 16 · 17 (PG 18 deferred — pgrx 0.16 pin; ERRATA E-006) |
| Install | OCI — oras pull ghcr.io/styk-tv/pgrdf-bundle:0.6.17 (public, zero-cred; every digest SLSA-attested, verify with gh attestation verify oci://ghcr.io/styk-tv/pgrdf-bundle:<tag> --repo styk-tv/pgRDF) · tarballs (pg14–17 × amd64/arm64) · PGXN — pgxn install pgrdf. See INSTALL.md. |
| Current release | v0.6.17 — LATEST.md is authoritative at audit time |
| Docs | pgrdf.styk.tv — full v0.6.17 guide: the four pillars plus scale, process, and roadmap |
| Repo | styk-tv/pgRDF |
The four pillars compose into semantic action chains — not just import /
store / retrieve, but load → reason → validate → query, each step a single
function call, all inside one PostgreSQL session:
-- One-time install
CREATE EXTENSION pgrdf;
-- Load any Turtle file from the server-side filesystem
SELECT pgrdf.load_turtle('/fixtures/ontologies/foaf.ttl', 100);
-- → 631
-- See structured ingest stats (timing, cache hits, batches)
SELECT pgrdf.load_turtle_verbose('/fixtures/ontologies/prov.ttl', 200, 'http://www.w3.org/ns/prov#');
-- → {"triples": 1789, "dict_cache_hits": 4612, "dict_db_calls": 783, "quad_batches": 2, "elapsed_ms": 142.7}
-- Manage per-graph LIST partitions for cheap whole-graph drops
SELECT pgrdf.add_graph(42);
SELECT pgrdf.count_quads(42);
-- Inspect the dictionary directly
SELECT * FROM pgrdf._pgrdf_dictionary WHERE term_type = 1 LIMIT 5;

-- Multi-pattern BGP, shared variables become joins
SELECT * FROM pgrdf.sparql(
'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?p ?n ?m
WHERE { ?p foaf:name ?n .
?p foaf:mbox ?m }'
);
-- → {"p": "http://example.com/alice", "n": "Alice", "m": "mailto:a@x"}
-- FILTER over the BGP — identity, boolean composition, term-type tests
SELECT * FROM pgrdf.sparql(
'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?o
WHERE { ?s ?p ?o FILTER(isIRI(?o) && ?p = foaf:knows) }'
);
-- Numeric ordering + REGEX in a single query
SELECT * FROM pgrdf.sparql(
'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?n
WHERE { ?s foaf:name ?n .
?s <http://example.com/age> ?age
FILTER(?age >= 30 && REGEX(?n, "^A", "i")) }'
);
-- OPTIONAL — mbox stays NULL when the person has no foaf:mbox
SELECT * FROM pgrdf.sparql(
'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?n ?m
WHERE { ?s foaf:name ?n
OPTIONAL { ?s foaf:mbox ?m } }'
);
-- → {"s": "http://example.com/alice", "n": "Alice", "m": "mailto:a@x"}
-- → {"s": "http://example.com/bob", "n": "Bob", "m": null}
-- UNION — either branch contributes solutions; unbound vars come as null
SELECT * FROM pgrdf.sparql(
'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?n ?m
WHERE { { ?s foaf:name ?n }
UNION
{ ?s foaf:mbox ?m } }'
);
-- Aggregates with GROUP BY — count of triples per predicate
SELECT * FROM pgrdf.sparql(
'SELECT ?p (COUNT(?o) AS ?n)
WHERE { ?s ?p ?o }
GROUP BY ?p ORDER BY DESC(?n)'
);
-- → {"p": "http://xmlns.com/foaf/0.1/name", "n": "4"}
-- Named-graph SPARQL — GRAPH ?g binds the graph IRI per match
SELECT pgrdf.add_graph(101::bigint, 'http://example.org/g1');
SELECT pgrdf.add_graph(102::bigint, 'http://example.org/g2');
SELECT * FROM pgrdf.sparql(
'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?g (COUNT(*) AS ?n)
WHERE { GRAPH ?g { ?s foaf:name ?name } }
GROUP BY ?g ORDER BY ?g'
);
-- → {"g": "http://example.org/g1", "n": "3"}
-- → {"g": "http://example.org/g2", "n": "2"}
-- Inspect the parsed shape without executing
SELECT pgrdf.sparql_parse('SELECT ?s WHERE { ?s ?p ?o OPTIONAL { ?s <http://x/n> ?n } }');
-- → {"form": "SELECT", ..., "unsupported_algebra": ["LeftJoin (OPTIONAL)"]}

-- Load an ontology + some assertions
SELECT pgrdf.add_graph(100);
SELECT pgrdf.parse_turtle('
@prefix ex: <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Engineer rdfs:subClassOf ex:Person .
ex:Person rdfs:subClassOf ex:Agent .
ex:alice rdf:type ex:Engineer .
', 100);
-- Materialize OWL 2 RL entailments. Idempotent — call as often as
-- you like; the prior is_inferred=TRUE rows are dropped first.
SELECT pgrdf.materialize(100);
-- → {"base_triples": 3, "inferred_triples_written": 11, ...}
-- The 2-hop entailment is now in the table:
SELECT * FROM pgrdf.sparql(
'PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://example.com/>
SELECT ?c WHERE { ex:alice rdf:type ?c }'
);
-- → {"c": "http://example.com/Engineer"} ← base
-- → {"c": "http://example.com/Person"} ← inferred
-- → {"c": "http://example.com/Agent"} ← inferred

See guide/03-querying.md for the full
SELECT/ASK surface (BGPs with N patterns, FILTER expressions,
solution modifiers, OPTIONAL / UNION / MINUS, aggregates with
HAVING, BIND for projection, combining with regular SQL). For
operator-facing observability — pgrdf.stats(),
pgrdf.shmem_reset(), pgrdf.plan_cache_clear() — see
docs/02-storage.md.
These runs push the limits and teach us at scale — they map the ceiling and
drive each release's gains, but they are not the typical deployment. Most
users want a compact semantic knowledge base, operational directly inside the
database — load, reason, validate, and query a right-sized graph in place,
exactly as you already do with materialize and validate, with no separate
service. Two complementary proofs frame the envelope: raw ingest at scale
(the full Wikidata dump, on a server) and the full semantic pipeline (load →
reason → query across the LUBM ladder, down to a single local container).
The native staged loader pgrdf.load_turtle_staged_run loads the
COMPLETE Wikidata truthy N-Triples dump — 8,199,708,346 triples (0
dropped) — into a single PostgreSQL instance: dictionary-encoded
(1,801,847,593 distinct terms), full SPO/POS/OSP hexastore, ~2.0 TB
on disk (heap 729 GB + indexes 1448 GB). It runs a native multi-backend
background-worker pipeline, committing per phase (parse → UNLOGGED
staging → parallel hash-aggregate dedup → resolve → concurrent index) so a
failure leaves a resume point instead of rolling back the whole load.
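The commit-per-phase idea can be sketched as follows. The phase names come from the description above, but `run_pipeline`, the JSON checkpoint file, and the simulated failure are all hypothetical; the real loader commits inside PostgreSQL rather than to a file.

```python
import json
import os
import tempfile

PHASES = ["parse", "stage", "dedup", "resolve", "index"]

def run_pipeline(checkpoint_path, phase_fns):
    """Run phases in order, recording each finished phase so a rerun can resume."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    for name in PHASES:
        if name in done:
            continue  # committed by an earlier attempt, skip it
        phase_fns[name]()  # may raise; earlier phases stay committed
        done.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)  # stands in for the per-phase commit
    return done

calls = []

def record(name):
    return lambda: calls.append(name)

def flaky_resolve():
    calls.append("resolve")
    if calls.count("resolve") == 1:
        raise RuntimeError("simulated failure in resolve")

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "load.checkpoint.json")
    fns = {name: record(name) for name in PHASES}
    fns["resolve"] = flaky_resolve
    try:
        run_pipeline(checkpoint, fns)       # first attempt fails in resolve
    except RuntimeError:
        pass
    finished = run_pipeline(checkpoint, fns)  # second attempt resumes there

print(calls)
```

On the second attempt only resolve and index run; parse, stage, and dedup are not repeated. That is the practical benefit of a resume point over a single all-or-nothing transaction.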
| host | cores / RAM | engine | ingest | rate |
|---|---|---|---|---|
| Azure E128ads_v7 | 128 vCPU / 1 TiB | v0.6.14 | 4 h 53 m | 466 K triples/s |
| Azure E64ads_v7 | 64 vCPU / 503 GiB · 3.4 TB disk | v0.6.14 | ~10.3 h | ~221 K triples/s |
The 128-core run is the published flagship at 466 K triples/s (per phase:
STAGE 13.8 m · DICT 1 h 51 m · RESOLVE 2 h 00 m (index) · INDEX 31.9 m). That is
37 % faster than the v0.6.13 all-hash baseline (6 h 41 m / 340.7 K); the gain
comes from the T3 parallel STAGE COPY (13.8 m vs 1 h 41 m, 7.3× on 32 workers)
and the concurrent index build (31.9 m vs 1 h 43 m). The 64-core run proves the
same full load completes out of the box on half the cores and a 3.4 TB disk:
the v0.6.14 loader self-tunes work_mem and parallelism to the host, and adds a
tunable resolve strategy (index|hash|auto, default index), temp-spill routing,
and parallel STAGE COPY. As a result it finishes with no ENOSPC where the old
all-hash resolve would have spilled multiple terabytes.
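The headline figures can be cross-checked with simple arithmetic. The sketch below uses only the triple count and the rounded durations quoted above, so small differences from the published rates are expected.

```python
TRIPLES = 8_199_708_346  # complete Wikidata truthy dump, as quoted above

def rate(hours, minutes):
    """Triples per second for a run of the given wall-clock duration."""
    return TRIPLES / (hours * 3600 + minutes * 60)

flagship = rate(4, 53)   # v0.6.14 on 128 vCPU
baseline = rate(6, 41)   # v0.6.13 all-hash baseline

print(round(flagship / 1000))                   # 466 (K triples/s)
print(round((flagship / baseline - 1) * 100))   # 37 (% higher throughput)

# Per-phase speedups, in minutes: STAGE 1 h 41 m -> 13.8 m, INDEX 1 h 43 m -> 31.9 m.
stage_speedup = 101 / 13.8
index_speedup = 103 / 31.9
print(round(stage_speedup, 1))                  # 7.3
print(round(index_speedup, 1))                  # 3.2
```

The 37 % figure is a throughput ratio (466 K vs about 341 K triples/s), which is equivalent to the elapsed time dropping from 6 h 41 m to 4 h 53 m.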
This is raw ingest at scale — it does NOT include reasoning or
materialization (truthy statements are already-asserted direct claims,
nothing to infer). For the full load → reason → query pipeline, see the
LUBM benchmark below.
This is the proof of the full semantic pipeline — the reasoning/materialization step the raw Wikidata ingest deliberately omits.
pgRDF completes the full LUBM-100 benchmark — the standard, generator-verified benchmark for RDF stores (Lehigh University Benchmark, 100 universities, 14 reference queries) — in a local container with zero database tuning:
| Measured | Result |
|---|---|
| Load 13,879,970 triples (Turtle) | 3 min 29 s |
| OWL 2 RL reasoning → 22.5M facts, statistics refreshed automatically | 4 min 54 s |
| All 14 queries on the loaded graph | each ≤ 3 s |
| All 14 queries after reasoning | each ≤ 5 s |
Environment: a local stock postgres:17.4-bookworm Docker container
(8 vCPU / 32 GiB), default PostgreSQL configuration — no manual indexes, no
ANALYZE, no planner hints, no extension settings. That contrast is the point:
the load → reason → query semantic pipeline — OWL 2 RL materialisation plus
all 14 LUBM queries, each result correctness-gated — completes in a single
local container on everyday hardware, a categorically different proof from the
raw 8.2-billion-triple Wikidata ingest above (a pure load test that scales out
to a 128-core server). Reasoning and validation run comfortably in a local
container; only billion-scale ingest needs the big box. Full per-query tables
and methodology:
tests/perf/lubm/RESULTS.m4-join-order.md.
The same pipeline (load → index → OWL-RL materialise → SPARQL) was run end to
end across the full LUBM ladder on a dedicated 32-vCPU / 256 GiB box (Azure
Standard_E32as_v7, native PostgreSQL 17), with every result correctness-gated
against the known LUBM answer counts:
| LUBM-N | base triples | ingest | index | materialize (OWL-RL) * | total quads (closure) |
|---|---|---|---|---|---|
| 10 | 1.32M | 3s | 1s | 15s | 2.13M |
| 100 | 13.9M | 34s | 8s | 4m 37s | 22.46M |
| 250 | 34.5M | 105s | 15s | 10m 9s | 55.88M |
| 500 | 69.1M | 192s | 47s | ~43m | 111.83M |
LUBM-500 builds a full materialised closure of 111.83 million quads on a single box (peak 146 / 256 GiB RAM) — load, reason, and query in one PostgreSQL instance, no sharding.
* OWL-RL materialisation is the dominant cost at scale and is single-thread-bound upstream — tracked in #1 (proposal: gtfierro/reasonable#57).
The ingest column above rides a loader family that evolved release over
release. The parallel bulk loader (v0.6.2,
pgrdf.load_turtle(…, bulk_load => true)) parses across all cores and
resolves triple→id in memory, delivering 2.3–3.5× over the serial path
(LUBM-250 240s → 105s, LUBM-500 667s → 192s) with per-triple cost staying
near-linear where the serial path was super-linear. For datasets beyond RAM,
pgrdf.load_turtle_streaming (v0.6.8) reads the file in bounded windows
(peak memory is one window plus the dictionary). For the largest loads,
pgrdf.load_turtle_staged_run (v0.6.11) drives the native, commit-per-phase
staged pipeline used for the Wikidata-scale run above.
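Reading in bounded windows can be pictured with a small generator. This is a sketch under a simplifying assumption: the input is line-oriented (N-Triples style, one statement per line), whereas Turtle statements can span lines. `read_windows` and `max_chars` are hypothetical names, not part of the pgRDF API.

```python
import io

def read_windows(stream, max_chars):
    """Yield lists of whole lines, each list holding at most max_chars characters.

    A single line longer than max_chars is still yielded on its own.
    """
    window, size = [], 0
    for line in stream:
        if window and size + len(line) > max_chars:
            yield window
            window, size = [], 0
        window.append(line)
        size += len(line)
    if window:
        yield window

lines = [
    f"<http://e.com/a{i}> <http://e.com/p> <http://e.com/b{i}> .\n"
    for i in range(5)
]
stream = io.StringIO("".join(lines))

# Size the window so that exactly two of these equal-length lines fit.
windows = list(read_windows(stream, max_chars=2 * len(lines[0])))
print([len(w) for w in windows])  # [2, 2, 1]
```

Only one window is held in memory at a time, so peak usage depends on the window size rather than the file size. In the description above the dictionary is the other component that stays resident.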
Full walkthrough lives under guide/. Five-minute path:
# 1. Boot stock postgres:17.4 with the extension files bind-mounted
just build-ext # builds pgrdf.so/.control/.sql in a Linux container
just compose-up # podman compose up -d
just psql # opens a psql shell to the pgrdf database
# 2. Inside psql
pgrdf=# CREATE EXTENSION pgrdf;
pgrdf=# SELECT pgrdf.version();
-- → 0.6.17 (whatever LATEST.md currently advertises)
pgrdf=# SELECT pgrdf.parse_turtle('@prefix ex: <http://e.com/> . ex:a ex:p ex:b .', 1);
-- → 1

pgRDF MUST be in shared_preload_libraries for _PG_init() to run in the
postmaster context. Without it, the extension's shared-memory atomics (dict
cache + plan-cache stats) are never registered, and the first call to any
pgRDF function panics with PgAtomic was not initialized.
# postgresql.conf
shared_preload_libraries = 'pgrdf' # pgRDF alone
# or:
shared_preload_libraries = 'pgrdf,pgck' # if pgCK is also installed
# (order matters: pgrdf first)

A server restart (not just a reload) is required after editing this, because preload happens at postmaster startup. Verify after restart:
SHOW shared_preload_libraries; -- must contain 'pgrdf'
SELECT pgrdf.parse_turtle(
'PREFIX ex: <http://example.org/> ex:t a ex:T .', 1::bigint, 'http://example.org/');
-- returns a row count, not a panic

The just compose-up Quickstart above bakes this into the bundled image;
only own-Postgres installs need to edit postgresql.conf manually.
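For own-Postgres installs, a quick offline check of the config file can catch a missing or misordered entry before the restart. The helper below is a hypothetical sketch: it handles only the simple `name = 'value'` form and does not follow include directives or ALTER SYSTEM overrides, so SHOW shared_preload_libraries remains the authoritative check.

```python
def preload_libraries(conf_text):
    """Return the libraries listed in the last shared_preload_libraries setting."""
    libs = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("shared_preload_libraries"):
            continue
        _, _, value = line.partition("=")
        value = value.strip().strip("'\"")
        # A later setting replaces an earlier one, as PostgreSQL does.
        libs = [item.strip() for item in value.split(",") if item.strip()]
    return libs

def pgrdf_is_first(conf_text):
    """True when pgrdf is present and listed before any other library."""
    return preload_libraries(conf_text)[:1] == ["pgrdf"]

good = """
# postgresql.conf
shared_preload_libraries = 'pgrdf,pgck'   # pgrdf first
"""
bad = "shared_preload_libraries = 'pgck,pgrdf'\n"

print(preload_libraries(good))  # ['pgrdf', 'pgck']
print(pgrdf_is_first(good))     # True
print(pgrdf_is_first(bad))      # False
```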
Want to integrate from your application?
- Python: guide/clients/python.md (psycopg + asyncpg, plus a sketch of using pgRDF as an rdflib backend)
- Rust: guide/clients/rust.md (tokio-postgres and sqlx examples)
- Node.js / TypeScript: guide/clients/typescript.md (pg, postgres.js, pg-cursor streaming, typed bindings)
- Go: guide/clients/go.md (pgx v5, pgxpool, bulk-ingest pattern, sqlc tie-in)
Two parallel doc tracks:
Use documentation — guide/
For people running pgRDF in their applications.
- 00 — Introduction
- 01 — Install
- 02 — Loading RDF
- 03 — Querying with SPARQL
- Clients › Python
- Clients › Rust
- Clients › Node.js / TypeScript
- Clients › Go
Engineering / build plan — docs/
For people working on pgRDF itself.
- 01 — Architecture
- 02 — Storage
- 03 — Query
- 04 — Inference
- 05 — Validation
- 06 — Installation (spec walkthrough)
- 07 — Development
- 08 — Testing
- 09 — Release
- 10 — Roadmap
- SPEC.pgRDF.LLD.v0.5.md: current authoritative low-level design (supersedes v0.4)
- SPEC.pgRDF.LLD.v0.6-FUTURE.md: forward backlog (executor.rs core-BGP carve, heap_multi_insert phase B, real SHACL-SPARQL engine, federated SERVICE, incremental materialisation, RDF 1.2)
- SPEC.pgRDF.LLD.v0.3.md: historical (§4.1/§4.2/§4.3 internals still referenced)
- SPEC.pgRDF.INSTALL.v0.2.md: runtime install on stock PG containers
- ERRATA.v0.5.md / ERRATA.v0.4.md / ERRATA.v0.2.md: corrections + documented upstream gates discovered during implementation
| Layer | What it gates | Run |
|---|---|---|
| pgrx integration | UDF correctness inside a managed PG | just test |
| pg_regress-style | UDF correctness over the wire to compose Postgres | just test-regression |
| Artifact parity | Mounted extension bytes match a fresh build and the live container | just test-artifact-parity |
| W3C-shape SPARQL | Per-test data.ttl + query.rq vs expected.jsonl | just test-w3c |
| LUBM-shape | LUBM-style correctness gates against a hand-authored fixture | just test-lubm |
| Ontology smoke | Real-world Turtle parses cleanly | tests/perf/smoke-ontologies.sh |
| Narrow bar | just test + just test-regression (back-compat shape) | just test-all |
| Compose-based bar | regression + W3C-shape + LUBM-shape | just test-conformance |
| Full bar | pgrx integration + test-conformance — the broadest sweep | just test-everything |
| Cold-compose smoke | Wipe compose, rebuild, re-up, run test-conformance | just smoke-cold |
just test-everything is the comprehensive entry point; just smoke-cold is the cold-compose verification (it now includes
artifact-parity proof after rebuild, before the compose-based test
bar). Use it after touching anything in compose/, fixtures/, or
the test SQL fixtures.
Current bar — 294 pgrx + 93 pg_regress + 51 W3C-sparql + 25 W3C SHACL Core + 3 LUBM-shape green across the full pgrx PG 14-17 matrix and the compose-based regression runtime (PG 17). Covers:
- Storage CRUD + Turtle / TriG / N-Quads ingest.
- The full SPARQL 1.1 SELECT/ASK/CONSTRUCT/DESCRIBE surface (type-aware ORDER BY, multi-triple OPTIONAL, UNION, MINUS, VALUES, downstream BIND, aggregates incl. over UNION, HAVING, property paths).
- SPARQL UPDATE (INSERT/DELETE DATA + WHERE, DELETE+INSERT, WITH scoping, lifecycle algebra).
- Storage performance (shmem dict cache, prepared-plan cache, prepared bulk-INSERT).
- OWL 2 RL + RDFS inference (pgrdf.materialize, owl-rl / rdfs profiles) + the materialize → SPARQL round-trip.
- Genuine W3C SHACL Core validation (pgrdf.validate): 25/25 SHACL Core conformance, emitting a W3C sh:ValidationReport JSONB; mode => 'sparql' shipped + honest, upstream-gated (ERRATA E-012).
- Named-graph surface (LLD v0.4 §3): _pgrdf_graphs system table + pg_extension_config_dump registration for pg_dump round-trip; the five-UDF surface (add_graph(id) / add_graph(iri) / add_graph(id, iri) / graph_id(iri) / graph_iri(id)); SPARQL GRAPH <iri> literal + GRAPH ?g variable forms with per-pattern scope composition over OPTIONAL / UNION / MINUS. Pg_regress fixtures 72-79 + 87, pgrx tests in src/storage/graphs.rs + src/query/executor.rs, W3C-shape fixtures 24-graph-named-iri / 25-graph-var-projection / 26-graph-var-groupby, and the tests/regression/scripts/pg-dump-roundtrip.sh shell-driven end-to-end round-trip.
- Operator surface (pgrdf.stats() JSONB shape contract).
- 7 negative regression signals locking the error-message contract for unsupported SPARQL shapes (80-unsupported-shapes.sql).
- Error-path signals locking the stable error-prefix UDFs emit on invalid input (81-error-paths.sql); first lock-in: load_turtle: failed to open on a missing path.
- Edge-case correctness signals (62-materialize-empty.sql → forward): pgrdf.materialize() on an empty graph returns base_triples = 0, a non-negative inferred count, and stays idempotent across two calls.
External smoke covers 24 well-known ontologies → 17,134 triples
(W3C, Apache Jena, ValueFlows, ConceptKernel v3.7); runs via
tests/perf/smoke-ontologies.sh. Per-ontology triple counts are
locked in tests/perf/smoke-ontologies.expected.tsv;
tests/perf/smoke-ontologies.sh --check re-runs the smoke and
diffs against the lock-file (not gated in CI yet — the fetched
payloads are gitignored). Workflow.ttl held out due to a non-RFC
IRI in the source — see
ERRATA E-007 / TEST.ONTOLOGY-SET.md.
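The lock-file check amounts to comparing fresh per-ontology counts against recorded ones. The sketch below assumes a two-column, tab-separated layout (name, triple count); that layout and the helper names are assumptions for illustration, since the actual format of smoke-ontologies.expected.tsv is not shown here. The two sample counts are the foaf.ttl and prov.ttl results from the load examples earlier.

```python
def parse_counts(tsv_text):
    """Parse 'name<TAB>count' lines into a dict, skipping blanks and # comments."""
    counts = {}
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        name, count = line.split("\t")
        counts[name] = int(count)
    return counts

def diff_counts(expected, actual):
    """Return (name, expected, actual) for every ontology whose count differs."""
    return [
        (name, expected.get(name), actual.get(name))
        for name in sorted(set(expected) | set(actual))
        if expected.get(name) != actual.get(name)
    ]

expected = parse_counts("foaf.ttl\t631\nprov.ttl\t1789\n")
matching = {"foaf.ttl": 631, "prov.ttl": 1789}
drifted = {"foaf.ttl": 631, "prov.ttl": 1790}

print(diff_counts(expected, matching))  # []
print(diff_counts(expected, drifted))   # [('prov.ttl', 1789, 1790)]
```

An empty diff means the smoke run still matches the lock-file; any entry points at the ontology whose parse result changed, or which is missing on one side.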
Copyright 2026 Peter Styk. Licensed under the MIT License — see LICENSE for the canonical attribution.
Project home: https://github.com/styk-tv/pgRDF.
