Sync/extraction tooling for the OpenAlex scholarly metadata snapshot. The dataset itself lives on HuggingFace (Git LFS via Xet storage).
| Path | Description |
|---|---|
| `sync/` | Python tooling: download from S3, extract relationship tables to Parquet, manage the snapshot |
| `openalex-snapshot/` | Git submodule: source data and extracted tables |
`python3 -m sync` runs from this directory (the repo root). The submodule must be initialised so that `openalex-snapshot/data/` exists.
```bash
git clone https://github.com/Mearman/OpenAlex.git
cd OpenAlex
git submodule update --init
pip install -r sync/requirements.txt
```

One idempotent command does everything: download sources from S3, extract Parquet, commit/push, and reconcile the HuggingFace dataset. There are no subcommands; re-running converges the local tree, git, and HF to the canonical state and resumes where it left off:
```bash
# Full sync (all entities)
python3 -m sync

# Limit to one entity
python3 -m sync --entity works

# Skip the HuggingFace upload (local extraction only)
python3 -m sync --no-upload

# Deep self-heal: content-verify sources and Parquet shards, re-fetching corruption
python3 -m sync --verify

# Split extraction across two machines
python3 -m sync --slice-index 0 --slice-total 2   # machine 1
python3 -m sync --slice-index 1 --slice-total 2   # machine 2
```

Source files are saved as `part_XXXX.jsonl.gz` (renamed from S3's `part_XXXX.gz` so HuggingFace's dataset viewer detects the format).
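To illustrate how a `--slice-index` / `--slice-total` pair could divide work between machines, here is a minimal sketch. It assumes a round-robin split over the sorted shard names; the README does not state which partitioning scheme the tool uses, and `shards_for_slice` is a hypothetical helper, not part of `sync`.

```python
def shards_for_slice(shards, slice_index, slice_total):
    """Return the shards one machine would handle under a round-robin split.

    Assumption: round-robin over sorted names. The real tool may partition
    differently.
    """
    if slice_total < 1 or not 0 <= slice_index < slice_total:
        raise ValueError("slice_index must be in [0, slice_total)")
    # Sort first so every machine derives the same ordering independently.
    ordered = sorted(shards)
    return ordered[slice_index::slice_total]


# Illustrative shard names only.
shards = [f"part_{i:04d}.jsonl.gz" for i in range(5)]
machine_1 = shards_for_slice(shards, 0, 2)
machine_2 = shards_for_slice(shards, 1, 2)
```

Whatever scheme is used, the useful property is the one shown here: the slices are disjoint and together cover every shard, so the machines never duplicate or skip work.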
The extractor derives each entity's schema by scanning the source data; there is no hardcoded field list. Scalar attributes (id, doi, title, language, publication year, type, FWCI, open-access and bibliographic metadata, …) are collected into a single main table per entity, and every list- or dict-valued field becomes its own relationship table. Each source shard produces one Parquet file per table.

The HuggingFace upload runs in the background and overlaps extraction: completed shards are pushed while later ones are still being written, so the upload adds little wall-clock time instead of running as a serial tail. Both the extracted Parquet tables and the `.jsonl.gz` source shards are reconciled through the HuggingFace API (`upload_large_folder`), so the upload never depends on the data root being a local git checkout; it works just as well against a plain folder (e.g. an external drive).

The data is never committed or pushed via local git. Doing so would stage the LFS-tracked files through the LFS clean filter, copying every blob into a local object cache that need not (and on a small system disk cannot) hold the full dataset. The prune step, which deletes remote data files that no longer exist locally, runs once at the end against the final set. Pass `--no-prune` to upload additively, or `--no-upload` to skip HuggingFace entirely. The remote git history of the data is created by the API commits; when the data root is a git checkout, the final step realigns the local refs to match (a fetch followed by `reset --mixed`, which touches no working-tree files).
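The overlap between extraction and upload described above can be pictured as a producer/consumer pair. The sketch below is not the tool's implementation: `run_pipeline`, `extract_shard`, and `upload_shard` are hypothetical names, and a plain queue with one background thread stands in for whatever mechanism `sync` uses around `upload_large_folder`.

```python
import queue
import threading


def run_pipeline(shards, extract_shard, upload_shard):
    """Extract shards in the foreground while a background thread uploads
    each finished one. Both callables are hypothetical stand-ins."""
    finished = queue.Queue()
    uploaded = []

    def uploader():
        while True:
            item = finished.get()
            if item is None:  # sentinel: extraction has finished
                break
            upload_shard(item)
            uploaded.append(item)

    worker = threading.Thread(target=uploader)
    worker.start()
    try:
        for shard in shards:
            # Hand each completed shard to the uploader immediately,
            # then carry on extracting the next one.
            finished.put(extract_shard(shard))
    finally:
        finished.put(None)
        worker.join()
    return uploaded


# Stand-in callables for illustration only; no network access happens here.
result = run_pipeline(
    ["part_0000", "part_0001"],
    extract_shard=lambda name: f"{name}.parquet",
    upload_shard=lambda path: None,
)
```

Because the uploader drains the queue while extraction continues, only the shards finished last still need pushing once extraction ends.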
A resource governor detects CPU and RAM at startup and splits worker budgets across the concurrent stages so they don't oversubscribe the machine. Extraction (CPU- and RAM-bound) is sized from available RAM and cores, with a small slice reserved for the overlapping upload (network-bound, so a few cores suffice). The final upload pass, which runs once extraction is done, gets the whole machine. `--workers N` overrides the extraction worker count, and the upload fills the remaining cores.
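As a rough sketch of the budgeting idea, the function below splits cores between extraction and the overlapping upload. Every number in it is an assumption made for illustration (4 GB of RAM per extraction worker, two cores reserved for upload), and `split_worker_budget` is a hypothetical name; the governor's actual thresholds are not documented here.

```python
def split_worker_budget(cpu_count, available_ram_gb, ram_per_worker_gb=4.0,
                        upload_reserve=2, workers_override=None):
    """Split cores between extraction and the overlapping upload.

    ram_per_worker_gb and upload_reserve are assumed values, not the
    tool's real settings.
    """
    cpu_count = max(1, cpu_count)
    if workers_override is not None:
        # Mirrors `--workers N`: fix extraction, upload takes what is left.
        extraction = max(1, min(workers_override, cpu_count))
        upload = max(1, cpu_count - extraction)
    else:
        # Reserve a few cores for upload but leave at least one for extraction.
        reserved = min(upload_reserve, max(0, cpu_count - 1))
        by_cpu = cpu_count - reserved
        by_ram = int(available_ram_gb // ram_per_worker_gb)
        extraction = max(1, min(by_cpu, by_ram))
        upload = max(1, reserved)
    # Once extraction is done, the final upload pass can use every core.
    return {"extraction": extraction, "upload": upload, "final_upload": cpu_count}


roomy = split_worker_budget(cpu_count=8, available_ram_gb=64.0)
ram_limited = split_worker_budget(cpu_count=8, available_ram_gb=8.0)
overridden = split_worker_budget(cpu_count=8, available_ram_gb=64.0, workers_override=5)
```

Taking the smaller of the CPU-derived and RAM-derived counts is what keeps a memory-hungry stage from oversubscribing a machine that has many cores but little RAM.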
```
data/{entity}/
  updated_date=YYYY-MM-DD/part_XXXX.jsonl.gz         # source data (from S3)
  main/
    {entity}__updated_date=...__part_XXXX.parquet    # scalar attributes, one row per entity
  {relationship_type}/
    {entity}__updated_date=...__part_XXXX.parquet    # one edge table per list/dict field
```
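The naming convention in this layout can be sketched in a few lines. Both helpers are hypothetical, and the sketch assumes that the elided `...` in the Parquet filename is the shard's `updated_date` value; the date and entity used are placeholders.

```python
from pathlib import PurePosixPath


def local_source_name(s3_name):
    """Map S3's part_XXXX.gz to the local part_XXXX.jsonl.gz name."""
    if s3_name.endswith(".jsonl.gz"):
        return s3_name  # already renamed
    if not s3_name.endswith(".gz"):
        raise ValueError(f"unexpected shard name: {s3_name}")
    return s3_name[: -len(".gz")] + ".jsonl.gz"


def parquet_path(entity, updated_date, part, table):
    """Build the Parquet path for one table of one source shard.

    Assumption: the filename embeds the shard's updated_date value.
    """
    filename = f"{entity}__updated_date={updated_date}__{part}.parquet"
    return PurePosixPath("data") / entity / table / filename


# Placeholder values for illustration.
source = local_source_name("part_0000.gz")
main = parquet_path("works", "2024-01-01", "part_0000", "main")
```

Embedding the entity, date, and part in each filename keeps every Parquet file traceable to the single source shard it came from.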
The schema is data-derived and committed to openalex.schema.json; re-scanning the data reproduces it deterministically, so a field's presence in the schema is decided by the data, not a hardcoded list. For example, works yields a main table (doi, title, language, year, type, FWCI, …) alongside relationship tables for abstracts, authorships, references, concepts, keywords, locations, and more.
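A minimal sketch of a data-derived schema, assuming the simplest possible rule: a field that ever holds a list or dict is a relationship table, and any other non-null field is a main-table column. `derive_schema` is a hypothetical function, the records are invented, and the real extractor may treat nulls, nested objects, and types differently.

```python
def derive_schema(records):
    """Classify fields by scanning records rather than using a fixed list."""
    scalars, relations = set(), set()
    for record in records:
        for field, value in record.items():
            if isinstance(value, (list, dict)):
                relations.add(field)
            elif value is not None:
                # Assumption: a null tells us nothing about the field's kind.
                scalars.add(field)
    # Sorting makes the result independent of record order, so a re-scan
    # of the same data reproduces the same schema.
    return {
        "main": sorted(scalars - relations),
        "relationships": sorted(relations),
    }


# Invented records for illustration only.
records = [
    {"id": "W1", "title": "A", "keywords": ["x"]},
    {"id": "W2", "doi": None, "authorships": [{"author": "A1"}]},
]
schema = derive_schema(records)
```

Here `doi` is absent from the result because it is null in every record scanned, which is one way that presence in the schema ends up decided by the data.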
| Property | Value |
|---|---|
| Host | https://huggingface.co/datasets/Mearman/OpenAlex |
| Format | JSONL source (`.jsonl.gz`) + Parquet tables (a main scalar-attribute table and relationship tables per entity) |
| License | CC0 (public domain) |
Entities covered: Works, Authors, Sources, Institutions, Publishers, Funders, Awards, Topics, Concepts, Fields, Subfields, Domains.