OpenAlex Research Data

Sync/extraction tooling for the OpenAlex scholarly metadata snapshot. The dataset itself lives on HuggingFace (Git LFS via Xet storage).

What's here

Path	Description
`sync/`	Python tooling — download from S3, extract relationship tables to Parquet, manage the snapshot
`openalex-snapshot/`	Git submodule — source data and extracted tables

Quick Start

python3 -m sync runs from this directory (the repo root). The submodule must be initialised so openalex-snapshot/data/ exists.

git clone https://github.com/Mearman/OpenAlex.git
cd OpenAlex
git submodule update --init

Install dependencies

pip install -r sync/requirements.txt

Run the sync

One idempotent command does everything — download sources from S3, extract Parquet, commit/push, and reconcile the HuggingFace dataset. There are no subcommands; re-running converges the local tree, git, and HF to the canonical state and resumes where it left off:

# Full sync (all entities)
python3 -m sync

# Limit to one entity
python3 -m sync --entity works

# Skip the HuggingFace upload (local extraction only)
python3 -m sync --no-upload

# Deep self-heal: content-verify sources and Parquet shards, re-fetching corruption
python3 -m sync --verify

# Split extraction across two machines
python3 -m sync --slice-index 0 --slice-total 2   # machine 1
python3 -m sync --slice-index 1 --slice-total 2   # machine 2

Source files are saved as part_XXXX.jsonl.gz (renamed from S3's part_XXXX.gz so HuggingFace's dataset viewer detects the format).

The extractor derives each entity's schema by scanning the source data — there is no hardcoded field list. Scalar attributes (id, doi, title, language, publication year, type, FWCI, open-access and bibliographic metadata, …) are collected into a single main table per entity; every list- or dict-valued field becomes its own relationship table. Each source shard produces one Parquet file per table. The HuggingFace upload runs in the background, overlapping extraction — completed shards are pushed while later ones are still being written, so it adds little to the wall-clock rather than running as a serial tail. Both the extracted Parquet tables and the .jsonl.gz source shards are reconciled through the HuggingFace API (upload_large_folder), so the upload never depends on the data root being a local git checkout — it works just as well against a plain folder (e.g. an external drive). The data is never committed or pushed via local git: the LFS-tracked files would otherwise be staged through the LFS clean filter, copying every blob into a local object cache that need not (and on a small system disk cannot) hold the full dataset. The prune (deleting remote data files that no longer exist locally) runs once at the end against the final set (--no-prune to upload additively, --no-upload to skip HuggingFace entirely). The remote git history of the data is created by the API commits; when the data root is a git checkout, the final step realigns the local refs to match (a fetch + reset --mixed, which touches no working-tree files).

A resource governor detects CPU and RAM at startup and splits worker budgets across the concurrent stages so they don't oversubscribe the machine: extraction (CPU + RAM bound) is sized from available RAM and cores while reserving a small slice for the overlapping upload (network bound, so a few cores suffice), and the final upload pass — once extraction is done — gets the whole machine. --workers N overrides the extraction count and the upload fills the remaining cores.

Entity layout

data/{entity}/
  updated_date=YYYY-MM-DD/part_XXXX.jsonl.gz       # source data (from S3)
  main/
    {entity}__updated_date=...__part_XXXX.parquet   # scalar attributes, one row per entity
  {relationship_type}/
    {entity}__updated_date=...__part_XXXX.parquet   # one edge table per list/dict field

The schema is data-derived and committed to openalex.schema.json; re-scanning the data reproduces it deterministically, so a field's presence in the schema is decided by the data, not a hardcoded list. For example, works yields a main table (doi, title, language, year, type, FWCI, …) alongside relationship tables for abstracts, authorships, references, concepts, keywords, locations, and more.

Dataset


Host	https://huggingface.co/datasets/Mearman/OpenAlex
Format	JSONL source (`.jsonl.gz`) + Parquet tables (a `main` scalar-attribute table and relationship tables per entity)
License	CC0 (public domain)

Entities

Works, Authors, Sources, Institutions, Publishers, Funders, Awards, Topics, Concepts, Fields, Subfields, Domains.

Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
.githooks		.githooks
.github		.github
bin		bin
openalex-snapshot @ 48bf3a7		openalex-snapshot @ 48bf3a7
sync		sync
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
CITATION.cff		CITATION.cff
README.md		README.md
openalex.schema.json		openalex.schema.json
verify_schema.py		verify_schema.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OpenAlex Research Data

What's here

Quick Start

Install dependencies

Run the sync

Entity layout

Dataset

Entities

External links

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

OpenAlex Research Data

What's here

Quick Start

Install dependencies

Run the sync

Entity layout

Dataset

Entities

External links

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages