vcfclick turns VCF cohorts into local, queryable SQL databases for research labs and bioinformatics teams.
- Ingest joint VCFs or batches of per-sample VCFs.
- Query variants, genotypes, samples, and ingestions with SQL.
- Run trio / family analysis (de-novo, recessive, dominant, compound-het) over a loaded pedigree, with gnomAD population-AF rarity filtering — validated against the GIAB benchmark trio.
- Check per-sample quality with `db qc` (het/hom, Ti/Tv, chrX sex check).
- Combine multiple callers' call sets with `set=` provenance, the GATK3 `CombineVariants` that GATK4 removed.
- Explore cohorts in an optional terminal UI, or a local browser UI (SQL, natural-language→SQL, trio, combine).
- Share databases as portable Parquet bundles.
- Use MCP to let an LLM write visible, auditable SQL.
Status: research preview. vcfclick is intended for exploratory research workflows, not clinical reporting.
The browser demo runs DuckDB-Wasm over a public 1000 Genomes Parquet cohort. It does not require installing vcfclick. It is a quick way to see the core interaction: ask a genomics question, inspect the generated SQL, and view the result.
Run the demo notebook in Colab
(examples/vcfclick-demo.ipynb) — a
self-contained notebook that installs vcfclick, downloads its own VCFs,
and runs the headline flows: ingest → SQL, GIAB-validated trio de novo,
compound-het + gnomAD rarity filtering, sample QC, and combine.
The installable CLI is different: it creates local databases under your
VCFCLICK_HOME (default ~/.vcfclick) and can use embedded chDB
(ClickHouse engine) or DuckDB as the storage backend.
vcfclick is a Python CLI installed as a standalone tool — no
project, virtualenv, or pip juggling. Use uv
(recommended) or pipx:
uv tool install vcfclick      # or: pipx install vcfclick
vcfclick --help
With the optional terminal UI or local web UI:
uv tool install "vcfclick[tui]"   # terminal UI: vcfclick tui
uv tool install "vcfclick[web]"   # browser UI: vcfclick web <db>
Upgrade or remove later:
uv tool upgrade vcfclick      # pipx: pipx upgrade vcfclick
uv tool uninstall vcfclick    # pipx: pipx uninstall vcfclick
From a source checkout:
git clone https://github.com/nuin/vcfclick.git
cd vcfclick
uv sync --extra tui --group dev
uv run vcfclick --help
vcfclick depends on native Python wheels (cyvcf2, chdb, duckdb,
pyarrow), which uv / pipx resolve automatically. Prebuilt wheels
cover common macOS arm64 and Linux x86_64 Python versions; if your
platform builds cyvcf2 from source, install htslib development
headers first.
There is no Homebrew or conda-forge package: chdb (the embedded
ClickHouse engine) ships as a binary wheel with no source build, so a
from-source formula isn't possible. uv tool install / pipx are the
supported install paths and work the same on macOS and Linux.
Pull the public BRCA1 demo bundle, then run a SQL query:
vcfclick db pull demo \
https://github.com/nuin/vcfclick/releases/download/v0.1.0/1000g-brca1-demo.tar.gz
vcfclick db query demo \
"SELECT count(DISTINCT (ingest_id, sample_id)) AS samples
FROM genotypes
WHERE chrom = 'chr17'
AND pos BETWEEN 43044295 AND 43170245"
Open the same database in the TUI:
vcfclick tui --db demo
Start here:
- Getting started - install, pull the demo, run first SQL queries, launch the TUI.
- User guide - create databases, ingest VCFs, query, inspect, compare, export, bundle, and restore.
- Backends - chDB vs DuckDB, install paths, conda, and moving data between backends.
- Terminal UI - install the Textual extra and use the Locus, Operations, and SQL panes.
- Web UI - `vcfclick web` local browser interface: SQL explorer, natural-language→SQL, trio and combine panels.
- MCP and annotations - configure an MCP client, load gene/ClinVar references, and use visible LLM-generated SQL.
- Trio / family analysis - merge per-sample VCFs, load a pedigree, and report de-novo / recessive / dominant / compound-het candidates.
- Trio validation - de-novo recovery against the GIAB benchmark trio with its high-confidence BED as ground truth, cross-checked with `bcftools +mendelian2`.
- Combining call sets - merge multiple callers of the same cohort with `set=` provenance and consensus filtering (the GATK3 CombineVariants that GATK4 removed).
- Sample QC - `db qc` per-sample het/hom, Ti/Tv, and a chrX-heterozygosity sex check flagged against the pedigree.
- Schema reference - table definitions, query conventions, sparse genotype rules, and common SQL patterns.
- FAQ - common install, memory, query, backend, and data interpretation questions.
Project and contributor docs:
- Examples - worked BRCA1 natural-language SQL session.
- Benchmarks - ingest performance measurements.
- Contributing - development setup, tests, releases.
- Citation - DOI and BibTeX.
- License rationale - Apache 2.0 and what it means.
The CLI manages named databases under:
~/.vcfclick/dbs/<name>/
Set VCFCLICK_HOME=/path/to/home if you want databases somewhere else.
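The lookup rule above can be sketched in a few lines of Python. This is an illustration only, not vcfclick's own code: the helper name `database_dir` is hypothetical, and treating an empty `VCFCLICK_HOME` the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def database_dir(name: str) -> Path:
    """Return the directory where a named database is expected to live."""
    # VCFCLICK_HOME overrides the default home of ~/.vcfclick.
    # Assumption: an empty value is treated like an unset variable.
    home = Path(os.environ.get("VCFCLICK_HOME") or Path.home() / ".vcfclick")
    return home / "dbs" / name


if __name__ == "__main__":
    # Prints the expected location of the "demo" database.
    print(database_dir("demo"))
```

A script that post-processes databases could use a helper like this to find them without hard-coding `~/.vcfclick`.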
Every database has the same logical tables:
| Table | Meaning |
|---|---|
| variants | one row per (ingest_id, chrom, pos, ref, alt) |
| genotypes | sparse non-reference sample calls only |
| samples | one row per (ingest_id, sample_id) |
| ingestions | one row per uploaded VCF or imported dump |
The most important rule: genotypes is sparse. Homozygous-reference
calls (0/0) are not stored. See schema query patterns
before writing allele-frequency or hom-ref queries by hand.
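To see why sparsity matters, here is a toy sketch using Python's built-in `sqlite3` rather than a real vcfclick database. The data is made up, and the `alt_copies` column is a hypothetical stand-in for however the real schema encodes the genotype. It also assumes a diploid autosomal site where every sample was called, so an absent row means 0/0.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE samples   (ingest_id TEXT, sample_id TEXT);
CREATE TABLE genotypes (ingest_id TEXT, sample_id TEXT, chrom TEXT, pos INTEGER,
                        alt_copies INTEGER);  -- hypothetical: 1 = het, 2 = hom-alt
INSERT INTO samples VALUES ('i1','s1'), ('i1','s2'), ('i1','s3'), ('i1','s4');
-- Only the two carriers have rows; s3 and s4 are 0/0 and therefore absent.
INSERT INTO genotypes VALUES ('i1','s1','chr17',43045000,1),
                             ('i1','s2','chr17',43045000,2);
""")

alt_alleles, carriers = con.execute(
    "SELECT sum(alt_copies), count(*) FROM genotypes "
    "WHERE chrom = 'chr17' AND pos = 43045000"
).fetchone()
n_samples = con.execute("SELECT count(*) FROM samples").fetchone()[0]

# Wrong: the denominator only counts samples that have a genotype row.
naive_af = alt_alleles / (2 * carriers)
# Better: the denominator comes from the samples table (2 alleles per sample).
cohort_af = alt_alleles / (2 * n_samples)
# Hom-ref samples are inferred by subtraction, not read from genotypes.
hom_ref = n_samples - carriers

print(naive_af, cohort_af, hom_ref)
```

The point of the sketch is the denominator: any frequency computed from `genotypes` alone counts only carriers, so the cohort size has to come from `samples`.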
vcfclick can run on either backend:
- chDB: embedded ClickHouse engine, default when installed, best fit for cohort-scale local databases.
- DuckDB: embedded single-file backend, useful for conda/Bioconda packaging and lightweight environments.
Choose with:
VCFCLICK_BACKEND=chdb vcfclick db list
VCFCLICK_BACKEND=duckdb vcfclick db list
Backends use different on-disk formats. Move data between them with
vcfclick db dump and vcfclick db ingest-parquet; details are in
Backends.
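When driving the CLI from a script, the backend can be selected by setting the same environment variable on the child process. The sketch below is one way to do that. The helper names `backend_env` and `list_databases` are hypothetical, and only the `vcfclick db list` command and the two backend values shown above are taken from this README.

```python
import os
import shutil
import subprocess

# The two backend values documented above.
BACKENDS = ("chdb", "duckdb")


def backend_env(backend: str) -> dict:
    """Return a copy of the current environment with VCFCLICK_BACKEND set."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {BACKENDS}")
    env = dict(os.environ)
    env["VCFCLICK_BACKEND"] = backend
    return env


def list_databases(backend: str):
    """Run `vcfclick db list` under a backend; return stdout, or None if vcfclick is not on PATH."""
    exe = shutil.which("vcfclick")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "db", "list"],
        env=backend_env(backend),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Validating the backend name before launching the subprocess keeps a typo from silently falling through to whatever default the CLI picks.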
vcfclick separates sample/cohort data from reference annotations.
VCF / Parquet input
        |
        v
named vcfclick database
  - variants
  - genotypes
  - samples
  - ingestions
        |
        +-- SQL CLI / TUI
        +-- MCP tools for visible generated SQL
shared annotation store
  - gene coordinates
  - ClinVar significance table
Sample data lives in the selected backend for each named database. Annotation data lives in an embedded DuckDB reference store shared by the MCP tools.
- Multi-allelic sites must be decomposed before ingest: `bcftools norm -m - input.vcf.gz`.
- Pedigrees are loaded explicitly with `vcfclick db ped` (standard PED/FAM); vcfclick does not infer relationships from the VCF.
- Trio analysis is candidate FILTERING, not variant calling; defensible de-novo needs `db ingest --keep-reference`. See Trio.
- DuckDB backend support is useful but not identical to chDB support; some operations may be chDB-first.
- The natural-language layer is meant to produce visible SQL, not to hide SQL from the user.
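The trio limitation above can be made concrete with a small sketch. It is not vcfclick's implementation: the function name and return labels are hypothetical, and real de-novo filters usually also weigh depth and genotype quality. A parent with no stored genotype row is modelled as `None`. Under sparse storage that could mean either hom-ref or no-call, which is presumably why explicit reference calls are needed for a defensible result.

```python
from typing import Optional


def classify_de_novo(child: Optional[str], father: Optional[str], mother: Optional[str]) -> str:
    """Classify one biallelic site in a trio from VCF-style GT strings."""

    def alleles(gt):
        # None models "no row stored"; "./." models an explicit no-call.
        if gt is None:
            return None
        parts = gt.replace("|", "/").split("/")
        return None if "." in parts else sorted(parts)

    c, f, m = alleles(child), alleles(father), alleles(mother)
    if c != ["0", "1"]:
        return "not_candidate"      # child is not heterozygous
    if f is None or m is None:
        return "unresolved"         # a parent call is absent or missing
    if f == ["0", "0"] and m == ["0", "0"]:
        return "candidate"          # both parents are explicitly hom-ref
    return "not_candidate"          # at least one parent carries the allele


if __name__ == "__main__":
    print(classify_de_novo("0/1", "0/0", "0/0"))
    print(classify_de_novo("0/1", "0/0", None))
```

The "unresolved" label is the design point: a candidate filter that cannot see the parents' reference calls can only say that it found no evidence of inheritance, not that the parents were confidently hom-ref.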
Apache License 2.0. See LICENSE and LICENSING.md.