vcfclick

vcfclick turns VCF cohorts into local, queryable SQL databases for research labs and bioinformatics teams.

Ingest joint VCFs or batches of per-sample VCFs.
Query variants, genotypes, samples, and ingestions with SQL.
Run trio / family analysis (de-novo, recessive, dominant, compound-het) over a loaded pedigree, with gnomAD population-AF rarity filtering — validated against the GIAB benchmark trio.
Check per-sample quality with db qc (het/hom, Ti/Tv, chrX sex check).
Combine call sets from multiple callers with set= provenance (the CombineVariants tool from GATK3, which GATK4 removed).
Explore cohorts in an optional terminal UI, or a local browser UI (SQL, natural-language→SQL, trio, combine).
Share databases as portable Parquet bundles.
Use MCP to let an LLM write visible, auditable SQL.

Status: research preview. vcfclick is intended for exploratory research workflows, not clinical reporting.

Try It First

The browser demo runs DuckDB-Wasm over a public 1000 Genomes Parquet cohort. It does not require installing vcfclick. It is a quick way to see the core interaction: ask a genomics question, inspect the generated SQL, and view the result.

Run the demo notebook in Colab (examples/vcfclick-demo.ipynb) — a self-contained notebook that installs vcfclick, downloads its own VCFs, and runs the headline flows: ingest → SQL, GIAB-validated trio de novo, compound-het + gnomAD rarity filtering, sample QC, and combine.

The installable CLI is different: it creates local databases under your VCFCLICK_HOME (default ~/.vcfclick) and can use embedded chDB (ClickHouse engine) or DuckDB as the storage backend.

Install

vcfclick is a Python CLI installed as a standalone tool — no project, virtualenv, or pip juggling. Use uv (recommended) or pipx:

uv tool install vcfclick      # or: pipx install vcfclick
vcfclick --help

With the optional terminal UI or local web UI:

uv tool install "vcfclick[tui]"     # terminal UI: vcfclick tui
uv tool install "vcfclick[web]"     # browser UI:  vcfclick web <db>

Upgrade or remove later:

uv tool upgrade vcfclick      # pipx: pipx upgrade vcfclick
uv tool uninstall vcfclick    # pipx: pipx uninstall vcfclick

From a source checkout:

git clone https://github.com/nuin/vcfclick.git
cd vcfclick
uv sync --extra tui --group dev
uv run vcfclick --help

Platform notes

vcfclick depends on native Python wheels (cyvcf2, chdb, duckdb, pyarrow), which uv / pipx resolve automatically. Prebuilt wheels cover common macOS arm64 and Linux x86_64 Python versions; if your platform builds cyvcf2 from source, install htslib development headers first.

There is no Homebrew or conda-forge package: chdb (the embedded ClickHouse engine) ships as a binary wheel with no source build, so a from-source formula isn't possible. uv tool install / pipx are the supported install paths and work the same on macOS and Linux.

30-Second CLI Demo

Pull the public BRCA1 demo bundle, then run a SQL query:

vcfclick db pull demo \
  https://github.com/nuin/vcfclick/releases/download/v0.1.0/1000g-brca1-demo.tar.gz

vcfclick db query demo \
  "SELECT count(DISTINCT (ingest_id, sample_id)) AS samples
   FROM genotypes
   WHERE chrom = 'chr17'
     AND pos BETWEEN 43044295 AND 43170245"

Open the same database in the TUI:

vcfclick tui --db demo

Documentation

Start here:

Getting started - install, pull the demo, run first SQL queries, launch the TUI.
User guide - create databases, ingest VCFs, query, inspect, compare, export, bundle, and restore.
Backends - chDB vs DuckDB, install paths, conda, and moving data between backends.
Terminal UI - install the Textual extra and use the Locus, Operations, and SQL panes.
Web UI - vcfclick web local browser interface: SQL explorer, natural-language→SQL, trio and combine panels.
MCP and annotations - configure an MCP client, load gene/ClinVar references, and use visible LLM-generated SQL.
Trio / family analysis - merge per-sample VCFs, load a pedigree, and report de-novo / recessive / dominant / compound-het candidates.
Trio validation - de-novo recovery against the GIAB benchmark trio with its high-confidence BED as ground truth, cross-checked with bcftools +mendelian2.
Combining call sets - merge multiple callers of the same cohort with set= provenance and consensus filtering (the GATK3 CombineVariants GATK4 removed).
Sample QC - db qc per-sample het/hom, Ti/Tv, and a chrX-heterozygosity sex check flagged against the pedigree.
Schema reference - table definitions, query conventions, sparse genotype rules, and common SQL patterns.
FAQ - common install, memory, query, backend, and data interpretation questions.

Project and contributor docs:

Examples - worked BRCA1 natural-language SQL session.
Benchmarks - ingest performance measurements.
Contributing - development setup, tests, releases.
Citation - DOI and BibTeX.
License rationale - Apache 2.0 and what it means.

Core Concepts

One Database Per Cohort Or Project

The CLI manages named databases under:

~/.vcfclick/dbs/<name>/

Set VCFCLICK_HOME=/path/to/home if you want databases somewhere else.
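The layout above can be sketched in a few lines of Python. This is only an illustration of the path convention described here, not vcfclick's own code: the helper name database_dir is hypothetical, and treating an empty VCFCLICK_HOME the same as an unset one is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def database_dir(name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where a named database is expected to live.

    Hypothetical helper: mirrors the documented layout <home>/dbs/<name>/.
    """
    env = os.environ if env is None else env
    # VCFCLICK_HOME overrides the default home of ~/.vcfclick.
    # Assumption: an empty value falls back to the default.
    home = Path(env.get("VCFCLICK_HOME") or "~/.vcfclick").expanduser()
    return home / "dbs" / name


# Example: where the "demo" database would sit under the current environment.
demo_path = database_dir("demo")
```

Passing an explicit env mapping makes the helper easy to check without touching the real environment.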

Four Cohort Tables

Every database has the same logical tables:

| Table | Meaning |
| --- | --- |
| `variants` | one row per `(ingest_id, chrom, pos, ref, alt)` |
| `genotypes` | sparse non-reference sample calls only |
| `samples` | one row per `(ingest_id, sample_id)` |
| `ingestions` | one row per uploaded VCF or imported dump |

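The most important rule: genotypes is sparse. Homozygous-reference calls (0/0) are not stored, so a sample with no row at a site is invisible to any query that reads genotypes alone. See the query patterns in the Schema reference before writing allele-frequency or hom-ref queries by hand.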
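To see why sparseness matters for allele frequency, here is a minimal sketch on synthetic data. It uses Python's built-in sqlite3 as a stand-in engine (vcfclick itself runs on chDB or DuckDB), and the alt_copies column is a hypothetical name chosen for illustration; check the Schema reference for the real genotype columns. It assumes diploid autosomal calls and that every sample without a row is 0/0 rather than a no-call.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in engine, not a vcfclick backend
con.executescript("""
    CREATE TABLE samples (ingest_id TEXT, sample_id TEXT);
    -- alt_copies is a hypothetical column: 1 = het, 2 = hom-alt.
    CREATE TABLE genotypes (
        ingest_id TEXT, sample_id TEXT,
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
        alt_copies INTEGER
    );
    INSERT INTO samples VALUES ('i1','s1'), ('i1','s2'), ('i1','s3'), ('i1','s4');
    -- s3 and s4 are 0/0 at this site, so they have no genotype rows.
    INSERT INTO genotypes VALUES
        ('i1','s1','chr17',43045000,'G','A',1),
        ('i1','s2','chr17',43045000,'G','A',2);
""")

alt_alleles, non_ref_samples, n_samples = con.execute("""
    SELECT
        SUM(g.alt_copies),
        COUNT(*),
        -- The denominator must come from samples, not from genotypes.
        (SELECT COUNT(*) FROM samples s WHERE s.ingest_id = 'i1')
    FROM genotypes g
    WHERE g.ingest_id = 'i1'
      AND g.chrom = 'chr17' AND g.pos = 43045000 AND g.alt = 'A'
""").fetchone()

# Hom-ref samples are implied by absence: total samples minus stored rows.
hom_ref_samples = n_samples - non_ref_samples
allele_frequency = alt_alleles / (2 * n_samples)        # 3 / 8
naive_frequency = alt_alleles / (2 * non_ref_samples)   # 3 / 4, wrong denominator
print(hom_ref_samples, allele_frequency, naive_frequency)
```

The naive figure doubles the true frequency here because it divides by the two samples that have rows instead of the four samples in the cohort.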

Backend Choice

vcfclick can run on either backend:

chDB: embedded ClickHouse engine, default when installed, best fit for cohort-scale local databases.
DuckDB: embedded single-file backend, useful for conda/Bioconda packaging and lightweight environments.

Choose with:

VCFCLICK_BACKEND=chdb vcfclick db list
VCFCLICK_BACKEND=duckdb vcfclick db list

Backends use different on-disk formats. Move data between them with vcfclick db dump and vcfclick db ingest-parquet; details are in Backends.
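When scripting around the CLI, the backend can be selected per call by setting VCFCLICK_BACKEND in the child process environment. The sketch below is one way to do that from Python; backend_env is a hypothetical helper, and the set of accepted values is assumed to be just the two shown in this README.

```python
import os
from typing import Dict, Mapping, Optional

# Assumption: these are the only two values, as shown in this README.
BACKENDS = ("chdb", "duckdb")


def backend_env(backend: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with VCFCLICK_BACKEND set."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {BACKENDS}")
    env = dict(os.environ if base is None else base)
    env["VCFCLICK_BACKEND"] = backend
    return env


# Usage (requires vcfclick on PATH), equivalent to
# `VCFCLICK_BACKEND=duckdb vcfclick db list`:
#   import subprocess
#   subprocess.run(["vcfclick", "db", "list"], env=backend_env("duckdb"), check=True)
```

Copying the environment rather than mutating os.environ keeps the choice local to one subprocess call.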

Architecture

vcfclick separates sample/cohort data from reference annotations.

VCF / Parquet input
        |
        v
named vcfclick database
  - variants
  - genotypes
  - samples
  - ingestions
        |
        +-- SQL CLI / TUI
        +-- MCP tools for visible generated SQL

shared annotation store
  - gene coordinates
  - ClinVar significance table

Sample data lives in the selected backend for each named database. Annotation data lives in an embedded DuckDB reference store shared by the MCP tools.

Current Limits

Multi-allelic sites must be decomposed before ingest: bcftools norm -m - input.vcf.gz.
Pedigrees are loaded explicitly with vcfclick db ped (standard PED/FAM); vcfclick does not infer relationships from the VCF.
Trio analysis filters candidates; it does not call variants. Defensible de-novo candidates need data ingested with db ingest --keep-reference. See Trio / family analysis.
The DuckDB backend does not have full feature parity with chDB; some operations may be available on chDB first.
The natural-language layer is meant to produce visible SQL, not to hide SQL from the user.
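For the multi-allelic limit above, a quick pre-flight check can tell you whether a VCF still needs bcftools norm -m - before ingest. The following is a minimal sketch in plain Python, not part of vcfclick: the function names are hypothetical, and it only inspects the ALT column for a comma, without validating anything else about the file.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def multiallelic_sites(lines: Iterable[str]) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) for records whose ALT lists several alleles."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt = fields[:5]
        if "," in alt:  # a comma-separated ALT means a multi-allelic site
            yield chrom, int(pos), ref, alt


def open_vcf(path: str):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt")


# Synthetic example records (not real data).
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t43045000\t.\tG\tA\t.\tPASS\t.\n",
    "chr17\t43045100\t.\tC\tT,G\t.\tPASS\t.\n",
]
found = list(multiallelic_sites(sample))
print(found)
```

On a real file, pass the handle from open_vcf("input.vcf.gz") to multiallelic_sites; an empty result suggests the file is already decomposed.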

License

Apache License 2.0. See LICENSE and LICENSING.md.
