vcfclick

vcfclick turns VCF cohorts into local, queryable SQL databases for research labs and bioinformatics teams.

  • Ingest joint VCFs or batches of per-sample VCFs.
  • Query variants, genotypes, samples, and ingestions with SQL.
  • Run trio / family analysis (de-novo, recessive, dominant, compound-het) over a loaded pedigree, with gnomAD population-AF rarity filtering — validated against the GIAB benchmark trio.
  • Check per-sample quality with db qc (het/hom, Ti/Tv, chrX sex check).
  • Combine multiple callers' call sets with set= provenance — the GATK3 CombineVariants that GATK4 removed.
  • Explore cohorts in an optional terminal UI, or a local browser UI (SQL, natural-language→SQL, trio, combine).
  • Share databases as portable Parquet bundles.
  • Use MCP to let an LLM write visible, auditable SQL.

Status: research preview. vcfclick is intended for exploratory research workflows, not clinical reporting.

Try It First

Try the live browser demo.

The browser demo runs DuckDB-Wasm over a public 1000 Genomes Parquet cohort. It does not require installing vcfclick. It is a quick way to see the core interaction: ask a genomics question, inspect the generated SQL, and view the result.

Run the demo notebook in Colab (examples/vcfclick-demo.ipynb) — a self-contained notebook that installs vcfclick, downloads its own VCFs, and runs the headline flows: ingest → SQL, GIAB-validated trio de novo, compound-het + gnomAD rarity filtering, sample QC, and combine.

The installable CLI is different: it creates local databases under your VCFCLICK_HOME (default ~/.vcfclick) and can use embedded chDB (ClickHouse engine) or DuckDB as the storage backend.

Install

vcfclick is a Python CLI installed as a standalone tool — no project, virtualenv, or pip juggling. Use uv (recommended) or pipx:

uv tool install vcfclick      # or: pipx install vcfclick
vcfclick --help

With the optional terminal UI or local web UI:

uv tool install "vcfclick[tui]"     # terminal UI: vcfclick tui
uv tool install "vcfclick[web]"     # browser UI:  vcfclick web <db>

Upgrade or remove later:

uv tool upgrade vcfclick      # pipx: pipx upgrade vcfclick
uv tool uninstall vcfclick    # pipx: pipx uninstall vcfclick

From a source checkout:

git clone https://github.com/nuin/vcfclick.git
cd vcfclick
uv sync --extra tui --group dev
uv run vcfclick --help

Platform notes

vcfclick depends on native Python wheels (cyvcf2, chdb, duckdb, pyarrow), which uv / pipx resolve automatically. Prebuilt wheels cover common macOS arm64 and Linux x86_64 Python versions; if your platform builds cyvcf2 from source, install htslib development headers first.

There is no Homebrew or conda-forge package: chdb (the embedded ClickHouse engine) ships as a binary wheel with no source build, so a from-source formula isn't possible. uv tool install / pipx are the supported install paths and work the same on macOS and Linux.

30-Second CLI Demo

Pull the public BRCA1 demo bundle, then run a SQL query:

vcfclick db pull demo \
  https://github.com/nuin/vcfclick/releases/download/v0.1.0/1000g-brca1-demo.tar.gz

vcfclick db query demo \
  "SELECT count(DISTINCT (ingest_id, sample_id)) AS samples
   FROM genotypes
   WHERE chrom = 'chr17'
     AND pos BETWEEN 43044295 AND 43170245"
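
What the query above counts can be illustrated on toy data. The sketch below is not vcfclick code: it uses Python's built-in sqlite3 as a stand-in engine, a made-up handful of rows, and only the genotypes columns that appear in the demo query. Because genotypes stores non-reference calls only, the result is the number of distinct (ingest_id, sample_id) pairs with at least one stored call in the interval.

```python
import sqlite3

# Toy stand-in for the genotypes table; only columns named in the demo query.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE genotypes (ingest_id TEXT, sample_id TEXT, chrom TEXT, pos INTEGER)"
)
rows = [
    ("ing1", "S1", "chr17", 43045000),
    ("ing1", "S1", "chr17", 43100000),  # same sample, second variant in the interval
    ("ing1", "S2", "chr17", 43170245),  # BETWEEN is inclusive at both ends
    ("ing1", "S3", "chr17", 50000000),  # outside the interval
    ("ing2", "S1", "chr17", 43050000),  # same sample_id, different ingestion
]
con.executemany("INSERT INTO genotypes VALUES (?, ?, ?, ?)", rows)

# SQLite has no count(DISTINCT (a, b)), so the distinct pairs are counted
# through a subquery instead.
carriers = con.execute(
    """
    SELECT count(*) FROM (
        SELECT DISTINCT ingest_id, sample_id
        FROM genotypes
        WHERE chrom = 'chr17'
          AND pos BETWEEN 43044295 AND 43170245
    )
    """
).fetchone()[0]
print(carriers)  # 3
```

Counting the pair rather than sample_id alone matters when the same sample name appears in more than one ingestion.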

Open the same database in the TUI:

vcfclick tui --db demo

Documentation

Start here:

  • Getting started - install, pull the demo, run first SQL queries, launch the TUI.
  • User guide - create databases, ingest VCFs, query, inspect, compare, export, bundle, and restore.
  • Backends - chDB vs DuckDB, install paths, conda, and moving data between backends.
  • Terminal UI - install the Textual extra and use the Locus, Operations, and SQL panes.
  • Web UI - vcfclick web local browser interface: SQL explorer, natural-language→SQL, trio and combine panels.
  • MCP and annotations - configure an MCP client, load gene/ClinVar references, and use visible LLM-generated SQL.
  • Trio / family analysis - merge per-sample VCFs, load a pedigree, and report de-novo / recessive / dominant / compound-het candidates.
  • Trio validation - de-novo recovery against the GIAB benchmark trio with its high-confidence BED as ground truth, cross-checked with bcftools +mendelian2.
  • Combining call sets - merge multiple callers of the same cohort with set= provenance and consensus filtering (the GATK3 CombineVariants GATK4 removed).
  • Sample QC - db qc per-sample het/hom, Ti/Tv, and a chrX-heterozygosity sex check flagged against the pedigree.
  • Schema reference - table definitions, query conventions, sparse genotype rules, and common SQL patterns.
  • FAQ - common install, memory, query, backend, and data interpretation questions.

Core Concepts

One Database Per Cohort Or Project

The CLI manages named databases under:

~/.vcfclick/dbs/<name>/

Set VCFCLICK_HOME=/path/to/home if you want databases somewhere else.
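
For scripts that need to locate a database directory, the layout described above can be sketched in a few lines. This is an illustration of the documented convention, not vcfclick's own implementation; the helper name database_dir is hypothetical, and treating an empty VCFCLICK_HOME as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def database_dir(name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where a named database is expected to live.

    Hypothetical helper: mirrors the documented <home>/dbs/<name>/ layout.
    """
    env = os.environ if env is None else env
    # VCFCLICK_HOME overrides the documented default of ~/.vcfclick.
    home = Path(env.get("VCFCLICK_HOME") or "~/.vcfclick").expanduser()
    return home / "dbs" / name


print(database_dir("demo", env={}))
print(database_dir("demo", env={"VCFCLICK_HOME": "/data/vcfclick"}))
```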

Four Cohort Tables

Every database has the same logical tables:

Table        Meaning
variants     one row per (ingest_id, chrom, pos, ref, alt)
genotypes    sparse non-reference sample calls only
samples      one row per (ingest_id, sample_id)
ingestions   one row per uploaded VCF or imported dump

The most important rule: genotypes is sparse. Homozygous-reference calls (0/0) are not stored, so a sample with no row at a site is not the same as a sample with no data. See the query patterns in the Schema reference before writing allele-frequency or hom-ref queries by hand.
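
A toy example shows why the sparse rule matters for allele frequency. The sketch below uses Python's sqlite3 as a stand-in engine and a simplified schema. The alt_count column (1 for het, 2 for hom-alt) is a hypothetical name chosen for illustration, not the real vcfclick column, and the calculation assumes diploid calls and that every sample without a row is hom-ref (no-calls are ignored). The Schema reference remains the authority for real queries.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE samples (ingest_id TEXT, sample_id TEXT);
    -- alt_count is a hypothetical column: 1 = het, 2 = hom-alt.
    CREATE TABLE genotypes (ingest_id TEXT, sample_id TEXT,
                            chrom TEXT, pos INTEGER, alt_count INTEGER);
    """
)
# Ten samples in the cohort, but only two carry the variant.
con.executemany(
    "INSERT INTO samples VALUES ('ing1', ?)", [(f"S{i}",) for i in range(1, 11)]
)
con.executemany(
    "INSERT INTO genotypes VALUES ('ing1', ?, 'chr17', 43045000, ?)",
    [("S1", 1), ("S2", 2)],
)

alt_alleles, carriers = con.execute(
    "SELECT sum(alt_count), count(*) FROM genotypes "
    "WHERE ingest_id = 'ing1' AND chrom = 'chr17' AND pos = 43045000"
).fetchone()
n_samples = con.execute(
    "SELECT count(*) FROM samples WHERE ingest_id = 'ing1'"
).fetchone()[0]

# Wrong: the denominator only sees stored (non-reference) rows.
naive_af = alt_alleles / (2 * carriers)
# Better: the denominator comes from the samples table.
cohort_af = alt_alleles / (2 * n_samples)
# Hom-ref samples are inferred by subtraction, never read from genotypes.
hom_ref = n_samples - carriers

print(naive_af, cohort_af, hom_ref)  # 0.75 0.15 8
```

The denominator has to come from samples because the eight hom-ref samples leave no trace in genotypes.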

Backend Choice

vcfclick can run on either backend:

  • chDB: embedded ClickHouse engine, default when installed, best fit for cohort-scale local databases.
  • DuckDB: embedded single-file backend, useful for conda/Bioconda packaging and lightweight environments.

Choose with:

VCFCLICK_BACKEND=chdb vcfclick db list
VCFCLICK_BACKEND=duckdb vcfclick db list

Backends use different on-disk formats. Move data between them with vcfclick db dump and vcfclick db ingest-parquet; details are in Backends.
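
When vcfclick is driven from a script, the backend can be selected per call by passing the environment variable explicitly. A minimal sketch, assuming only the VCFCLICK_BACKEND values shown above; the helper name backend_env is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# The two backend names documented above.
BACKENDS = ("chdb", "duckdb")


def backend_env(backend: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with VCFCLICK_BACKEND set."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {BACKENDS}")
    env = dict(os.environ if base is None else base)
    env["VCFCLICK_BACKEND"] = backend
    return env


# Usage (requires vcfclick on PATH):
#   import subprocess
#   subprocess.run(["vcfclick", "db", "list"], env=backend_env("duckdb"), check=True)
print(backend_env("duckdb", base={})["VCFCLICK_BACKEND"])
```

Copying the environment instead of mutating os.environ keeps the choice local to one subprocess call.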

Architecture

vcfclick separates sample/cohort data from reference annotations.

VCF / Parquet input
        |
        v
named vcfclick database
  - variants
  - genotypes
  - samples
  - ingestions
        |
        +-- SQL CLI / TUI
        +-- MCP tools for visible generated SQL

shared annotation store
  - gene coordinates
  - ClinVar significance table

Sample data lives in the selected backend for each named database. Annotation data lives in an embedded DuckDB reference store shared by the MCP tools.

Current Limits

  • Multi-allelic sites must be decomposed before ingest: bcftools norm -m - input.vcf.gz.
  • Pedigrees are loaded explicitly with vcfclick db ped (standard PED/FAM); vcfclick does not infer relationships from the VCF.
  • Trio analysis filters candidates; it does not call variants. Defensible de-novo candidates need data ingested with db ingest --keep-reference. See Trio / family analysis.
  • DuckDB backend support is useful but not identical to chDB support; some operations may be chDB-first.
  • The natural-language layer is meant to produce visible SQL, not to hide SQL from the user.
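
To check whether a VCF still needs the bcftools norm step from the first limit above, a quick pre-ingest scan can look for records whose ALT column lists more than one allele. This is a standalone sketch, not part of vcfclick; the function names are hypothetical, and it only inspects the first five tab-separated VCF columns.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple

Site = Tuple[str, int, str, str]


def multiallelic_sites(lines: Iterable[str]) -> Iterator[Site]:
    """Yield (chrom, pos, ref, alt) for records with a comma-separated ALT."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt = fields[:5]
        if "," in alt:
            yield chrom, int(pos), ref, alt


def scan_vcf(path: str) -> List[Site]:
    """Scan a plain or gzip/bgzip-compressed VCF file on disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return list(multiallelic_sites(handle))


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\t.\t.\n",
    "chr1\t200\t.\tC\tT,G\t.\t.\t.\n",
]
print(list(multiallelic_sites(example)))  # [('chr1', 200, 'C', 'T,G')]
```

An empty result from scan_vcf("input.vcf.gz") suggests the file is already biallelic; any hits mean the file should go through bcftools norm -m - before ingest.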

License

Apache License 2.0. See LICENSE and LICENSING.md.

About

A research-bioinformatics VCF database. Embedded ClickHouse engine, embedded DuckDB annotations, MCP natural-language layer.
