Choreo is a library of matchmaking compute, not a database. It extracts
structured sections from free-text profiles, embeds them, and computes
directional "who can help whom" similarity (one person's needs vs.
another's skills), then refines with LLM pair scoring to produce mutually
useful connections. All behavior is config-driven — switching matching modes
requires no code edits.
Every stage is a pure transform with a declared IO schema; all persistence
lives in adapters (the CLI + FileStore in this repo, Modal, or an external
app's own store).

| Mode | Shape | Trigger | Use case |
|---|---|---|---|
| Full cohort | N×N | `main.py --group <g>` | match everyone in a community against everyone |
| Query match (B) | 1×M | `--pipeline query_match --query '…'` | hot path: "find me a CTO who…" against a pre-built pool |
| Batch match (C) | M×N | `--pipeline batch_match --members a,b` | periodic re-matching of a subset, surfacing only novel pairs |

Semantics in docs/reference/matching_modes.md.
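
To make "surfacing only novel pairs" concrete, here is a minimal sketch of a novelty filter over an append-only match history. The `recent_pairs` helper, the `(pair_id, date)` history shape, and the 30-day month approximation are assumptions made for illustration; they are not Choreo's actual storage format or API.

```python
from datetime import date, timedelta


def recent_pairs(history, today, window_months):
    """Return pair IDs surfaced within the window (hypothetical helper)."""
    # Assumption: a month is approximated as 30 days for this sketch.
    cutoff = today - timedelta(days=30 * window_months)
    return {pair for pair, surfaced_on in history if surfaced_on >= cutoff}


# Append-only history of (pair_id, date surfaced); values are made up.
history = [
    ("alice_bob", date(2024, 1, 10)),
    ("alice_carol", date(2024, 9, 1)),
]

excluded = recent_pairs(history, today=date(2024, 10, 1), window_months=6)
candidates = ["alice_bob", "alice_carol", "bob_carol"]

# Only pairs that were not surfaced inside the window remain eligible.
novel = [pair for pair in candidates if pair not in excluded]
```

Here `alice_carol` falls inside the six-month window and is dropped, while `alice_bob` was surfaced long enough ago to be eligible again.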
- Directional cross-matching: asymmetric need-to-skill similarity — "how well can B help A?" is computed independently from "how well can A help B?"
- HyDE vocabulary bridging: an LLM rewrites each need ("make my installation respond to movement") into skill-vocabulary text so it embeds close to the matching skills
- Pure stages, pluggable IO: every stage (extract · hyde · embed · similarity · score · match · introduce · report) is a pure transform with a discoverable IO contract; chain them in-memory or via disk
- Incremental by content hash: embedding reuse is addressed per (user, section) cell — adding or editing one profile never re-embeds anyone else
- Novelty-aware batch matching: append-only match history excludes recently surfaced pairs within a configurable window
- Multi-signal blending: directional embedding similarity fused with batched, budgeted LLM pair scoring
- B-matching: fair degree distribution (`b_min`–`b_max` connections per user, asymmetric member/pool caps in batch mode)
- Config-only mode switching: flip between need/skill matching and symmetric social-connectivity matching by editing YAML
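
The asymmetry behind directional cross-matching is easy to see on toy data. The sketch below is an illustration only: `cosine` and `directional_matrix` are hypothetical helpers, and the 2-d vectors stand in for real need and skill embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def directional_matrix(needs, skills):
    """cross[i][j] = how well j's skills address i's needs (hypothetical helper)."""
    users = sorted(needs)  # sorted iteration keeps the output deterministic
    return {
        i: {j: cosine(needs[i], skills[j]) for j in users if j != i}
        for i in users
    }


# Toy embeddings: alice needs what bob is skilled at, but not the reverse.
needs = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
skills = {"alice": [1.0, 0.0], "bob": [1.0, 0.0]}

cross = directional_matrix(needs, skills)
bob_helps_alice = cross["alice"]["bob"]
alice_helps_bob = cross["bob"]["alice"]
```

With these vectors, bob's skills line up with alice's needs while alice's skills are orthogonal to bob's needs, so the two directions score differently.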

1. Setup

   ```bash
   cp .env.example .env   # add your OpenRouter key (https://openrouter.ai/settings/keys)
   uv sync                # creates .venv from pyproject.toml + uv.lock
   ```

2. Add profiles: one `.txt` per user, filename = user ID (`alice.txt` → "alice"):
   - Folder mode: any folder, via `--input` (artifacts written inside it)
   - Group mode: `data/<group>/raw/`, via `--group`

3. Run

   ```bash
   # Full cohort
   uv run python main.py --input /path/to/folder --force
   uv run python main.py --group <group_name> --force

   # Query match (pool must exist first): JSON section mapping or raw text
   uv run python main.py --pipeline query_match --group <g> \
       --query '{"needs": "someone who can build the agent backend"}'

   # Subset batch match (novel pairs only)
   uv run python main.py --pipeline batch_match --group <g> --members alice,bob

   uv run python main.py --list-pipelines
   ```

4. Results (under `data/<group>/outputs/` or `<folder>/outputs/`):
   - Per-user reports: `outputs/<user_id>.json` (`{"profile": md, "matches": md}`)
   - Cohort summary + costs: `outputs/cohort.json`, `outputs/cost_report.json`
   - Plots: `outputs/plots/` (+ re-exportable raw arrays in `plots/raw_data/`)
   - Batch-mode reports go to `outputs/batch/` (never clobber the full run)

5. Tests (offline by default: fake LLM + embedder, no API key needed)

   ```bash
   uv run pytest
   RUN_LLM_TESTS=1 uv run pytest tests/test_e2e_regression.py   # live golden e2e
   ```

Three layers — adapters own all IO, stages stay pure:

```
Adapters (own ALL IO)           Orchestration (choreo/runners.py)    Core stages (pure transforms)
main.py (CLI + FileStore)   →   run_full_match()                 →   extract · hyde · embed ·
deploy_modal.py (Modal)         run_query_match()                    similarity (rectangular) ·
external app (own store)        run_batch_match()                    score · match · introduce · report
```

- Schemas (`choreo/schemas.py`): every stage's IO is a dataclass with `to_dict`/`from_dict` (`ExtractedSections`, `EmbeddingsBundle`, `Edge`, …)
- Stage registry (`choreo/stages.py`): `describe_stage(name)` returns each stage's IO contract at runtime; per-stage `load`/`dump` helpers let stages chain in-memory or via disk
- Store protocol (`choreo/store.py`): `FileStore` is the reference adapter; an external app implements the same protocol against its own DB
- Entry at any stage: raw `Profiles`, pre-sectioned input, or a pre-built `EmbeddingsBundle`

Deep dive: docs/reference/stages_and_adapters.md. Full IO spec for wrapping Choreo as an external tool: choreo_IO.md.
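
For readers unfamiliar with the `to_dict`/`from_dict` pattern mentioned above, a minimal sketch follows. `EdgeSketch` and its three fields are invented for illustration; the real `Edge` schema in `choreo/schemas.py` may carry different fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EdgeSketch:
    """Hypothetical stand-in for a stage IO schema; field names are assumed."""

    source: str
    target: str
    score: float

    def to_dict(self) -> dict:
        # Plain dict of JSON-friendly values, suitable for any store.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "EdgeSketch":
        return cls(
            source=data["source"],
            target=data["target"],
            score=float(data["score"]),
        )


edge = EdgeSketch(source="alice", target="bob", score=0.82)

# Round-trip through JSON, as an external store would when persisting it.
restored = EdgeSketch.from_dict(json.loads(json.dumps(edge.to_dict())))
```

Because the dict form contains only plain values, the same object can be chained in memory or written to disk or a database without the stage knowing which.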

```python
# Programmatic API (what external apps import)
from choreo import run_full_match, run_query_match, run_batch_match, FileStore, load_config

# `profiles`, `pool_bundle`, and `config` (from load_config) are supplied by the caller
store = FileStore("data/grp")

run_full_match(profiles, config, store=store)  # or sections, or a bundle
run_query_match({"needs": "a CTO great at agents"}, pool_bundle, config)
run_batch_match(["alice", "bob"], pool_bundle, config,
                excluded_pairs=store.get_match_history(window_months=6),
                pool_sections={s.id: s.sections for s in store.get_sections()})
```

1. Ingest: load `.txt` profiles, content-hash for change detection
2. Extract: LLM pulls structured sections (default: `skills`, `vision`, `project`, `needs`; controlled by `active` flags in `section_prompt.yaml`)
3. HyDE: for each cross-section weight (e.g. `needs_skills`), an LLM rewrites the source section into target-section vocabulary; runs only when `cross_section_weights` is non-empty
4. Embed: sections → `(n_users, n_sections, dim)` tensor + HyDE tensors; full-size vectors stored on disk, MRL-truncated at compute time (re-tunable without re-embedding)
5. Similarity: rectangular at the core (source set × target set), combining same-section cosine (symmetric) with HyDE-bridged cross-section similarity (asymmetric: `cross[i][j]` = "how well can j's skills address i's needs"), weight-fused into a directional matrix; the square cohort path symmetrizes `(dir + dir.T) / 2` for selection. Absent sections are neutral (masked), not zero, which is what lets a needs-only query drop into the same machinery
6. Score: batched LLM pair scoring (N profiles per call → N·(N−1)/2 pairs), budgeted, novelty-aware
7. Match: greedy b-matching on blended scores (`final = embed_weight·embed + llm_weight·llm`)
8. Introduce + report: directional intros (what each person offers the other's project) + per-user reports, cohort summary, cost report, plots

The canonical config ships inside the package (choreo/defaults/config.yaml +
four prompt yamls; annotated schema in CLAUDE.md). Override per
use-case with a config dir (--config-dir / load_config(config_dir=…):
config.yaml deep-merges, prompt files replace) and/or per-call overrides
(--set dotted.key=value / load_config(overrides={…})). All models are
OpenRouter slugs (provider/model):

```yaml
models:
  embedding: "google/gemini-embedding-2-preview"
  embedding_dimensions: 1536      # MRL truncation; null = full native size (3072)
  extraction_llm: "google/gemini-3.1-flash-lite"
  pair_llm: "google/gemini-3.1-flash-lite"
  reasoning_effort: "low"         # global default; pair scoring overrides to "medium"
hyde:
  n_descriptors: 1                # HyDE phrasings per source section
recipe:
  section_weights:                # same-section (symmetric); negative = dissimilarity preferred
    skills: -0.10
    vision: 0.30
    project: 0.30
    needs: -0.10
  cross_section_weights:          # cross-section (DIRECTIONAL); "<source>_<target>"
    needs_skills: 0.80
blending:
  embed_weight: 0.35
  llm_weight: 0.65
matching:
  b_min: 3                        # min connections per user (member side in batch mode)
  b_max: 4
  pool_b_max: null                # batch mode: optional pool-side degree cap
  novelty_window_months: 6        # batch mode: exclusion window for past matches
query:                            # Mode B defaults
  top_k: 5
  llm_rerank: true                # false = pure-embedding, cheaper
```

Prompt files in `config/`: `section_prompt.yaml` (section definitions with `active` flags), `hyde_prompt.yaml`, `scoring_prompt.yaml`, `introduction_prompt.yaml`.

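
As a rough picture of how a deep-merged config dir and dotted per-call overrides might combine, consider the sketch below. `deep_merge` and `set_dotted` are hypothetical helpers written for this example, and the nesting of the sample keys is assumed; Choreo's `load_config` may behave differently in its details.

```python
import copy


def deep_merge(base, override):
    """Recursively merge override into a copy of base (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def set_dotted(config, dotted_key, value):
    """Apply one 'dotted.key=value' style override in place (hypothetical helper)."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Assumed shapes: packaged defaults plus a per-use-case config.yaml.
defaults = {
    "matching": {"b_min": 3, "b_max": 4},
    "query": {"top_k": 5, "llm_rerank": True},
}
use_case = {"matching": {"b_max": 6}}

config = deep_merge(defaults, use_case)         # sibling keys such as b_min survive
set_dotted(config, "query.llm_rerank", False)   # per-call override applied last
```

The point of a deep merge is that a use-case file only needs to state what differs; everything else falls through from the defaults, which stay unmodified.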
To move between need/skill matching and symmetric social-connectivity matching:
- `section_prompt.yaml`: flip `active` flags on the relevant sections
- `config.yaml`: adjust `section_weights`; set/clear `cross_section_weights` (empty disables HyDE and directionality entirely)
- Swap `scoring_prompt.yaml` / `introduction_prompt.yaml` if the framing changes

| Module | Purpose |
|---|---|
| `schemas.py` | Typed IO dataclasses for every stage |
| `stages.py` | Stage registry: `describe_stage`, per-stage load/dump |
| `store.py` | Store protocol + `FileStore` reference adapter |
| `runners.py` | Public mode runners: full / query / batch |
| `query.py` · `batch_match.py` | Mode B and Mode C logic |
| `ingest.py` · `extract.py` · `hyde.py` · `embed.py` | Profile → sections → descriptors → embeddings |
| `candidate.py` | Rectangular + square fused similarity |
| `score.py` · `match.py` | LLM pair scoring · greedy b-matching |
| `introduction.py` · `report.py` | Directional intros · report data + writing |
| `llm.py` · `cost_tracker.py` | OpenRouter wrapper (caching, batching) · cost accounting |
| `tsne.py` · `visualize_similarity.py` · `score_correlation.py` · `raw_data.py` | Plots + raw plot data |

- Pair IDs are alphabetically sorted for stability (`alice_bob`, never `bob_alice`), via `utils.stable_pair_id()`
- Embedding ownership stays in-repo: external stores hold the bundle (`EmbeddingsBundle.to_dict()`) and hand it back; bundles carry `embedding_model` + `dim` provenance, and a model mismatch raises or triggers a full re-embed
- Determinism is load-bearing: anything feeding an LLM prompt or cache key iterates in sorted order; cache keys use sha256 (`utils.hash_text`), never the salted builtin `hash()`
- Directionality preserved at the core: the asymmetric matrix is never symmetrized during computation; symmetrization is an aggregation choice on the square cohort path only
- Backward compatible: empty `cross_section_weights` → no HyDE, fully symmetric mode
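
The stability rules above (sorted pair IDs, sorted iteration, sha256 cache keys) can be sketched as follows. The functions `pair_id`, `text_hash`, and `cache_key` are stand-ins written for this example; they are not the code behind `utils.stable_pair_id()` or `utils.hash_text`, and the key layout is assumed.

```python
import hashlib


def pair_id(a, b):
    """Order-independent pair ID: the two user IDs sorted, joined by '_'."""
    first, second = sorted((a, b))
    return f"{first}_{second}"


def text_hash(text):
    """sha256 hex digest; stable across processes, unlike the salted builtin hash()."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(sections):
    """Hash a section mapping with keys visited in sorted order (assumed layout)."""
    canonical = "\n".join(f"{name}={sections[name]}" for name in sorted(sections))
    return text_hash(canonical)


same_pair = pair_id("bob", "alice") == pair_id("alice", "bob")

# Insertion order of the dict does not affect the key.
key_a = cache_key({"skills": "rust", "needs": "design"})
key_b = cache_key({"needs": "design", "skills": "rust"})
```

Python salts `hash()` for strings per process by default, so a cache keyed on it would miss on every new run; a content digest avoids that.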
- Python ≥ 3.11, managed with uv
- OpenRouter API key (in `.env` as `OPENROUTER_API_KEY`)

```bash
uv run modal deploy deploy_modal.py                               # deploy all functions
uv run modal run deploy_modal.py --input-dir data/test4 --force   # legacy full run
```

Granular endpoints persist a FileStore per group on the `choreo-data` Volume: `upsert_profiles(user_profiles_json, group)` · `query_match(payload_json, group)` · `batch_match(members_json, group)`. See the header of `deploy_modal.py` and choreo_IO.md §1.4. Requires a `choreo-secrets` Modal secret holding `OPENROUTER_API_KEY` (add `AWS_*` keys to also push outputs to S3).