diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
new file mode 100644
index 000000000..a3fcce84b
--- /dev/null
+++ b/.claude/CLAUDE.md
@@ -0,0 +1,36 @@
+# Odds ML Orchestration (Claude Code Team)
+
+This repository maintains an orchestrated pipeline for **sports odds statistical research and ML predictions** (spreads, moneylines, totals) with a strict **rolling 5-day freshness window**.
+
+## Non-negotiables
+
+- **Freshness SLO**: all datasets used for training and prediction must be derived from the last **5 days** of collected data. If inputs are stale, the pipeline must fail fast and trigger backfill.
+- **Canonical markets**:
+ - **Spreads**: store **one canonical value per game** (favorite perspective OR `spread_magnitude` + `favorite_team`). Never average ±spread rows.
+ - **Totals**: store **one canonical total** per game (not separate over/under rows).
+ - **Moneylines**: convert American odds to **implied probability** for aggregation and ML.
+
+See the project rules in `./.claude/rules/` for details.
+
+## Repeatable Agent Team Template (copy/paste prompt)
+
+Create an agent team for odds pipeline maintenance with these teammates and responsibilities. Require plan approval before implementing any schema or workflow changes. Put the lead into delegate mode after spawning.
+
+- **TeamLead (delegate mode)**: coordination only, creates tasks, assigns owners, synthesizes results.
+- **CollectorEngineer**: web scraping + API collectors, rate limits, idempotency, retries/backfills.
+- **NormalizationSteward**: canonicalization of spreads/totals/moneylines; dedupe; invariants/tests.
+- **DataFreshnessSRE**: rolling 5-day window enforcement; staleness detection; alerting/escalation.
+- **MLTrainerEngineer**: feature views; training; evaluation; prediction artifacts.
+- **CostQuotaAnalyst (optional)**: API credit/usage budgeting; schedule optimization.
+
+Approval criteria for TeamLead:
+- Reject any plan that changes market sign conventions.
+- Reject any plan that allows stale inputs to silently pass.
+- Reject any plan that introduces non-idempotent collectors.
+
+## Operational contract (GitHub Actions)
+
+GitHub Actions runs the scheduled pipeline. The code must support:
+- **Collect** → **Normalize/Validate** → **Train** → **Predict/Report** → **Freshness Guard**
+- Bounded backfill on staleness (5-day lookback).
+
diff --git a/.claude/hooks/task_gate.py b/.claude/hooks/task_gate.py
new file mode 100644
index 000000000..5cba2a20a
--- /dev/null
+++ b/.claude/hooks/task_gate.py
@@ -0,0 +1,47 @@
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+import sys
+
+
+def _run(cmd: list[str]) -> subprocess.CompletedProcess:
+ return subprocess.run(cmd, capture_output=True, text=True)
+
+
+def main() -> int:
+    # Read hook JSON input (best-effort; this gate runs on both TaskCompleted and TeammateIdle).
+ try:
+ _ = json.loads(sys.stdin.read() or "{}")
+ except Exception:
+ pass
+
+ # Only run gates if the odds pipeline package is importable in this environment.
+ probe = _run([sys.executable, "-c", "import odds_pipeline"])
+ if probe.returncode != 0:
+ print("odds_pipeline not installed; skipping odds pipeline gates", file=sys.stderr)
+ return 0
+
+ v = _run([sys.executable, "-m", "odds_pipeline", "validate"])
+ if v.returncode != 0:
+ print("odds pipeline validation failed", file=sys.stderr)
+ print(v.stdout, file=sys.stderr)
+ print(v.stderr, file=sys.stderr)
+ return 2
+
+ # Only enforce freshness if DATABASE_URL is present (so local doc work isn't blocked).
+ if os.getenv("DATABASE_URL"):
+ f = _run([sys.executable, "-m", "odds_pipeline", "freshness-guard", "--window-days", "5"])
+ if f.returncode != 0:
+ print("freshness guard failed", file=sys.stderr)
+ print(f.stdout, file=sys.stderr)
+ print(f.stderr, file=sys.stderr)
+ return 2
+
+ return 0
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
+
diff --git a/.claude/rules/freshness-and-windows.md b/.claude/rules/freshness-and-windows.md
new file mode 100644
index 000000000..e5c755277
--- /dev/null
+++ b/.claude/rules/freshness-and-windows.md
@@ -0,0 +1,34 @@
+---
+paths:
+ - "odds/**"
+ - ".github/workflows/odds-*.yml"
+---
+
+# Data freshness + rolling window rules (5 days)
+
+This project’s ML outputs are only valid if inputs are **fresh** and the training/inference data is bounded to a **rolling 5-day window**.
+
+## Freshness SLO
+
+Fail the pipeline if any required input stream is stale.
+
+Recommended defaults (tune per sport/market cadence):
+- **Odds snapshots**: stale if `max(collected_at)` is older than **180 minutes**
+- **Scores/finals**: stale if `max(collected_at)` is older than **24 hours**
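A minimal staleness check along these lines (names are illustrative; the thresholds mirror the defaults above):

```python
from datetime import datetime, timedelta, timezone

# Illustrative thresholds matching the recommended defaults.
ODDS_MAX_AGE = timedelta(minutes=180)
SCORES_MAX_AGE = timedelta(hours=24)


def is_stale(max_collected_at, max_age, now=None):
    """Return True if the newest snapshot timestamp is older than the allowed age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - max_collected_at > max_age
```

A guard command would evaluate this per input stream and exit non-zero if any stream is stale.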
+
+## Rolling 5-day window
+
+All downstream datasets (features, training rows, prediction features) must be computable from the last **5 days** of canonical + raw inputs.
+
+Implementation requirements:
+- Every compute job must accept `--window-days 5` (default 5).
+- Normalization must support backfill with an explicit `--lookback-days 5`.
+- Any retention/pruning job must never delete within the active 5-day window.
+
+## Backfill on staleness
+
+If freshness checks fail:
+- run a bounded backfill (lookback 5 days)
+- re-run normalization + validation
+- re-check freshness before training/predicting
+
diff --git a/.claude/rules/ml-training-contracts.md b/.claude/rules/ml-training-contracts.md
new file mode 100644
index 000000000..4244a0ef3
--- /dev/null
+++ b/.claude/rules/ml-training-contracts.md
@@ -0,0 +1,25 @@
+---
+paths:
+ - "odds/**"
+---
+
+# ML training + prediction contracts
+
+This project is designed so that models can be trained and evaluated deterministically from canonical data in the last 5 days.
+
+## Requirements
+
+- Training jobs must:
+ - log the dataset time window used
+ - record training timestamp and model version identifier
+ - output evaluation metrics (at minimum: calibration/accuracy proxies appropriate to the target)
+- Prediction jobs must:
+ - refuse to run if freshness checks fail
+ - attach the model version + data window to every prediction artifact
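A sketch of the provenance wrapper a prediction job might emit (the shape and field names are illustrative, not the pipeline's actual artifact format):

```python
from datetime import datetime, timezone


def build_prediction_artifact(predictions, model_version, window_days):
    """Wrap raw predictions with the provenance the contract requires."""
    return {
        "model_version": model_version,       # which trained model produced these
        "window_days": window_days,           # data window the features came from
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "predictions": predictions,
    }
```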
+
+## Targets
+
+- **Spreads**: predict cover probability from the team perspective (requires consistent sign conventions).
+- **Moneylines**: predict win probability (compare to implied probs for edge).
+- **Totals**: predict over probability relative to the canonical total.
+
diff --git a/.claude/rules/odds-normalization.md b/.claude/rules/odds-normalization.md
new file mode 100644
index 000000000..72b4e36bc
--- /dev/null
+++ b/.claude/rules/odds-normalization.md
@@ -0,0 +1,43 @@
+---
+paths:
+ - "odds/**"
+---
+
+# Odds normalization rules (canonical markets)
+
+These rules prevent sign-convention errors and ensure math is consistent across collectors, analytics, and ML training.
+
+## Key distinction
+
+**Favorite/underdog is determined by the spread's sign; home/away is determined by venue and is independent of favorite status.** Do not conflate them.
+
+## Spreads (one canonical value per game)
+
+Sportsbooks/APIs often return **two outcomes per event** with opposite signs (e.g., -7 and +7). Those represent the *same* market.
+
+Store exactly **one canonical record per event/book/collected_at** using either:
+
+- **Option A (allowed)**: store the **favorite spread** (always negative or 0).
+- **Option B (preferred)**: store `spread_magnitude` (always positive) and explicit `favorite_team`/`underdog_team`.
+
+Never average raw `point` values that include both + and -.
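A minimal sketch of Option B, assuming the two raw outcomes arrive as `(team, point)` pairs (the function name, field names, and team names are illustrative):

```python
def canonicalize_spread(outcomes):
    """Collapse two mirrored +/- outcomes for one market into one canonical record.

    `outcomes` is a list of two (team, point) pairs, e.g.
    [("Duke", -7.0), ("UNC", 7.0)]. The favorite is the side with the
    negative point; a pick'em (both 0) defaults to the first team listed.
    """
    if len(outcomes) != 2:
        raise ValueError("expected exactly two outcomes for a spread market")
    (team_a, point_a), (team_b, point_b) = outcomes
    if abs(point_a) != abs(point_b):
        raise ValueError("outcome points are not mirrored; refusing to guess")
    favorite, underdog = (team_a, team_b) if point_a <= point_b else (team_b, team_a)
    return {
        "favorite_team": favorite,
        "underdog_team": underdog,
        "spread_magnitude": abs(point_a),
    }
```

The mirrored-points check is the invariant that prevents accidentally averaging a -7 row with a +7 row.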
+
+## Totals (one canonical value per game)
+
+Over/Under are two prices on the same number. Store one `total` value plus `over_price`/`under_price`.
+
+## Moneylines (use implied probability for math)
+
+American odds must be converted to implied probability before any aggregation or modeling.
+
+For American odds `o`:
+
+- If `o < 0`: `p = |o| / (|o| + 100)`
+- If `o > 0`: `p = 100 / (o + 100)`
+
+Never average American odds directly.
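The conversion above as a minimal sketch (the function name is illustrative):

```python
def implied_probability(american_odds):
    """Convert American odds to implied probability (still includes the vig)."""
    if american_odds == 0:
        raise ValueError("American odds of 0 are not valid")
    if american_odds < 0:
        # Favorite: e.g. -150 -> 150 / 250 = 0.6
        return abs(american_odds) / (abs(american_odds) + 100)
    # Underdog: e.g. +150 -> 100 / 250 = 0.4
    return 100 / (american_odds + 100)
```

Note these probabilities include the bookmaker's margin, so the two sides of a market sum to slightly more than 1.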
+
+## Line movement convention (favorite perspective)
+
+If tracking spread movement, compute deltas from the **favorite’s spread** (negative). This avoids mixing perspectives.
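A sketch of that convention, assuming both snapshots are already expressed as the favorite's (non-positive) spread:

```python
def spread_movement(earlier_favorite_spread, later_favorite_spread):
    """Delta in the favorite's spread; a negative delta means the line moved
    further toward the favorite (e.g. -7.0 -> -7.5 gives -0.5)."""
    for s in (earlier_favorite_spread, later_favorite_spread):
        if s > 0:
            raise ValueError("favorite spreads must be <= 0")
    return later_favorite_spread - earlier_favorite_spread
```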
+
diff --git a/.claude/rules/storage-schema.md b/.claude/rules/storage-schema.md
new file mode 100644
index 000000000..6bb2c6ba1
--- /dev/null
+++ b/.claude/rules/storage-schema.md
@@ -0,0 +1,25 @@
+---
+paths:
+ - "odds/**"
+---
+
+# Storage and schema contract
+
+GitHub Actions runners are ephemeral. **All pipeline state must live in persistent storage**.
+
+## Required environment variables
+
+- `DATABASE_URL`: Postgres connection string for the persistent store.
+- `ODDS_API_KEY`: The Odds API key (or equivalent) for collectors.
+
+## Schema principles
+
+- **Raw tables**: append-only snapshots; never mutated in place.
+- **Canonical tables**: derived from raw via normalization; can be re-derived deterministically.
+- **Idempotency**: collectors must not create duplicates for the same `(source,event_id,market,bookmaker,collected_at)` tuple.
+- **Time zone**: store timestamps in UTC and only convert for presentation.
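In Postgres this is typically enforced with a unique index plus `INSERT ... ON CONFLICT DO NOTHING`; the same invariant sketched in memory (illustrative, not the pipeline's actual storage code):

```python
def upsert_snapshots(existing, rows):
    """Insert rows keyed on the dedupe tuple; reruns with the same rows are no-ops."""
    for row in rows:
        key = (row["source"], row["event_id"], row["market"],
               row["bookmaker"], row["collected_at"])
        existing.setdefault(key, row)  # first write wins; duplicates are ignored
    return existing
```

With a real database the same tuple becomes a unique constraint, which is what makes collector reruns safe.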
+
+## Market canonicalization
+
+Canonical tables must follow the rules in `odds-normalization.md`.
+
diff --git a/.claude/settings.json b/.claude/settings.json
new file mode 100644
index 000000000..f911c6db7
--- /dev/null
+++ b/.claude/settings.json
@@ -0,0 +1,29 @@
+{
+ "env": {
+ "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
+ },
+ "teammateMode": "in-process",
+ "hooks": {
+ "TaskCompleted": [
+ {
+ "hooks": [
+ {
+ "type": "command",
+ "command": "python \"$CLAUDE_PROJECT_DIR/.claude/hooks/task_gate.py\""
+ }
+ ]
+ }
+ ],
+ "TeammateIdle": [
+ {
+ "hooks": [
+ {
+ "type": "command",
+ "command": "python \"$CLAUDE_PROJECT_DIR/.claude/hooks/task_gate.py\""
+ }
+ ]
+ }
+ ]
+ }
+}
+
diff --git a/.claude/skills/betting-data-normalizing/SKILL.md b/.claude/skills/betting-data-normalizing/SKILL.md
new file mode 100644
index 000000000..096716b12
--- /dev/null
+++ b/.claude/skills/betting-data-normalizing/SKILL.md
@@ -0,0 +1,39 @@
+---
+name: betting-data-normalizing
+description: Mandatory normalization rules for spreads, totals, and moneylines. Use for ANY sports betting analytics or ML work in this repo.
+---
+
+# Betting Data Normalizing (repo standard)
+
+## Spreads
+
+- APIs/books often return two outcomes per game with opposite signs (e.g., -7 and +7). They represent the **same** spread.
+- Store **one canonical record per game**.
+- Recommended representation:
+  - `spread_magnitude`: always positive
+  - `favorite_team` and `underdog_team`
+  - prices for each side
+
+Never average raw point values that mix negative and positive spreads.
+
+## Totals
+
+- Store **one total** per game plus `over_price` and `under_price`.
+- Do not store separate Over/Under rows as separate totals.
+
+## Moneylines
+
+- Convert American odds to implied probability before doing math.
+
+If `odds < 0`:
+`p = abs(odds) / (abs(odds) + 100)`
+
+If `odds > 0`:
+`p = 100 / (odds + 100)`
+
+Never average American odds directly.
+
+## Movement convention
+
+Track spread movement from the **favorite’s perspective** (negative spread). This avoids mixing perspectives between teams.
+
diff --git a/.claude/skills/odds-collecting/SKILL.md b/.claude/skills/odds-collecting/SKILL.md
new file mode 100644
index 000000000..2fcf8269a
--- /dev/null
+++ b/.claude/skills/odds-collecting/SKILL.md
@@ -0,0 +1,28 @@
+---
+name: odds-collecting
+description: Collect scores and odds in a rolling window with retries, deduplication, and freshness guarantees.
+---
+
+# Odds Collecting (repo standard)
+
+## Goals
+
+- Keep data fresh for a rolling **5-day** ML window.
+- Make collectors **idempotent** and safe to rerun.
+- Track costs/quotas and avoid redundant polling.
+
+## Collector requirements
+
+- Always accept explicit arguments:
+  - `--lookback-days` (default 5)
+  - `--sport` (e.g., `basketball_ncaab`)
+  - `--regions` and `--markets` when applicable
+- Always write timestamps in UTC (`collected_at`).
+- Use `event_id` (or equivalent) as the primary dedupe key.
+- Handle rate limits with exponential backoff.
+
+## Freshness
+
+- Provide a `freshness_guard` command that fails when data is stale.
+- On staleness, run bounded backfill (lookback 5 days), then re-normalize.
+
diff --git a/.github/workflows/convetional-commit.yml b/.github/workflows/convetional-commit.yml
index 455265924..4da5f79bd 100644
--- a/.github/workflows/convetional-commit.yml
+++ b/.github/workflows/convetional-commit.yml
@@ -21,6 +21,6 @@ jobs:
pull-requests: read
steps:
- if: github.event_name != 'merge_group'
- uses: amannn/action-semantic-pull-request@48f256284bd46cdaab1048c3721360e808335d50 # v6.1.1
+ uses: amannn/action-semantic-pull-request@v7
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
diff --git a/.github/workflows/odds-collect.yml b/.github/workflows/odds-collect.yml
new file mode 100644
index 000000000..a8d1be02a
--- /dev/null
+++ b/.github/workflows/odds-collect.yml
@@ -0,0 +1,61 @@
+name: Odds pipeline - collect odds + normalize
+
+permissions: read-all
+
+on:
+ workflow_dispatch:
+ inputs:
+ sport:
+ description: Sport key (e.g. basketball_ncaab)
+ required: false
+ default: basketball_ncaab
+ regions:
+ description: Regions (comma-separated)
+ required: false
+ default: us
+ markets:
+ description: Markets (comma-separated)
+ required: false
+ default: h2h,spreads,totals
+ schedule:
+ - cron: "*/15 * * * *"
+
+jobs:
+ collect-odds:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Check out repository
+        uses: actions/checkout@v6
+
+ - name: Install uv
+ run: |
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
+ - name: Init schema (idempotent)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline.schema
+
+ - name: Collect odds snapshots
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ ODDS_API_KEY: ${{ secrets.ODDS_API_KEY }}
+ working-directory: odds
+ run: >
+ uv run python -m odds_pipeline collect-odds
+ --sport "${{ inputs.sport || 'basketball_ncaab' }}"
+ --regions "${{ inputs.regions || 'us' }}"
+ --markets "${{ inputs.markets || 'h2h,spreads,totals' }}"
+
+ - name: Normalize (rolling window)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline normalize --window-days 5
+
+ - name: Validate invariants (fast)
+ working-directory: odds
+ run: uv run python -m odds_pipeline validate
+
diff --git a/.github/workflows/odds-freshness-guard.yml b/.github/workflows/odds-freshness-guard.yml
new file mode 100644
index 000000000..498a8a65e
--- /dev/null
+++ b/.github/workflows/odds-freshness-guard.yml
@@ -0,0 +1,90 @@
+name: Odds pipeline - freshness guard
+
+permissions:
+ contents: read
+ issues: write
+
+on:
+ workflow_dispatch:
+ schedule:
+ - cron: "35 * * * *"
+
+jobs:
+ freshness-guard:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Check out repository
+ uses: actions/checkout@v6
+
+ - name: Install uv
+ run: |
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
+ - name: Init schema (idempotent)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline.schema
+
+ - name: Run freshness guard
+ id: guard
+ continue-on-error: true
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline freshness-guard --window-days 5
+
+ - name: Bounded backfill + re-check freshness
+ id: backfill
+ if: ${{ steps.guard.outcome != 'success' }}
+ continue-on-error: true
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ ODDS_API_KEY: ${{ secrets.ODDS_API_KEY }}
+ run: |
+ cd odds
+ uv run python -m odds_pipeline backfill --sport basketball_ncaab --lookback-days 5
+ uv run python -m odds_pipeline freshness-guard --window-days 5
+
+ - name: Open issue if stale
+ if: ${{ steps.guard.outcome != 'success' && steps.backfill.outcome != 'success' }}
+ uses: actions/github-script@v7
+ with:
+ script: |
+ const title = "Odds pipeline stale data detected";
+ const body = [
+ "Freshness guard failed in scheduled run.",
+ "",
+ "- Check `DATABASE_URL` connectivity",
+ "- Check collectors (odds + scores) and API quota",
+ "- Run bounded 5-day backfill then re-normalize",
+ ].join("\n");
+ const { data: issues } = await github.rest.issues.listForRepo({
+ owner: context.repo.owner,
+ repo: context.repo.repo,
+ state: "open",
+ labels: "odds-pipeline,stale-data",
+ }).catch(() => ({ data: [] }));
+ if (issues.length > 0) {
+ core.info("Existing stale-data issue already open; skipping create.");
+ return;
+ }
+ try {
+ await github.rest.issues.create({
+ owner: context.repo.owner,
+ repo: context.repo.repo,
+ title,
+ body,
+ labels: ["odds-pipeline", "stale-data"],
+ });
+ } catch (e) {
+ core.warning("Issue labels may not exist yet; creating issue without labels.");
+ await github.rest.issues.create({
+ owner: context.repo.owner,
+ repo: context.repo.repo,
+ title,
+ body,
+ });
+ }
+
diff --git a/.github/workflows/odds-kenpom.yml b/.github/workflows/odds-kenpom.yml
new file mode 100644
index 000000000..067d82f4f
--- /dev/null
+++ b/.github/workflows/odds-kenpom.yml
@@ -0,0 +1,51 @@
+name: Odds pipeline - KenPom metrics (daily)
+
+permissions: read-all
+
+on:
+ workflow_dispatch:
+ schedule:
+ - cron: "40 9 * * *"
+
+jobs:
+ kenpom:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Check out repository
+ uses: actions/checkout@v6
+
+ - name: Install uv
+ run: |
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
+ - name: Init schema (idempotent)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline.schema
+
+ - name: Collect KenPom ratings
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ KENPOM_EMAIL: ${{ secrets.KENPOM_EMAIL }}
+ KENPOM_PASSWORD: ${{ secrets.KENPOM_PASSWORD }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline collect-kenpom --season 2026 --metric-type pomeroy_ratings
+
+ - name: Collect KenPom efficiency
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ KENPOM_EMAIL: ${{ secrets.KENPOM_EMAIL }}
+ KENPOM_PASSWORD: ${{ secrets.KENPOM_PASSWORD }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline collect-kenpom --season 2026 --metric-type efficiency
+
+ - name: Collect KenPom four factors
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ KENPOM_EMAIL: ${{ secrets.KENPOM_EMAIL }}
+ KENPOM_PASSWORD: ${{ secrets.KENPOM_PASSWORD }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline collect-kenpom --season 2026 --metric-type four_factors
+
diff --git a/.github/workflows/odds-predict.yml b/.github/workflows/odds-predict.yml
new file mode 100644
index 000000000..25f3ee3c2
--- /dev/null
+++ b/.github/workflows/odds-predict.yml
@@ -0,0 +1,66 @@
+name: Odds pipeline - predict (hourly)
+
+permissions: read-all
+
+on:
+ workflow_dispatch:
+ inputs:
+ model_version:
+ description: Model version string (if empty, uses last trained timestamp)
+ required: false
+ default: ""
+ schedule:
+ - cron: "5 * * * *"
+
+jobs:
+ predict:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Check out repository
+ uses: actions/checkout@v6
+
+ - name: Install uv
+ run: |
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
+ - name: Init schema (idempotent)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline.schema
+
+ - name: Freshness guard (must pass)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline freshness-guard --window-days 5
+
+ - name: Train (if needed) and pick model version
+ id: model
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ shell: bash
+ run: |
+ set -euo pipefail
+ if [ -n "${{ inputs.model_version }}" ]; then
+ echo "model_version=${{ inputs.model_version }}" >> "$GITHUB_OUTPUT"
+ exit 0
+ fi
+ cd odds
+ uv run python -m odds_pipeline train --window-days 5 > /tmp/train.json
+          model_version="$(uv run python -c "import json; print(json.load(open('artifacts/model.json'))['model_version'])")"
+ echo "model_version=$model_version" >> "$GITHUB_OUTPUT"
+
+ - name: Predict
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline predict --window-days 5 --model-version "${{ steps.model.outputs.model_version }}"
+
+ - name: Upload predictions artifact
+ uses: actions/upload-artifact@v4
+ with:
+ name: odds-predictions
+ path: odds/artifacts/predictions.json
+
diff --git a/.github/workflows/odds-scores.yml b/.github/workflows/odds-scores.yml
new file mode 100644
index 000000000..f881c27e3
--- /dev/null
+++ b/.github/workflows/odds-scores.yml
@@ -0,0 +1,58 @@
+name: Odds pipeline - collect scores
+
+permissions: read-all
+
+on:
+ workflow_dispatch:
+ inputs:
+ sport:
+ description: Sport key (e.g. basketball_ncaab)
+ required: false
+ default: basketball_ncaab
+ days_from:
+ description: Rolling lookback days for scores
+ required: false
+ default: "5"
+ schedule:
+ - cron: "17 */6 * * *"
+
+jobs:
+ collect-scores:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Check out repository
+ uses: actions/checkout@v6
+
+ - name: Install uv
+ run: |
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
+ - name: Init schema (idempotent)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline.schema
+
+ - name: Collect rolling scores
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ ODDS_API_KEY: ${{ secrets.ODDS_API_KEY }}
+ working-directory: odds
+ run: >
+ uv run python -m odds_pipeline collect-scores
+ --sport "${{ inputs.sport || 'basketball_ncaab' }}"
+ --days-from "${{ inputs.days_from || '5' }}"
+
+ - name: Collect ESPN schedules/scores (supplement)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline collect-espn --lookback-days 5
+
+ - name: Collect Action Network odds page (supplement, best-effort)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline collect-action-network
+
diff --git a/.github/workflows/odds-train.yml b/.github/workflows/odds-train.yml
new file mode 100644
index 000000000..ce6338b73
--- /dev/null
+++ b/.github/workflows/odds-train.yml
@@ -0,0 +1,45 @@
+name: Odds pipeline - train (daily)
+
+permissions: read-all
+
+on:
+ workflow_dispatch:
+ schedule:
+ - cron: "25 10 * * *"
+
+jobs:
+ train:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Check out repository
+ uses: actions/checkout@v6
+
+ - name: Install uv
+ run: |
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
+ - name: Init schema (idempotent)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline.schema
+
+ - name: Freshness guard (must pass)
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline freshness-guard --window-days 5
+
+ - name: Train
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+ working-directory: odds
+ run: uv run python -m odds_pipeline train --window-days 5
+
+ - name: Upload model artifact
+ uses: actions/upload-artifact@v4
+ with:
+ name: odds-model
+ path: odds/artifacts/model.json
+
diff --git a/.github/workflows/pre-release.yml b/.github/workflows/pre-release.yml
index 5518483d2..f425280e2 100644
--- a/.github/workflows/pre-release.yml
+++ b/.github/workflows/pre-release.yml
@@ -10,19 +10,28 @@ on:
jobs:
pre-release:
- name: 'Verify MCP server schema unchanged'
+ name: 'Pre-release checks'
runs-on: ubuntu-latest
steps:
- name: Check out repository
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@v6
with:
fetch-depth: 2
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+        uses: actions/setup-node@v6
with:
cache: npm
node-version-file: '.nvmrc'
+ - name: Install dependencies
+ run: npm ci
+
+ - name: Generate documents
+ run: npm run docs
+
+ - name: Run docs smoke check
+ run: npm run docs:lint
+
- name: Verify server.json
run: npm run verify-server-json-version
diff --git a/.github/workflows/presubmit.yml b/.github/workflows/presubmit.yml
index d738d4bff..b7452c113 100644
--- a/.github/workflows/presubmit.yml
+++ b/.github/workflows/presubmit.yml
@@ -16,12 +16,12 @@ jobs:
steps:
- name: Check out repository
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@v6
with:
fetch-depth: 2
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+        uses: actions/setup-node@v6
with:
cache: npm
node-version-file: '.nvmrc'
@@ -38,12 +38,12 @@ jobs:
steps:
- name: Check out repository
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+ uses: actions/checkout@v6 # v6.0.2
with:
fetch-depth: 2
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+ uses: actions/setup-node@v6 # v6.2.0
with:
cache: npm
node-version-file: '.nvmrc'
@@ -54,12 +54,15 @@ jobs:
- name: Generate documents
run: npm run docs
+ - name: Run docs smoke check
+ run: npm run docs:lint
+
- name: Check if autogenerated docs differ
run: |
diff_file=$(mktemp doc_diff_XXXXXX)
git diff --color > $diff_file
if [[ -s $diff_file ]]; then
- echo "Please update the documentation by running 'npm run generate-docs'. The following was the diff"
+ echo "Please update the documentation by running 'npm run docs'. The following was the diff"
cat $diff_file
rm $diff_file
exit 1
diff --git a/.github/workflows/publish-to-npm-on-tag.yml b/.github/workflows/publish-to-npm-on-tag.yml
index 76ad41a58..20f4036bd 100644
--- a/.github/workflows/publish-to-npm-on-tag.yml
+++ b/.github/workflows/publish-to-npm-on-tag.yml
@@ -25,12 +25,12 @@ jobs:
if: ${{ (github.event_name != 'workflow_dispatch') || (inputs.npm-publish && always()) }}
steps:
- name: Check out repository
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@v6
with:
fetch-depth: 2
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+        uses: actions/setup-node@v6
with:
cache: npm
node-version-file: '.nvmrc'
@@ -58,12 +58,12 @@ jobs:
if: ${{ (github.event_name != 'workflow_dispatch' && needs.publish-to-npm.result == 'success') || (inputs.mcp-publish && always()) }}
steps:
- name: Check out repository
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+ uses: actions/checkout@v6 # v6.0.2
with:
fetch-depth: 2
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+        uses: actions/setup-node@v6
with:
cache: npm
node-version-file: '.nvmrc'
diff --git a/.github/workflows/run-tests.yml b/.github/workflows/run-tests.yml
index 8aa274f0d..923f71510 100644
--- a/.github/workflows/run-tests.yml
+++ b/.github/workflows/run-tests.yml
@@ -27,12 +27,12 @@ jobs:
- 24
steps:
- name: Check out repository
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@v6
with:
fetch-depth: 2
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+        uses: actions/setup-node@v6
with:
cache: npm
node-version: 22 # build works only with 22+.
@@ -47,7 +47,7 @@ jobs:
NODE_OPTIONS: '--max_old_space_size=4096'
- name: Set up Node.js
- uses: actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238 # v6.2.0
+        uses: actions/setup-node@v6
with:
cache: npm
node-version: ${{ matrix.node }}
diff --git a/.gitignore b/.gitignore
index 2ea3f0b63..5ac817dc8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -148,4 +148,9 @@ build/
log.txt
-.DS_Store
\ No newline at end of file
+.DS_Store
+
+# Claude Code config
+# Keep shared Claude Code project config in git, but ignore local-only files.
+.claude/settings.local.json
+.claude/agent-memory-local/
\ No newline at end of file
diff --git a/README.md b/README.md
index 3061d9519..e9ab6aaaf 100644
--- a/README.md
+++ b/README.md
@@ -119,6 +119,12 @@ Chrome DevTools MCP will not start the browser instance automatically using this
claude mcp add chrome-devtools --scope user npx chrome-devtools-mcp@latest
```
+**Don’t start automatically, preserve context, and defer tool use:** The Chrome DevTools MCP server does not start the browser until the first tool that needs it runs. To avoid loading all MCP tool definitions up front and to preserve context, use **MCP Tool Search** in Claude Code so tools are loaded on demand:
+
+- Claude Code enables **Tool Search** automatically when your MCP tools would use more than 10% of the context window (see the "Scale with MCP Tool Search" section of the MCP docs). When that happens, tool definitions are deferred and Claude uses a search tool to load only the tools it needs.
+- No extra configuration is required to rely on this behavior. To force it on (or adjust the threshold), use the tool search options in Claude Code (e.g. run `/config` and check the MCP/tool search options, or set `enable_tool_search` in your Claude Code settings as documented in the MCP docs).
+- Optionally add `serverInstructions` to your MCP server config so Tool Search can discover Chrome DevTools tools more reliably (e.g. describe browser automation, debugging, network inspection, and performance traces).
+
diff --git a/docs/CALIBRATION_EXPERT_GUIDE.md b/docs/CALIBRATION_EXPERT_GUIDE.md
new file mode 100644
index 000000000..dbc7704b8
--- /dev/null
+++ b/docs/CALIBRATION_EXPERT_GUIDE.md
@@ -0,0 +1,73 @@
+# Model Calibration Expert Guide
+
+This document explains the methodology behind the context-aware calibration used in
+`scripts/prediction/apply_context_aware_calibration.py`.
+
+> Status: **draft** – calibration constants and procedures are expected to evolve as
+> more validation data becomes available.
+
+## Purpose
+
+The goal of context-aware calibration is to correct systematic biases in model
+totals predictions by conditioning adjustments on:
+
+- **Scoring range** (low/mid/high)
+- **Pace** (slow/moderate/fast)
+- **Team quality** (elite defenses, large efficiency mismatches)
+
+Rather than adding the same offset to every game, the script computes a
+per-game adjustment based on these contextual signals.
+
+See the docstring and comments in `scripts/prediction/apply_context_aware_calibration.py`
+for the exact thresholds and constants currently in use.
+
+## Implementation Overview
+
+The calibration pipeline implemented in `apply_context_aware_calibration.py`:
+
+1. **Stores raw predictions**
+ - `predicted_total_raw`
+ - `predicted_home_score_raw`
+ - `predicted_away_score_raw`
+2. **Computes a context adjustment** via `calculate_context_adjustment`:
+ - Scoring band bias (low / mid / high totals)
+ - Pace bias (if `avg_tempo` is present)
+ - Elite defense bias (if `home_adj_d` / `away_adj_d` present)
+ - Mismatch bias (if `home_adj_em` / `away_adj_em` present)
+3. **Applies the adjustment**:
+ - New `predicted_total = predicted_total_raw + calibration_adjustment`
+ - Home/away scores shifted proportionally while keeping the margin intact
+4. **Logs aggregate behavior**:
+ - Average adjustment
+ - Adjustment range
+ - Count of low/mid/high-scoring games
+
+The script also records a human-readable `calibration_reasons` string that
+summarizes which contextual rules fired for each game.
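One way to shift both scores while keeping the margin intact is an even split of the adjustment (a sketch; the script's exact redistribution scheme may differ):

```python
def apply_total_adjustment(home_raw, away_raw, adjustment):
    """Split a totals adjustment evenly across both scores.

    The total moves by `adjustment` while the predicted margin
    (home - away) is unchanged.
    """
    home = home_raw + adjustment / 2
    away = away_raw + adjustment / 2
    return home, away
```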
+
+## When to Use This Script
+
+Use `apply_context_aware_calibration.py` **after** you have generated base model
+predictions and want to:
+
+- Reduce systematic over/underprediction in certain game types.
+- Preserve the underlying model ordering while nudging totals into better-calibrated
+ ranges.
+
+Example:
+
+```bash
+uv run python scripts/prediction/apply_context_aware_calibration.py \
+ --input predictions/2026-02-08_fresh.csv \
+ --output predictions/2026-02-08_context_calibrated.csv
+```
+
+## Future Work
+
+- Periodically recompute calibration constants from out-of-sample data.
+- Add automated reports that compare:
+ - Raw vs calibrated Brier score / log-loss.
+ - Calibration curves by scoring band and tempo band.
+- Consider integrating calibration into the main training pipeline once
+ behavior is stable.
+
diff --git a/docs/MODEL_CALIBRATION_FINDINGS.md b/docs/MODEL_CALIBRATION_FINDINGS.md
new file mode 100644
index 000000000..f62d8ee26
--- /dev/null
+++ b/docs/MODEL_CALIBRATION_FINDINGS.md
@@ -0,0 +1,65 @@
+# Model Calibration Findings - 2026-02-07
+
+This document summarizes the bias analysis that motivated the one-time
+calibration fix implemented in `scripts/prediction/apply_calibration_fix.py`.
+
+> Status: **snapshot** – describes the specific 2026-02-07 underprediction issue.
+> Future calibration work should extend this document or be captured in a new
+> dated section.
+
+## Background
+
+On 2026-02-07, evaluation of the totals model against actual game results
+identified a **systematic underprediction** of final scores:
+
+- Average miss vs final totals was approximately **+4.5 points** (actual higher).
+- Bias was consistent across a wide range of matchups and totals bands.
+
+To avoid retraining the model mid-slate, a lightweight correction layer was
+introduced that:
+
+- Preserves the model’s relative ordering between games.
+- Applies a uniform upward shift to totals predictions.
+
+## Fix Implemented
+
+The script `scripts/prediction/apply_calibration_fix.py`:
+
+1. Stores original predictions:
+ - `predicted_total_raw`
+ - `predicted_home_score_raw`
+ - `predicted_away_score_raw`
+2. Adds a **+4.5 point bias correction** to totals:
+ - `predicted_total = predicted_total_raw + 4.5`
+3. Splits the 4.5 points evenly between the home and away scores so that:
+ - The **margin** (`predicted_home_score - predicted_away_score`) is unchanged.
+4. Optionally (when `--validate` is used) computes warning flags for games that
+ diverge too far from:
+ - KenPom-derived totals (`kenpom_total`)
+ - Market totals (`market_total`)
+ - Recent scoring averages (`recent_avg_total`)
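+
+The shift-and-split step can be sketched in a few lines (illustrative only;
+the column names follow the list above, and an even split is the simplest
+redistribution that preserves the margin):
+
+```python
+BIAS_CORRECTION = 4.5  # from the 2026-02-07 analysis window
+
+def apply_calibration_fix(game: dict) -> dict:
+    """Sketch of the hotfix: shift the total, keep the margin."""
+    # Preserve the raw predictions
+    game["predicted_total_raw"] = game["predicted_total"]
+    game["predicted_home_score_raw"] = game["predicted_home_score"]
+    game["predicted_away_score_raw"] = game["predicted_away_score"]
+
+    # Shift the total, splitting the correction evenly between the two
+    # teams so the predicted margin is unchanged
+    game["predicted_total"] += BIAS_CORRECTION
+    game["predicted_home_score"] += BIAS_CORRECTION / 2
+    game["predicted_away_score"] += BIAS_CORRECTION / 2
+    return game
+```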
+
+Command-line usage:
+
+```bash
+uv run python scripts/prediction/apply_calibration_fix.py \
+ --input predictions/2026-02-07_raw.csv \
+ --output predictions/2026-02-07_calibrated.csv
+```
+
+## Interpretation
+
+- This correction should be treated as a **temporary hotfix**, not a substitute
+ for retraining with more data.
+- The value **+4.5** is specific to the 2026-02-07 analysis window; future
+ evaluation may justify a different value or a more nuanced, context-aware
+ approach (see `docs/CALIBRATION_EXPERT_GUIDE.md`).
+
+## Next Steps
+
+- Regularly recompute calibration bias on rolling windows.
+- Prefer context-aware calibration (`apply_context_aware_calibration.py`) once
+ its methodology is fully validated.
+- Integrate calibration metrics (e.g., Brier score, calibration curves) into the
+ standard model evaluation pipeline.
+
diff --git a/docs/claude-code-docs.md b/docs/claude-code-docs.md
new file mode 100644
index 000000000..cfea3ecca
--- /dev/null
+++ b/docs/claude-code-docs.md
@@ -0,0 +1,59 @@
+# Claude Code Docs
+
+## Docs
+
+- [Orchestrate teams of Claude Code sessions](https://code.claude.com/docs/en/agent-teams.md): Coordinate multiple Claude Code instances working together as a team, with shared tasks, inter-agent messaging, and centralized management.
+- [Claude Code on Amazon Bedrock](https://code.claude.com/docs/en/amazon-bedrock.md): Learn about configuring Claude Code through Amazon Bedrock, including setup, IAM configuration, and troubleshooting.
+- [Track team usage with analytics](https://code.claude.com/docs/en/analytics.md): View Claude Code usage metrics, track adoption, and measure engineering velocity in the analytics dashboard.
+- [Authentication](https://code.claude.com/docs/en/authentication.md): Learn how to configure user authentication and credential management for Claude Code in your organization.
+- [Best Practices for Claude Code](https://code.claude.com/docs/en/best-practices.md): Tips and patterns for getting the most out of Claude Code, from configuring your environment to scaling across parallel sessions.
+- [Changelog](https://code.claude.com/docs/en/changelog.md)
+- [Checkpointing](https://code.claude.com/docs/en/checkpointing.md): Track, rewind, and summarize Claude's edits and conversation to manage session state.
+- [Use Claude Code with Chrome (beta)](https://code.claude.com/docs/en/chrome.md): Connect Claude Code to your Chrome browser to test web apps, debug with console logs, automate form filling, and extract data from web pages.
+- [Claude Code on the web](https://code.claude.com/docs/en/claude-code-on-the-web.md): Run Claude Code tasks asynchronously on secure cloud infrastructure
+- [CLI reference](https://code.claude.com/docs/en/cli-reference.md): Complete reference for Claude Code command-line interface, including commands and flags.
+- [Common workflows](https://code.claude.com/docs/en/common-workflows.md): Step-by-step guides for exploring codebases, fixing bugs, refactoring, testing, and other everyday tasks with Claude Code.
+- [Manage costs effectively](https://code.claude.com/docs/en/costs.md): Track token usage, set team spend limits, and reduce Claude Code costs with context management, model selection, extended thinking settings, and preprocessing hooks.
+- [Data usage](https://code.claude.com/docs/en/data-usage.md): Learn about Anthropic's data usage policies for Claude
+- [Claude Code on desktop](https://code.claude.com/docs/en/desktop.md): Run Claude Code tasks locally or on secure cloud infrastructure with the Claude desktop app
+- [Development containers](https://code.claude.com/docs/en/devcontainer.md): Learn about the Claude Code development container for teams that need consistent, secure environments.
+- [Discover and install prebuilt plugins through marketplaces](https://code.claude.com/docs/en/discover-plugins.md): Find and install plugins from marketplaces to extend Claude Code with new commands, agents, and capabilities.
+- [Speed up responses with fast mode](https://code.claude.com/docs/en/fast-mode.md): Get faster Opus 4.6 responses in Claude Code by toggling fast mode.
+- [Extend Claude Code](https://code.claude.com/docs/en/features-overview.md): Understand when to use CLAUDE.md, Skills, subagents, hooks, MCP, and plugins.
+- [Claude Code GitHub Actions](https://code.claude.com/docs/en/github-actions.md): Learn about integrating Claude Code into your development workflow with Claude Code GitHub Actions
+- [Claude Code GitLab CI/CD](https://code.claude.com/docs/en/gitlab-ci-cd.md): Learn about integrating Claude Code into your development workflow with GitLab CI/CD
+- [Claude Code on Google Vertex AI](https://code.claude.com/docs/en/google-vertex-ai.md): Learn about configuring Claude Code through Google Vertex AI, including setup, IAM configuration, and troubleshooting.
+- [Run Claude Code programmatically](https://code.claude.com/docs/en/headless.md): Use the Agent SDK to run Claude Code programmatically from the CLI, Python, or TypeScript.
+- [Hooks reference](https://code.claude.com/docs/en/hooks.md): Reference for Claude Code hook events, configuration schema, JSON input/output formats, exit codes, async hooks, prompt hooks, and MCP tool hooks.
+- [Automate workflows with hooks](https://code.claude.com/docs/en/hooks-guide.md): Run shell commands automatically when Claude Code edits files, finishes tasks, or needs input. Format code, send notifications, validate commands, and enforce project rules.
+- [How Claude Code works](https://code.claude.com/docs/en/how-claude-code-works.md): Understand the agentic loop, built-in tools, and how Claude Code interacts with your project.
+- [Interactive mode](https://code.claude.com/docs/en/interactive-mode.md): Complete reference for keyboard shortcuts, input modes, and interactive features in Claude Code sessions.
+- [JetBrains IDEs](https://code.claude.com/docs/en/jetbrains.md): Use Claude Code with JetBrains IDEs including IntelliJ, PyCharm, WebStorm, and more
+- [Customize keyboard shortcuts](https://code.claude.com/docs/en/keybindings.md): Customize keyboard shortcuts in Claude Code with a keybindings configuration file.
+- [Legal and compliance](https://code.claude.com/docs/en/legal-and-compliance.md): Legal agreements, compliance certifications, and security information for Claude Code.
+- [LLM gateway configuration](https://code.claude.com/docs/en/llm-gateway.md): Learn how to configure Claude Code to work with LLM gateway solutions. Covers gateway requirements, authentication configuration, model selection, and provider-specific endpoint setup.
+- [Connect Claude Code to tools via MCP](https://code.claude.com/docs/en/mcp.md): Learn how to connect Claude Code to your tools with the Model Context Protocol.
+- [Manage Claude's memory](https://code.claude.com/docs/en/memory.md): Learn how to manage Claude Code's memory across sessions with different memory locations and best practices.
+- [Claude Code on Microsoft Foundry](https://code.claude.com/docs/en/microsoft-foundry.md): Learn about configuring Claude Code through Microsoft Foundry, including setup, configuration, and troubleshooting.
+- [Model configuration](https://code.claude.com/docs/en/model-config.md): Learn about the Claude Code model configuration, including model aliases like `opusplan`
+- [Monitoring](https://code.claude.com/docs/en/monitoring-usage.md): Learn how to enable and configure OpenTelemetry for Claude Code.
+- [Enterprise network configuration](https://code.claude.com/docs/en/network-config.md): Configure Claude Code for enterprise environments with proxy servers, custom Certificate Authorities (CA), and mutual Transport Layer Security (mTLS) authentication.
+- [Output styles](https://code.claude.com/docs/en/output-styles.md): Adapt Claude Code for uses beyond software engineering
+- [Claude Code overview](https://code.claude.com/docs/en/overview.md): Claude Code is an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with your development tools. Available in your terminal, IDE, desktop app, and browser.
+- [Configure permissions](https://code.claude.com/docs/en/permissions.md): Control what Claude Code can access and do with fine-grained permission rules, modes, and managed policies.
+- [Create and distribute a plugin marketplace](https://code.claude.com/docs/en/plugin-marketplaces.md): Build and host plugin marketplaces to distribute Claude Code extensions across teams and communities.
+- [Create plugins](https://code.claude.com/docs/en/plugins.md): Create custom plugins to extend Claude Code with skills, agents, hooks, and MCP servers.
+- [Plugins reference](https://code.claude.com/docs/en/plugins-reference.md): Complete technical reference for Claude Code plugin system, including schemas, CLI commands, and component specifications.
+- [Quickstart](https://code.claude.com/docs/en/quickstart.md): Welcome to Claude Code!
+- [Sandboxing](https://code.claude.com/docs/en/sandboxing.md): Learn how Claude Code's sandboxed bash tool provides filesystem and network isolation for safer, more autonomous agent execution.
+- [Security](https://code.claude.com/docs/en/security.md): Learn about Claude Code's security safeguards and best practices for safe usage.
+- [Claude Code settings](https://code.claude.com/docs/en/settings.md): Configure Claude Code with global and project-level settings, and environment variables.
+- [Set up Claude Code](https://code.claude.com/docs/en/setup.md): Install, authenticate, and start using Claude Code on your development machine.
+- [Extend Claude with skills](https://code.claude.com/docs/en/skills.md): Create, manage, and share skills to extend Claude's capabilities in Claude Code. Includes custom slash commands.
+- [Claude Code in Slack](https://code.claude.com/docs/en/slack.md): Delegate coding tasks directly from your Slack workspace
+- [Customize your status line](https://code.claude.com/docs/en/statusline.md): Configure a custom status bar to monitor context window usage, costs, and git status in Claude Code
+- [Create custom subagents](https://code.claude.com/docs/en/sub-agents.md): Create and use specialized AI subagents in Claude Code for task-specific workflows and improved context management.
+- [Optimize your terminal setup](https://code.claude.com/docs/en/terminal-config.md): Claude Code works best when your terminal is properly configured. Follow these guidelines to optimize your experience.
+- [Enterprise deployment overview](https://code.claude.com/docs/en/third-party-integrations.md): Learn how Claude Code can integrate with various third-party services and infrastructure to meet enterprise deployment requirements.
+- [Troubleshooting](https://code.claude.com/docs/en/troubleshooting.md): Discover solutions to common issues with Claude Code installation and usage.
+- [Use Claude Code in VS Code](https://code.claude.com/docs/en/vs-code.md): Install and configure the Claude Code extension for VS Code. Get AI coding assistance with inline diffs, @-mentions, plan review, and keyboard shortcuts.
\ No newline at end of file
diff --git a/docs/guides/automatic-startup.md b/docs/guides/automatic-startup.md
new file mode 100644
index 000000000..6af989f7e
--- /dev/null
+++ b/docs/guides/automatic-startup.md
@@ -0,0 +1,353 @@
+# Automatic Startup Guide - Overtime.ag Collection Service
+
+## Quick Setup (Automated)
+
+### Step 1: Import Task to Task Scheduler
+
+Open **PowerShell as Administrator** and run:
+
+```powershell
+# Import the pre-configured task
+Register-ScheduledTask -Xml (Get-Content "C:\Users\omall\Documents\python_projects\sports-betting-edge\scripts\OvertimeCollectorTask.xml" | Out-String) -TaskName "OvertimeCollector" -Force
+
+# Verify it was created
+Get-ScheduledTask -TaskName "OvertimeCollector"
+```
+
+**That's it!** The service will now:
+- Start automatically 2 minutes after boot
+- Run continuously collecting odds every 15 minutes
+- Restart on failure (up to 3 times)
+- Log to `logs/overtime_service_YYYYMMDD.log`
+
+### Step 2: Verify Setup
+
+```powershell
+# Check task state
+Get-ScheduledTask -TaskName "OvertimeCollector" | Select-Object TaskName,State
+
+# Check last/next run times (these live on the task info object, not the task)
+Get-ScheduledTask -TaskName "OvertimeCollector" | Get-ScheduledTaskInfo | Select-Object TaskName,LastRunTime,NextRunTime
+
+# Start it manually to test
+Start-ScheduledTask -TaskName "OvertimeCollector"
+
+# Wait a few seconds, then check logs
+Get-Content "C:\Users\omall\Documents\python_projects\sports-betting-edge\logs\overtime_service_$(Get-Date -Format 'yyyyMMdd').log" -Tail 20
+```
+
+### Step 3: Monitor
+
+```powershell
+# View running task
+Get-ScheduledTask -TaskName "OvertimeCollector" | Get-ScheduledTaskInfo
+
+# View recent log entries
+Get-Content "logs\overtime_service_$(Get-Date -Format 'yyyyMMdd').log" -Tail 50 -Wait
+
+# View collected data
+uv run python scripts/view_collected_odds.py
+```
+
+---
+
+## Manual Setup (If Automated Fails)
+
+### Option 1: Task Scheduler GUI
+
+1. Open **Task Scheduler** (search in Start menu)
+
+2. Click **"Create Task"** (not "Create Basic Task")
+
+3. **General Tab**:
+ - Name: `OvertimeCollector`
+ - Description: `Collects overtime.ag sports betting odds every 15 minutes`
+ - ✓ Run whether user is logged on or not
+ - ✓ Run with highest privileges (optional)
+ - Configure for: Windows 10
+
+4. **Triggers Tab**:
+ - Click **New...**
+ - Begin the task: **At startup**
+ - Delay task for: **2 minutes**
+ - ✓ Enabled
+ - Click **OK**
+
+5. **Actions Tab**:
+ - Click **New...**
+ - Action: **Start a program**
+ - Program/script: `powershell.exe`
+ - Add arguments:
+ ```
+ -NoProfile -ExecutionPolicy Bypass -File "C:\Users\omall\Documents\python_projects\sports-betting-edge\scripts\start_overtime_service.ps1"
+ ```
+ - Start in: `C:\Users\omall\Documents\python_projects\sports-betting-edge`
+ - Click **OK**
+
+6. **Conditions Tab**:
+ - ✓ Start only if the following network connection is available: Any connection
+ - ☐ Stop if the computer switches to battery power
+
+7. **Settings Tab**:
+ - ✓ Allow task to be run on demand
+ - ✓ Run task as soon as possible after a scheduled start is missed
+ - ✓ If the task fails, restart every: **1 minute** (Attempt to restart up to: **3 times**)
+ - If the running task does not end when requested: **Do not stop**
+ - If the task is already running: **Do not start a new instance**
+
+8. Click **OK**
+
+9. Enter your Windows password if prompted
+
+### Option 2: PowerShell Command
+
+Run this in **PowerShell as Administrator**:
+
+```powershell
+$action = New-ScheduledTaskAction `
+ -Execute "powershell.exe" `
+ -Argument '-NoProfile -ExecutionPolicy Bypass -File "C:\Users\omall\Documents\python_projects\sports-betting-edge\scripts\start_overtime_service.ps1"' `
+ -WorkingDirectory "C:\Users\omall\Documents\python_projects\sports-betting-edge"
+
+$trigger = New-ScheduledTaskTrigger -AtStartup
+$trigger.Delay = "PT2M"  # match the 2-minute boot delay from the GUI setup
+
+$settings = New-ScheduledTaskSettingsSet `
+ -AllowStartIfOnBatteries `
+ -DontStopIfGoingOnBatteries `
+ -StartWhenAvailable `
+ -RunOnlyIfNetworkAvailable `
+ -RestartCount 3 `
+ -RestartInterval (New-TimeSpan -Minutes 1)
+
+$principal = New-ScheduledTaskPrincipal `
+ -UserId "$env:USERDOMAIN\$env:USERNAME" `
+ -LogonType Interactive `
+ -RunLevel Limited
+
+Register-ScheduledTask `
+ -TaskName "OvertimeCollector" `
+ -Action $action `
+ -Trigger $trigger `
+ -Settings $settings `
+ -Principal $principal `
+ -Description "Collects overtime.ag sports betting odds every 15 minutes for NCAA Men's Basketball analysis"
+```
+
+---
+
+## Managing the Service
+
+### Start/Stop/Status
+
+```powershell
+# Start
+Start-ScheduledTask -TaskName "OvertimeCollector"
+
+# Stop
+Stop-ScheduledTask -TaskName "OvertimeCollector"
+
+# Status
+Get-ScheduledTask -TaskName "OvertimeCollector" | Format-List *
+
+# Check if running
+Get-ScheduledTask -TaskName "OvertimeCollector" | Get-ScheduledTaskInfo
+```
+
+### View Logs
+
+```powershell
+# Today's log
+Get-Content "logs\overtime_service_$(Get-Date -Format 'yyyyMMdd').log"
+
+# Follow live (like tail -f)
+Get-Content "logs\overtime_service_$(Get-Date -Format 'yyyyMMdd').log" -Wait -Tail 20
+
+# Last 50 lines
+Get-Content "logs\overtime_service_$(Get-Date -Format 'yyyyMMdd').log" -Tail 50
+```
+
+### Modify Configuration
+
+Edit `scripts\start_overtime_service.ps1` to change:
+
+```powershell
+$env:COLLECTION_INTERVAL = "15" # Change to 10, 30, 60, etc.
+$env:COLLECTION_SPORTS = "Basketball,Football" # Add more sports
+$env:COLLECTION_SUBTYPES = "College Basketball,NFL" # Match sports
+```
+
+Then restart the task:
+```powershell
+Stop-ScheduledTask -TaskName "OvertimeCollector"
+Start-ScheduledTask -TaskName "OvertimeCollector"
+```
+
+### Disable (Without Deleting)
+
+```powershell
+Disable-ScheduledTask -TaskName "OvertimeCollector"
+```
+
+Re-enable:
+```powershell
+Enable-ScheduledTask -TaskName "OvertimeCollector"
+```
+
+### Remove Completely
+
+```powershell
+Unregister-ScheduledTask -TaskName "OvertimeCollector" -Confirm:$false
+```
+
+---
+
+## Troubleshooting
+
+### Task Won't Start
+
+1. **Check Task Scheduler permissions**:
+ ```powershell
+ # Run as Administrator
+ Get-ScheduledTask -TaskName "OvertimeCollector" | Get-ScheduledTaskInfo
+ ```
+
+2. **Verify PowerShell execution policy**:
+ ```powershell
+ Get-ExecutionPolicy
+ # Should be RemoteSigned or Unrestricted
+ ```
+
+3. **Test the script manually**:
+ ```powershell
+ & "C:\Users\omall\Documents\python_projects\sports-betting-edge\scripts\start_overtime_service.ps1"
+ ```
+
+### No Data Being Collected
+
+1. **Check logs**:
+ ```powershell
+ Get-Content "logs\overtime_service_*.log" | Select-String "ERROR"
+ ```
+
+2. **Verify network connectivity**:
+ ```powershell
+ Test-NetConnection overtime.ag -Port 443
+ ```
+
+3. **Check data directory**:
+ ```powershell
+ Get-ChildItem "data\overtime\basketball" -Recurse
+ ```
+
+### High Resource Usage
+
+If the service uses too much CPU/memory:
+
+1. **Increase collection interval**:
+ Edit `scripts\start_overtime_service.ps1`:
+ ```powershell
+ $env:COLLECTION_INTERVAL = "30" # or 60
+ ```
+
+2. **Reduce sports collected**:
+ ```powershell
+ $env:COLLECTION_SPORTS = "Basketball" # Only one sport
+ ```
+
+---
+
+## What Happens at Startup
+
+1. **Computer boots** → Wait 2 minutes
+2. **Task Scheduler** starts `start_overtime_service.ps1`
+3. **Script** sets environment variables
+4. **Service** starts collecting odds every 15 minutes
+5. **Logs** written to `logs/overtime_service_YYYYMMDD.log`
+6. **Data** saved to `data/overtime/basketball/*.parquet`
+
+## Verification Checklist
+
+After setup, verify:
+
+- [ ] Task exists: `Get-ScheduledTask -TaskName "OvertimeCollector"`
+- [ ] Task is enabled: State should be "Ready"
+- [ ] Service is running: Check logs for recent collections
+- [ ] Data is being saved: `ls data\overtime\basketball`
+- [ ] No errors in logs: `Get-Content logs\*.log | Select-String "ERROR"`
+
+## Daily Maintenance
+
+### View Stats
+
+```powershell
+# Collections today
+$logFile = "logs\overtime_service_$(Get-Date -Format 'yyyyMMdd').log"
+$collections = Select-String "Collection complete" $logFile
+Write-Host "Collections today: $($collections.Count)"
+
+# Total games collected today
+$games = Select-String "games saved" $logFile
+$totalGames = ($games | ForEach-Object { [regex]::Match($_, '\d+').Value } | Measure-Object -Sum).Sum
+Write-Host "Total games: $totalGames"
+```
+
+### Weekly Cleanup (Optional)
+
+Delete old logs and data:
+
+```powershell
+# Delete logs older than 7 days
+Get-ChildItem "logs\*.log" | Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) } | Remove-Item
+
+# Delete Parquet files older than 30 days
+Get-ChildItem "data\overtime" -Recurse -Filter "*.parquet" |
+ Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
+ Remove-Item
+```
+
+---
+
+## Success Indicators
+
+You know it's working when:
+
+1. **Task Scheduler** shows:
+ - Last Run Time: Recent timestamp
+ - Last Run Result: Success (0x0)
+ - Next Run Time: At startup
+
+2. **Logs** show:
+ ```
+ INFO:__main__:[OK] Basketball - College Basketball: 6 games saved
+ INFO:__main__:Collection complete: 6 games in 0.4s
+ INFO:__main__:Next collection in 15 minutes
+ ```
+
+3. **Data directory** grows:
+ ```
+ data/overtime/basketball/
+ ├── college_basketball_20260203_014101.parquet
+ ├── college_basketball_20260203_014524.parquet
+ ├── college_basketball_20260203_020000.parquet
+ └── ...
+ ```
+
+4. **View script** shows recent odds:
+ ```powershell
+ uv run python scripts/view_collected_odds.py
+ # Shows games with current lines
+ ```
+
+---
+
+## Next Steps After Setup
+
+1. **Let it run for 24 hours** - Build up historical data
+2. **Check for line movements**:
+ ```powershell
+ uv run python scripts/view_collected_odds.py
+ # Look for "LINE MOVEMENTS" section
+ ```
+3. **Invoke `/normalize-odds`** - Add implied probability calculations
+4. **Create overtime adapter** - `/new_adapter overtime_ag`
+5. **Integrate with KenPom** - Cross-reference team efficiency
+6. **Build alerts** - Detect steam moves and sharp money
+
+Your odds collection service is now running 24/7! 🎉
diff --git a/docs/guides/betting-tracker.md b/docs/guides/betting-tracker.md
new file mode 100644
index 000000000..4e889decd
--- /dev/null
+++ b/docs/guides/betting-tracker.md
@@ -0,0 +1,334 @@
+# Betting Tracker Guide
+
+Comprehensive guide for tracking betting performance on spreads and totals predictions.
+
+## Overview
+
+The betting tracker system consists of three main components:
+
+1. **betting_tracker.py** - Core tracking logic for bet outcomes and ROI
+2. **enter_results.py** - Interactive/batch result entry
+3. **betting_dashboard.py** - Detailed performance analytics
+
+## Quick Start
+
+### 1. Track Initial Predictions
+
+Start with your predictions file (e.g., `combined_predictions_2026-02-03.csv`):
+
+```powershell
+uv run python scripts/betting_tracker.py
+```
+
+This shows pending games and initializes tracking.
+
+### 2. Enter Game Results
+
+#### Interactive Entry
+
+```powershell
+uv run python scripts/enter_results.py
+```
+
+Prompts you for scores for each pending game.
+
+#### Batch Import
+
+Create a CSV with game results:
+
+```csv
+Away_Team,Home_Team,Away_Score,Home_Score
+Miami Ohio,Buffalo,72,85
+Akron,Eastern Michigan,68,82
+```
+
+Then import:
+
+```powershell
+uv run python scripts/enter_results.py --import-csv data/results/scores_2026-02-03.csv
+```
+
+### 3. View Dashboard
+
+```powershell
+uv run python scripts/betting_dashboard.py data/analysis/combined_predictions_2026-02-03_tracked_*.csv
+```
+
+## Command Reference
+
+### betting_tracker.py
+
+Basic tracker with summary stats.
+
+```powershell
+# Show pending games and current stats
+uv run python scripts/betting_tracker.py
+
+# Custom unit size
+uv run python scripts/betting_tracker.py --unit-size 50
+```
+
+### enter_results.py
+
+Enter or import game results.
+
+```powershell
+# Interactive entry
+uv run python scripts/enter_results.py
+
+# Batch import
+uv run python scripts/enter_results.py --import-csv scores.csv
+
+# Custom predictions file
+uv run python scripts/enter_results.py --predictions data/analysis/predictions.csv
+
+# Show summary after entry
+uv run python scripts/enter_results.py --show-summary
+
+# Specify output location
+uv run python scripts/enter_results.py --output data/tracked/results.csv
+```
+
+### betting_dashboard.py
+
+Detailed performance analytics.
+
+```powershell
+# View dashboard
+uv run python scripts/betting_dashboard.py data/analysis/tracked_results.csv
+
+# Export detailed analysis
+uv run python scripts/betting_dashboard.py data/analysis/tracked_results.csv --export data/analysis/detailed_report.csv
+```
+
+## Workflow Example
+
+Complete workflow for a day's games:
+
+```powershell
+# 1. Generate predictions (your existing workflow)
+uv run python scripts/prediction/deploy_today_predictions.py
+
+# 2. Monitor games as they complete
+# Wait for games to finish...
+
+# 3. Enter results interactively
+uv run python scripts/enter_results.py --show-summary
+
+# 4. View detailed analytics
+uv run python scripts/betting_dashboard.py data/analysis/combined_predictions_2026-02-03_tracked_*.csv
+```
+
+## Output Files
+
+### Tracked Results CSV
+
+Created by `betting_tracker.py` or `enter_results.py`:
+
+- Original prediction columns
+- `Away_Score`, `Home_Score` - Final scores
+- `Actual_Total`, `Actual_Margin` - Calculated values
+- `Spread_Result` - Win/Loss/Push for spread bet
+- `Total_Result` - Win/Loss/Push for total bet
+- `Spread_Profit`, `Total_Profit` - Dollar profit/loss per bet
+
+### Detailed Analysis CSV
+
+Created by `betting_dashboard.py --export`:
+
+- All tracked result columns
+- `Spread_Profitable`, `Total_Profitable` - Boolean flags
+- `Spread_Edge_Category`, `Total_Edge_Category` - Edge buckets (0-2, 2-4, 4-6, 6-8, 8+)
+
+## Key Metrics
+
+### Win Rate
+
+Percentage of bets won (excluding pushes):
+
+```
+Win Rate = Wins / (Wins + Losses)
+```
+
+Break-even at standard -110 juice: 52.38%
+
+### ROI (Return on Investment)
+
+```
+ROI = (Profit / Amount Wagered) * 100
+```
+
+- Positive ROI = profitable
+- Target: +5% or higher long-term
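+
+Both metrics are easy to sanity-check by hand (a sketch assuming $100 units at
+standard -110 juice):
+
+```python
+def win_rate(wins: int, losses: int) -> float:
+    # Pushes are excluded from both the numerator and the denominator
+    return wins / (wins + losses)
+
+def roi(profit: float, wagered: float) -> float:
+    return profit / wagered * 100
+
+# Break-even win rate at -110: risk $110 to win $100
+breakeven = 110 / (110 + 100)  # ~0.5238
+
+# Example record: 30-25, risking $110 per bet
+profit = 30 * 100 - 25 * 110   # $250
+print(round(win_rate(30, 25), 4), round(roi(profit, 55 * 110), 2))
+```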
+
+### Edge Threshold Analysis
+
+Performance at different edge levels:
+
+- **Edge >= 0**: All plays
+- **Edge >= 2**: Small edge threshold
+- **Edge >= 4**: Moderate edge
+- **Edge >= 6**: Strong edge
+- **Edge >= 8**: Very strong edge
+- **Edge >= 10**: Extreme edge
+
+Higher edge thresholds should show better ROI (if the model is well calibrated).
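+
+One way to verify this from tracked results is to recompute ROI at each
+threshold (a sketch; `bets` is assumed to be `(edge, profit, wagered)` tuples
+pulled from the tracked CSV):
+
+```python
+def roi_by_edge_threshold(bets, thresholds=(0, 2, 4, 6, 8, 10)):
+    """Return {threshold: ROI%} over the bets whose edge meets each threshold."""
+    report = {}
+    for t in thresholds:
+        subset = [(profit, wagered) for edge, profit, wagered in bets if edge >= t]
+        wagered = sum(w for _, w in subset)
+        profit = sum(p for p, _ in subset)
+        report[t] = round(profit / wagered * 100, 2) if wagered else None
+    return report
+```
+
+A well-calibrated model should show ROI rising (or at least not falling) as the
+threshold increases.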
+
+### Closing Line Value (CLV)
+
+Compare the line you took to the closing line for the same market:
+
+- Positive CLV = beating the closing line
+- Success metric for long-term profitability
+
+## Understanding Results
+
+### Spread Result Calculation
+
+Spread bets are graded based on the adjusted margin:
+
+```
+Adjusted Margin = Actual Margin + Spread
+```
+
+- **Win**: Adjusted margin > 0
+- **Loss**: Adjusted margin < 0
+- **Push**: Adjusted margin = 0
+
+Example: Duke -7 beats UNC by 10 points
+- Actual margin: +10 (Duke wins by 10)
+- Adjusted margin: 10 + (-7) = +3
+- Result: WIN
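+
+That grading rule can be sketched as a small helper (hypothetical, not the
+tracker's exact code):
+
+```python
+def grade_spread(actual_margin: float, spread: float) -> str:
+    """Grade a spread bet from the bet side's perspective (spread as listed, e.g. -7)."""
+    adjusted = actual_margin + spread
+    if adjusted > 0:
+        return "Win"
+    if adjusted < 0:
+        return "Loss"
+    return "Push"
+```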
+
+### Total Result Calculation
+
+Total bets compare actual total to market line:
+
+```
+Difference = Actual Total - Market Total
+```
+
+- **Over wins**: Difference > 0
+- **Under wins**: Difference < 0
+- **Push**: Difference = 0
+
+Example: Market total 145.5, actual score 148
+- Difference: 148 - 145.5 = +2.5
+- Result: OVER wins
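+
+The totals rule reduces to a similar helper (hypothetical, not the tracker's
+exact code):
+
+```python
+def grade_total(actual_total: float, market_total: float, side: str) -> str:
+    """Grade an Over/Under bet; `side` is "Over" or "Under"."""
+    diff = actual_total - market_total
+    if diff == 0:
+        return "Push"
+    winner = "Over" if diff > 0 else "Under"
+    return "Win" if side == winner else "Loss"
+```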
+
+## Profit Calculation
+
+Standard -110 juice:
+
+```
+Win: Risk $110 to win $100 (profit = $100)
+Loss: Lose $110
+Push: No profit or loss (refund)
+```
+
+With custom juice (e.g., -115):
+
+```
+Win: Risk $115 to win $100 (profit = $100)
+```
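+
+The payout rules above reduce to one small helper (a sketch that assumes
+favorite-style negative juice; `stake_to_win` is what a winning bet pays):
+
+```python
+def payout(stake_to_win: float, juice: int = -110, result: str = "Win") -> float:
+    """Profit (+/-) for a bet sized to win `stake_to_win` at negative American odds."""
+    risk = stake_to_win * abs(juice) / 100  # e.g. risk $110 to win $100 at -110
+    if result == "Win":
+        return stake_to_win
+    if result == "Loss":
+        return -risk
+    return 0.0  # Push: the stake is refunded
+```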
+
+## Best Practices
+
+### 1. Track Every Bet
+
+Enter results for every game, even losses. Complete data enables pattern analysis.
+
+### 2. Use Consistent Unit Sizing
+
+Stick to same unit size for accurate ROI calculation. Default: $100/unit.
+
+### 3. Monitor Edge Thresholds
+
+If low-edge bets underperform, consider raising minimum edge threshold.
+
+### 4. Review KenPom Accuracy
+
+Dashboard shows KenPom prediction accuracy. If consistently off, may need recalibration.
+
+### 5. Analyze Patterns
+
+Check dashboard for profitable patterns:
+- Favorites vs underdogs
+- Overs vs unders
+- High edge vs low edge
+
+### 6. Focus on CLV, Not Win Rate
+
+Short-term win rate variance is normal. Closing Line Value (CLV) is the long-term indicator.
+
+## Troubleshooting
+
+### "Game not found in predictions"
+
+Team names must match exactly. Check for:
+- Extra spaces
+- Different abbreviations (e.g., "Miami OH" vs "Miami Ohio")
+- Encoding issues
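+
+A light normalization pass before matching catches the whitespace and encoding
+cases (a sketch; it will not resolve genuine abbreviation differences):
+
+```python
+import unicodedata
+
+def normalize_team(name: str) -> str:
+    # Collapse runs of whitespace, strip accents, and casefold so that
+    # " Miami  Ohio " and "miami ohio" compare equal
+    name = unicodedata.normalize("NFKD", name)
+    name = name.encode("ascii", "ignore").decode()
+    return " ".join(name.split()).casefold()
+```
+
+Abbreviation mismatches ("Miami OH" vs "Miami Ohio") still need an explicit
+alias map.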
+
+### "Missing required columns"
+
+Ensure predictions file has all required fields from your prediction generation script.
+
+### "No results to display"
+
+No games have been graded yet. Use `enter_results.py` to add game scores.
+
+## Advanced Usage
+
+### Custom Juice Per Bookmaker
+
+Modify `betting_tracker.py` to read juice from prediction columns:
+
+```python
+spread_profit = self._calculate_payout(
+ self.unit_size,
+ game.get("Away_Spread_Juice", self.juice)
+)
+```
+
+### Multiple Unit Sizing
+
+For games with different confidence levels, track bet size:
+
+```python
+# Add unit_size column to predictions
+bet_size = game["Unit_Size"] * self.unit_size
+profit = self._calculate_payout(bet_size, juice)
+```
+
+### Closing Line Tracking
+
+Compare opening vs closing spreads/totals to track line movement and CLV.
+
+## Integration with Existing Workflow
+
+This tracker integrates with your current prediction pipeline:
+
+```
+1. KenPom data collection → 2. Model training → 3. Generate predictions
+ (odds-collecting) (walk_forward) (deploy_today)
+ ↓
+4. Predictions file → 5. Track games → 6. Enter results → 7. Analyze
+ (combined_*.csv) (betting_tracker) (enter_results) (dashboard)
+```
+
+## Files Location
+
+Default locations:
+
+- **Predictions**: `data/analysis/combined_predictions_YYYY-MM-DD.csv`
+- **Tracked results**: `data/analysis/combined_predictions_YYYY-MM-DD_tracked_*.csv`
+- **Scores import**: `data/results/scores_YYYY-MM-DD.csv`
+- **Detailed reports**: `data/analysis/detailed_report_*.csv`
+
+## Support
+
+For issues or questions:
+1. Check team name matching in predictions vs results
+2. Verify CSV format for batch imports
+3. Review logs for detailed error messages
diff --git a/docs/guides/github-ci-guide.md b/docs/guides/github-ci-guide.md
new file mode 100644
index 000000000..1ebe5c448
--- /dev/null
+++ b/docs/guides/github-ci-guide.md
@@ -0,0 +1,196 @@
+# GitHub Workflow
+
+Use the `gh` CLI to manage commits, PRs, issues, and Actions for this project.
+
+## Setup
+
+### Install gh CLI
+
+**Windows (winget):**
+```powershell
+winget install GitHub.cli
+```
+
+**macOS:**
+```bash
+brew install gh
+```
+
+**Linux (Debian/Ubuntu):**
+```bash
+sudo apt install gh
+```
+
+### Authenticate
+
+```powershell
+gh auth login
+```
+
+Follow prompts (browser or token). Verify with:
+
+```powershell
+gh auth status
+gh --version
+```
+
+When run from this repo, `gh` infers the repository from the current directory. For scripts or when running outside the repo, pass the repository explicitly, e.g. `gh pr list --repo $(./scripts/gh-repo.sh)`.
+
+---
+
+## Commits
+
+### Committer Script
+
+Use the scoped commit helper to stage only specified files:
+
+**PowerShell:**
+```powershell
+.\scripts\committer.ps1 "feat: add verbose flag" src\cli.py tools\check.py
+```
+
+**Bash / just:**
+```bash
+./scripts/committer.sh "feat: add verbose flag" src/cli.py tools/check.py
+# or
+just commit "feat: add verbose flag" src/cli.py tools/check.py
+```
+
+The script resets staging first so only the listed files are included.
+
+### Conventional Commits
+
+Use concise, action-oriented messages:
+
+```
+<type>: <description>
+```
+
+Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
+
+Examples:
+- `feat: implement batch export`
+- `fix: resolve null reference in parser`
+- `docs: update API authentication guide`
+
+### Grouping
+
+- Group related changes in a single commit
+- Each commit should be a logical unit
+- Avoid bundling unrelated refactors
+
+---
+
+## Pull Requests
+
+### Creating PRs
+
+PRs should include:
+1. **Scope summary**: What changes
+2. **Testing**: How it was validated
+3. **User-facing changes**: New flags, behavior changes, breaking changes
+
+Example description:
+```markdown
+## Summary
+Add verbose flag to CLI for debugging output.
+
+## Testing
+- Unit tests added
+- Manual testing with `--verbose` on sample payloads
+
+## User-Facing Changes
+- New `--verbose` flag on send command
+```
+
+### Reviewing PRs
+
+**Do NOT switch branches.** Use read-only commands:
+
+```bash
+gh pr view
+gh pr diff
+gh pr checks
+```
+
+### Landing PRs
+
+1. Ensure clean state: `git switch main` and `git pull`
+2. Create temp branch: `git switch -c integrate-pr-<number>`
+3. Fetch PR: `gh pr checkout <number> --detach`
+4. Merge (squash or rebase depending on history)
+5. Run gate check: `uv run python tools/check.py`
+6. Merge to main, push, cleanup
+
+When squashing, add co-author attribution:
+```
+feat: implement feature X (#123)
+
+Co-authored-by: Contributor Name <contributor@example.com>
+```
+
+---
+
+## Issues
+
+```bash
+gh issue list --state open
+gh issue view <number>
+gh issue create
+```
+
+---
+
+## Actions (CI/CD)
+
+```bash
+gh run list --limit 10
+gh run view <run-id>
+gh run view <run-id> --log-failed
+gh run watch
+```
+
+---
+
+## Sync Workflow
+
+1. If dirty: commit first with a sensible message
+2. Pull with rebase: `git pull --rebase`
+3. Resolve conflicts if any; do not force push
+4. Push: `git push`
+
+Or use: `just gh-sync`
+
+---
+
+## Changelog
+
+- Keep latest released version at top
+- Entry format: `- feat: description (#123) - thanks @contributor`
+- Reference issues: `fixes #456`
+
+---
+
+## Quick Reference
+
+| Task | Command |
+|------|---------|
+| Auth status | `gh auth status` or `just gh-status` |
+| List open PRs | `gh pr list` or `just gh-pr-list` |
+| View PR | `gh pr view <number>` or `just gh-pr-view <number>` |
+| PR CI checks | `gh pr checks <number>` or `just gh-pr-checks <number>` |
+| List issues | `gh issue list` or `just gh-issue-list` |
+| Workflow runs | `gh run list` or `just gh-run-list` |
+| Sync (rebase + push) | `just gh-sync` |
+| Scoped commit | `just commit "msg" file1 file2` |
+
+---
+
+## Review vs Land
+
+| Aspect | Review | Land |
+|--------|--------|------|
+| Switch branches | NO | YES (temp branch) |
+| Modify code | NO | YES |
+| Tools | `gh pr view`, `gh pr diff` | `git merge`, `git rebase` |
+| End state | Stay on current branch | Return to main |
diff --git a/docs/guides/model-development.md b/docs/guides/model-development.md
new file mode 100644
index 000000000..0828b5165
--- /dev/null
+++ b/docs/guides/model-development.md
@@ -0,0 +1,235 @@
+# ML Pipeline Integration - Odds API + KenPom
+
+## Summary
+
+Successfully integrated the Odds API SQLite database (3.5GB, 10.4M+ observations) with KenPom efficiency metrics into an XGBoost training pipeline for spreads and totals prediction.
+
+## Data Pipeline
+
+```
+Odds API SQLite (10.4M observations)
+ |
+ v
+Normalized Views (canonical_spreads, canonical_totals)
+ |
+ v
+FeatureEngineer Service
+ |
+ +-- KenPom Ratings (AdjEM, AdjOE, AdjDE, AdjTempo, SOS, Luck)
+ +-- KenPom Four Factors (eFG%, TO%, OR%, FT Rate, etc.)
+ +-- Line Movement (opening, closing, movement)
+ +-- Team Mapper (fuzzy matching across sources)
+ |
+ v
+Training Datasets (Parquet)
+ |
+ v
+XGBoost Models
+```
+
+## Dataset Statistics
+
+### Spreads Dataset
+- **Games**: 390 (2025-12-28 to 2026-01-29)
+- **Features**: 30
+  - KenPom metrics (12 per team): AdjEM, AdjOE, AdjDE, AdjTempo, SOS, Luck, eFG%, TO%, OR%, FT Rate, DeFG%, DTO%
+ - Matchup features (3): em_diff, fav_o_vs_dog_d, dog_o_vs_fav_d
+ - Line features (3): opening_spread, closing_spread, line_movement
+- **Target**: favorite_covered (35.9% covered rate - 140/390)
+- **File**: `data/ml/spreads_2025-12-28_2026-01-29.parquet`
+
+### Totals Dataset
+- **Games**: 366 (2025-12-28 to 2026-01-29)
+- **Features**: 31
+  - KenPom metrics (12 per team): Same as spreads
+ - Matchup features (4): avg_tempo, tempo_diff, total_offense, total_defense
+ - Line features (3): opening_total, closing_total, total_movement
+- **Target**: went_over (37.4% over rate - 137/366)
+- **File**: `data/ml/totals_2025-12-28_2026-01-29.parquet`
+
+## Team Name Mapping
+
+Created fuzzy-matched team name mapping across KenPom, ESPN, and Odds API:
+- **Total teams mapped**: 365
+- **Odds API match rate**: 95.1% (347/365)
+- **ESPN match rate**: 30.7% (112/365)
+- **Methodology**: Multi-stage matching (exact -> normalized -> prefix -> fuzzy with validation)
+- **File**: `data/team_mapping.parquet`
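+The multi-stage matching can be sketched with the standard library. The stages mirror the methodology above, but the normalization rules and the 0.85 threshold are illustrative, not the pipeline's actual settings:
+
+```python
+from difflib import SequenceMatcher
+
+def normalize(name: str) -> str:
+    """Lowercase and strip periods for comparison."""
+    return name.lower().replace(".", "").strip()
+
+def match_team(kenpom_name: str, candidates: list[str], threshold: float = 0.85) -> str | None:
+    """Multi-stage match: exact -> normalized -> prefix -> fuzzy with validation."""
+    # Stage 1: exact match
+    if kenpom_name in candidates:
+        return kenpom_name
+    norm = normalize(kenpom_name)
+    # Stage 2: normalized equality
+    for c in candidates:
+        if normalize(c) == norm:
+            return c
+    # Stage 3: prefix (KenPom uses short names, Odds API appends mascots)
+    prefix_hits = [c for c in candidates if normalize(c).startswith(norm + " ")]
+    if len(prefix_hits) == 1:
+        return prefix_hits[0]
+    # Stage 4: fuzzy ratio, accepted only above a validation threshold
+    best = max(candidates, key=lambda c: SequenceMatcher(None, norm, normalize(c)).ratio())
+    if SequenceMatcher(None, norm, normalize(best)).ratio() >= threshold:
+        return best
+    return None
+
+books = ["Duke Blue Devils", "Kentucky Wildcats", "Gonzaga Bulldogs"]
+assert match_team("Duke", books) == "Duke Blue Devils"
+```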
+
+### Sample Verified Mappings
+| KenPom | Odds API | Match Score |
+|--------|----------|-------------|
+| Duke | Duke Blue Devils | 95 |
+| Kentucky | Kentucky Wildcats | 95 |
+| Kansas | Kansas Jayhawks | 95 |
+| North Carolina | North Carolina Tar Heels | 95 |
+| Gonzaga | Gonzaga Bulldogs | 95 |
+
+## Baseline Model Performance
+
+### Spreads Model
+```
+Train Accuracy: 93.91%
+Test Accuracy: 56.41%
+Train LogLoss: 0.1648
+Test LogLoss: 0.8657
+Train AUC: 0.9918
+Test AUC: 0.5061
+```
+
+**Top Features**:
+1. fav_adj_o (0.0564) - Favorite offensive efficiency
+2. dog_luck (0.0512) - Underdog luck rating
+3. dog_sos (0.0411) - Underdog strength of schedule
+4. dog_adj_d (0.0402) - Underdog defensive efficiency
+5. dog_o_vs_fav_d (0.0396) - Matchup: underdog offense vs favorite defense
+
+### Totals Model
+```
+Train Accuracy: 90.75%
+Test Accuracy: 51.35%
+Train LogLoss: 0.1914
+Test LogLoss: 0.7406
+Train AUC: 0.9807
+Test AUC: 0.5939
+```
+
+**Top Features**:
+1. total_defense (0.0918) - Combined defensive efficiency
+2. home_defg_pct (0.0607) - Home defensive eFG%
+3. away_luck (0.0584) - Away team luck rating
+4. away_dto_pct (0.0582) - Away defensive turnover%
+5. away_adj_em (0.0438) - Away efficiency margin
+
+## Analysis
+
+### Overfitting Detected
+Both models show severe overfitting:
+- **Spreads**: 93.91% train vs 56.41% test accuracy
+- **Totals**: 90.75% train vs 51.35% test accuracy
+
+This is expected with:
+- Small dataset (390/366 games)
+- High feature count (30/31 features)
+- Default XGBoost hyperparameters
+
+Test AUC scores near 0.5 (random) indicate models have not learned generalizable patterns.
+
+### Feature Importance Insights
+
+**Spreads**:
+- Offensive efficiency (fav_adj_o) most important
+- Luck and SOS ratings significant
+- Line movement features ranked low (need more data)
+
+**Totals**:
+- Defensive metrics dominate (total_defense, defg_pct)
+- Luck ratings important
+- Line movement (total_movement) appears in top 10
+- Tempo features (avg_tempo) less important than expected
+
+## Next Steps for Improvement
+
+### 1. More Training Data
+- Current: 1 month (Dec 28 - Jan 29)
+- Target: Full season (Nov - Mar = ~5,000 games)
+- Action: Continue collecting Odds API data through season
+
+### 2. Regularization
+- Current: Default XGBoost params (max_depth=5, n_estimators=100)
+- Try:
+ - Reduce max_depth (3-4)
+ - Increase min_child_weight (5-10)
+ - Add regularization (reg_alpha, reg_lambda)
+ - Reduce feature set (drop low-importance features)
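+As a sketch, the suggestions above translate to hyperparameters along these lines (starting points to tune, not the repository's actual configuration):
+
+```python
+# Candidate settings to fight the train/test gap reported above.
+regularized_params = {
+    "max_depth": 3,           # shallower trees (down from 5)
+    "min_child_weight": 5,    # require more evidence per leaf
+    "reg_alpha": 1.0,         # L1 regularization
+    "reg_lambda": 5.0,        # L2 regularization
+    "subsample": 0.8,         # sample rows per tree
+    "colsample_bytree": 0.8,  # sample features per tree
+    "n_estimators": 100,
+    "learning_rate": 0.05,
+}
+# Passed as xgboost.XGBClassifier(**regularized_params) at training time.
+```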
+
+### 3. Cross-Validation
+- Current: Single 80/20 train/test split
+- Action: Implement k-fold cross-validation for robust evaluation
+- Consider time-based splits (train on older games, test on recent)
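+A time-based split can be as simple as sorting by date and cutting chronologically (a pure-Python sketch; the real pipeline presumably operates on DataFrames):
+
+```python
+def time_based_split(rows, date_key, train_frac=0.8):
+    """Train on the oldest games, test on the most recent ones.
+
+    Unlike a random 80/20 split, this avoids leaking future information
+    (e.g. late-season line behavior) into training.
+    """
+    ordered = sorted(rows, key=date_key)
+    cut = int(len(ordered) * train_frac)
+    return ordered[:cut], ordered[cut:]
+
+games = [{"date": f"2026-01-{d:02d}"} for d in range(1, 11)]
+train, test = time_based_split(games, date_key=lambda g: g["date"])
+assert len(train) == 8 and len(test) == 2
+assert max(g["date"] for g in train) < min(g["date"] for g in test)
+```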
+
+### 4. Additional Features
+- Advanced line movement: steam moves, RLM, consensus
+- Bookmaker-specific features (sharp vs square books)
+- Recent form (last 5 games performance)
+- Home court advantage metrics
+- Injury/roster data
+
+### 5. Alternative Models
+- Logistic regression (interpretable baseline)
+- LightGBM (faster, sometimes better than XGBoost)
+- Neural networks (if dataset grows to 5K+ games)
+
+### 6. Evaluation Against Market
+- **Critical metric**: Closing Line Value (CLV), not accuracy
+- Track ROI on bets placed vs closing lines
+- Measure calibration (predicted probabilities vs actual outcomes)
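+Calibration can be checked by bucketing predicted probabilities and comparing each bucket's observed outcome rate to its probability range (a minimal sketch, not the project's evaluation code):
+
+```python
+def calibration_buckets(probs, outcomes, n_bins=5):
+    """Group predictions into probability bins and report observed win rate.
+
+    For a well-calibrated model the observed rate tracks the bin's range.
+    Returns a list of (bin_low, bin_high, observed_rate) tuples.
+    """
+    bins = [[] for _ in range(n_bins)]
+    for p, y in zip(probs, outcomes):
+        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
+        bins[idx].append(y)
+    return [
+        (i / n_bins, (i + 1) / n_bins, sum(b) / len(b) if b else None)
+        for i, b in enumerate(bins)
+    ]
+
+probs = [0.1, 0.15, 0.9, 0.95, 0.5]
+outcomes = [0, 0, 1, 1, 1]
+report = calibration_buckets(probs, outcomes)
+assert report[0] == (0.0, 0.2, 0.0)   # low-probability picks lost
+assert report[4] == (0.8, 1.0, 1.0)   # high-probability picks won
+```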
+
+## File Structure
+
+```
+data/
+ ml/
+ spreads_2025-12-28_2026-01-29.parquet # Spreads training data
+ totals_2025-12-28_2026-01-29.parquet # Totals training data
+ team_mapping.parquet # Cross-source team names
+ odds_api/
+ odds_api.sqlite3 # Raw odds data (3.5GB)
+
+models/
+ spreads_model.json # Trained spreads model
+ totals_model.json # Trained totals model
+
+scripts/
+ build_training_datasets.py # Generate ML datasets
+ train_spreads_model.py # Train spreads model
+ train_totals_model.py # Train totals model
+ create_team_mapping.py # Generate team mappings
+
+src/sports_betting_edge/
+ adapters/
+ odds_api_db.py # Odds API database adapter
+ services/
+ feature_engineering.py # Feature extraction service
+ core/
+ team_mapper.py # Team name lookup helper
+
+sql/
+ create_normalized_views.sql # Database normalization
+```
+
+## Usage
+
+### Build Datasets
+```bash
+uv run python scripts/build_training_datasets.py \
+ --start 2025-12-28 \
+ --end 2026-01-29 \
+ --season 2026 \
+ --output-dir data/ml
+```
+
+### Train Models
+```bash
+# Spreads model
+uv run python scripts/train_spreads_model.py \
+ --data data/ml/spreads_2025-12-28_2026-01-29.parquet \
+ --output models/spreads_model.json
+
+# Totals model
+uv run python scripts/train_totals_model.py \
+ --data data/ml/totals_2025-12-28_2026-01-29.parquet \
+ --output models/totals_model.json
+```
+
+### Update Team Mapping
+```bash
+uv run python scripts/create_team_mapping.py
+```
+
+## References
+
+- Odds API: https://the-odds-api.com/
+- KenPom: https://kenpom.com/
+- Team Mapper: src/sports_betting_edge/core/team_mapper.py
+- Feature Engineering: src/sports_betting_edge/services/feature_engineering.py
diff --git a/docs/guides/odds-api-guide.md b/docs/guides/odds-api-guide.md
new file mode 100644
index 000000000..136a596ef
--- /dev/null
+++ b/docs/guides/odds-api-guide.md
@@ -0,0 +1,320 @@
+# Odds API Streaming Service
+
+Automated line movement tracking for NCAA Men's Basketball with optimal credit usage.
+
+## Overview
+
+The streaming service collects odds from The Odds API at 30-second intervals during betting hours (4 AM - 11 PM PST). This enables:
+
+- **Line movement detection**: Track how spreads and totals move throughout the day
+- **Steam move identification**: Detect sharp money and reverse line movement
+- **Closing Line Value (CLV)**: Compare your picks against closing lines
+- **Market inefficiencies**: Find bookmaker divergence and arbitrage opportunities
+
+## Credit Budget Strategy
+
+Following the `/odds-collecting` skill guidance:
+
+| Metric | Value |
+|--------|-------|
+| **Interval** | 30 seconds (2 calls/min) |
+| **Active hours** | 4 AM - 11 PM PST (19 hours) |
+| **Daily calls** | ~2,280 calls |
+| **Regions** | us, us2 (2 regions) |
+| **Daily credits** | ~4,560 credits |
+| **Monthly usage** | ~137K credits |
+| **Monthly budget** | 5M credits |
+| **Reserved** | ~4.86M for backfills and special events |
+
+This strategy provides high-resolution line tracking while reserving 97% of quota for historical analysis.
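+The arithmetic behind the table is straightforward to reproduce:
+
+```python
+# Reproduce the credit-budget table above (2 credits per call as budgeted).
+interval_seconds = 30
+active_hours = 19            # 4 AM - 11 PM PST
+credits_per_call = 2         # us + us2 regions
+
+calls_per_day = active_hours * 3600 // interval_seconds   # 2,280
+credits_per_day = calls_per_day * credits_per_call        # 4,560
+credits_per_month = credits_per_day * 30                  # 136,800 (~137K)
+monthly_budget = 5_000_000
+reserved = monthly_budget - credits_per_month             # 4,863,200 (~4.86M)
+
+assert calls_per_day == 2280
+assert credits_per_day == 4560
+```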
+
+## Quick Start
+
+### Option 1: Daemon Mode (Recommended)
+
+Runs continuously, automatically collecting during betting hours:
+
+```powershell
+# Start daemon (runs until you stop it)
+uv run python scripts/stream_odds_api.py --daemon
+```
+
+**For persistent background operation**, use Windows Task Scheduler:
+
+```powershell
+# Set up automated task (runs at boot)
+.\scripts\setup_odds_streaming_windows.ps1
+
+# Start the task
+Start-ScheduledTask -TaskName "OddsAPIStreaming"
+
+# Check status
+Get-ScheduledTask -TaskName "OddsAPIStreaming" | Select-Object State
+
+# View logs
+Get-Content data\logs\odds_api_streaming.log -Tail 50 -Wait
+```
+
+### Option 2: Manual Collection
+
+Run single snapshot (useful for testing or cron jobs):
+
+```powershell
+# Collect once
+uv run python scripts/stream_odds_api.py --once
+
+# Dry run (see what would happen)
+uv run python scripts/stream_odds_api.py --dry-run
+```
+
+## Collection Schedule
+
+| Time (PST) | Status | Rationale |
+|------------|--------|-----------|
+| 4:00 AM - 11:00 PM | ✅ Active | Betting hours for next-day games |
+| 11:00 PM - 4:00 AM | ❌ Inactive | Saves credits, minimal line movement |
+
+Collection stops automatically outside betting hours and resumes at 4 AM.
+
+## Data Storage
+
+### Database Schema
+
+**Location**: `data/odds_api/odds_api.sqlite3`
+
+```sql
+-- Events (one row per game)
+CREATE TABLE events (
+ event_id TEXT PRIMARY KEY,
+ sport_key TEXT,
+ home_team TEXT,
+ away_team TEXT,
+ commence_time TEXT,
+ created_at TEXT,
+ source TEXT,
+ has_odds INTEGER
+);
+
+-- Observations (raw odds snapshots)
+CREATE TABLE observations (
+ obs_id INTEGER PRIMARY KEY AUTOINCREMENT,
+ event_id TEXT,
+ book_key TEXT,
+ market_key TEXT,
+ outcome_name TEXT,
+ price_american INTEGER,
+ point REAL,
+ as_of TEXT,
+ fetched_at TEXT,
+ sport_key TEXT
+);
+
+-- Scores (final results)
+CREATE TABLE scores (
+ event_id TEXT PRIMARY KEY,
+ sport_key TEXT,
+ completed INTEGER,
+ home_score INTEGER,
+ away_score INTEGER,
+ last_update TEXT,
+ fetched_at TEXT
+);
+```
+
+### Normalized Views
+
+The database automatically creates canonical views that follow betting data normalization standards:
+
+- `canonical_spreads`: One row per game/book/time with spread magnitude (always positive) and favorite/underdog teams
+- `canonical_totals`: One row per game/book/time with total value
+- `canonical_moneylines`: Moneylines with implied probabilities
+- `spread_movements`: Line movement tracking with steam detection
+- `bookmaker_consensus`: Market consensus across all books
+
+**Example Query** (get closing spreads for a game):
+
+```sql
+SELECT
+ book_key,
+ favorite_team,
+ underdog_team,
+ spread_magnitude,
+ favorite_price,
+ underdog_price
+FROM canonical_spreads
+WHERE event_id = 'abc123'
+ORDER BY as_of DESC
+LIMIT 1;
+```
+
+## Verification & Monitoring
+
+### Check Latest Collection
+
+```powershell
+# View recent log entries
+Get-Content data\logs\odds_api_streaming.log -Tail 20
+
+# Run verification script
+uv run python scripts/verify_odds_streaming.py
+```
+
+### Database Queries
+
+```python
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+db = OddsAPIDatabase("data/odds_api/odds_api.sqlite3")
+
+# Get coverage stats
+stats = db.get_database_stats()
+print(f"Total events: {stats['total_events']}")
+print(f"Events with scores: {stats['events_with_scores']}")
+
+# Get line movements for a game
+movements = db.get_spread_movements(event_id="your_event_id")
+print(movements)
+
+# Get closing lines for today's games
+from datetime import date
+closing = db.get_bookmaker_closing_lines(
+ event_ids=["event1", "event2"],
+ book_keys=["fanduel", "draftkings", "pinnacle"]
+)
+print(closing)
+```
+
+### Quota Monitoring
+
+The service automatically logs quota warnings when usage exceeds 80%:
+
+```
+[WARNING] Quota below 20%: 950,000 remaining (19.0%)
+```
+
+Check current quota:
+
+```python
+from sports_betting_edge.adapters.odds_api import OddsAPIAdapter
+import asyncio
+
+async def check_quota():
+ adapter = OddsAPIAdapter()
+ await adapter.get_sports() # Make any API call
+ remaining = adapter.get_quota_remaining()
+ used = adapter.get_quota_used()
+ print(f"Remaining: {remaining:,}")
+ print(f"Used: {used:,}")
+ await adapter.close()
+
+asyncio.run(check_quota())
+```
+
+## Troubleshooting
+
+### Issue: "ODDS_API_KEY not set"
+
+**Solution**: Add your API key to `.env` file:
+
+```bash
+ODDS_API_KEY=your_key_here
+```
+
+### Issue: "No odds data returned"
+
+**Possible causes**:
+- No games scheduled for today
+- Collecting outside betting hours
+- API quota exhausted
+
+**Check**:
+```powershell
+# Test single collection
+uv run python scripts/stream_odds_api.py --once
+```
+
+### Issue: "Task not running in background"
+
+**Solution**: Check Task Scheduler status:
+
+```powershell
+Get-ScheduledTask -TaskName "OddsAPIStreaming" | Select-Object State
+
+# If stopped, start it
+Start-ScheduledTask -TaskName "OddsAPIStreaming"
+```
+
+### Issue: "Database locked"
+
+**Cause**: Multiple processes accessing database simultaneously
+
+**Solution**:
+- Ensure only one streaming daemon is running
+- Stop Task Scheduler task before manual collection
+- Use `--db` flag to specify different database for testing
+
+## Advanced Usage
+
+### Custom Database Path
+
+```powershell
+# Use different database for testing
+uv run python scripts/stream_odds_api.py --daemon --db data/test.sqlite3
+```
+
+### Manual Interval Control
+
+Edit `scripts/stream_odds_api.py`:
+
+```python
+COLLECTION_INTERVAL_SECONDS = 60 # Change from 30 to 60 seconds
+```
+
+**Trade-off**: Longer intervals save credits but reduce line movement resolution.
+
+### Custom Time Window
+
+Edit `scripts/stream_odds_api.py`:
+
+```python
+COLLECTION_START_TIME = dt_time(10, 0) # Start at 10 AM instead of 4 AM
+COLLECTION_END_TIME = dt_time(22, 0) # End at 10 PM instead of 11 PM
+```
+
+## Line Movement Analysis
+
+See `scripts/analyze_line_movement.py` for detecting:
+
+- **Steam moves**: Rapid line movement across multiple books
+- **Reverse Line Movement (RLM)**: Line moves opposite to public betting %
+- **Key number violations**: Movements through 3, 7 (NFL), 1-2 (NCAAB)
+- **Bookmaker divergence**: Sharp vs public book disagreements
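+As a sketch, a steam move can be flagged when several books shift the line in the same direction by at least half a point inside a short window (the thresholds and helper name are illustrative, not taken from `analyze_line_movement.py`):
+
+```python
+from datetime import datetime, timedelta
+
+def detect_steam(observations, min_books=3, window=timedelta(minutes=15), min_move=0.5):
+    """Return True if >= min_books moved the spread the same direction
+    by >= min_move points inside the window ending at the last snapshot.
+
+    observations: (timestamp, book_key, spread_magnitude) tuples for one
+    game, sorted by timestamp; spreads from the favorite's perspective.
+    """
+    start = observations[-1][0] - window
+    first, last = {}, {}
+    for ts, book, spread in observations:
+        if ts < start:
+            first[book] = spread            # latest value before the window
+        else:
+            first.setdefault(book, spread)  # or first value inside it
+            last[book] = spread
+    up = sum(1 for b in last if last[b] - first[b] >= min_move)
+    down = sum(1 for b in last if first[b] - last[b] >= min_move)
+    return max(up, down) >= min_books
+
+t = datetime(2026, 2, 2, 20, 0)
+obs = [
+    (t, "fanduel", 6.5), (t, "draftkings", 6.5), (t, "pinnacle", 6.5),
+    (t + timedelta(minutes=10), "fanduel", 7.5),
+    (t + timedelta(minutes=10), "draftkings", 7.0),
+    (t + timedelta(minutes=10), "pinnacle", 7.5),
+]
+assert detect_steam(obs) is True  # three books moved toward the favorite
+```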
+
+## Integration with ML Models
+
+The streaming data feeds into your prediction models:
+
+```python
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+db = OddsAPIDatabase("data/odds_api/odds_api.sqlite3")
+
+# Get line movement features for model
+features = db.get_line_movement_features(event_ids=["event1", "event2"])
+
+# Columns: avg_opening_spread, avg_closing_spread, avg_total_movement,
+# hours_tracked, movement_velocity
+
+# Get consensus divergence (market inefficiency indicator)
+divergence = db.get_consensus_divergence(event_ids=["event1", "event2"])
+
+# Columns: consensus_spread, spread_variance, has_market_disagreement,
+# outlier_book_count, num_books
+```
+
+## See Also
+
+- `/odds-collecting` skill: Complete Odds API reference
+- `docs/kenpom/`: KenPom data collection
+- `scripts/collect_hybrid.py`: ESPN + Odds API hybrid collection
+- `sql/create_normalized_views.sql`: View definitions
diff --git a/docs/guides/overtime-guide.md b/docs/guides/overtime-guide.md
new file mode 100644
index 000000000..3a7ee00b9
--- /dev/null
+++ b/docs/guides/overtime-guide.md
@@ -0,0 +1,361 @@
+# Overtime.ag Collection Service Guide
+
+## Quick Start
+
+```bash
+# Install dependencies
+uv add pandas pyarrow
+
+# Run the service
+uv run python scripts/overtime_collector_service.py
+```
+
+The service will:
+- Collect College Basketball odds immediately
+- Append to `data/overtime/college_basketball_odds/YYYY-MM-DD/college_basketball_odds.parquet`
+- Wait 30 minutes
+- Repeat forever (until you stop it with Ctrl+C)
+
+## Configuration
+
+### Environment Variables
+
+Create a `.env` file (copy from `.env.overtime.example`):
+
+```bash
+# Collection frequency
+COLLECTION_INTERVAL=30 # Minutes between collections
+
+# What to collect
+COLLECTION_SPORTS=Basketball,Football
+COLLECTION_SUBTYPES=College Basketball,NFL
+
+# Where to save
+DATA_DIR=data/overtime
+
+# Logging
+LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
+```
+
+Load with:
+```bash
+# Windows PowerShell
+Get-Content .env | ForEach-Object {
+ if ($_ -match '^([^=]+)=(.*)$') {
+ [Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process')
+ }
+}
+uv run python scripts/overtime_collector_service.py
+```
+
+Or use `python-dotenv`:
+```bash
+uv add python-dotenv
+```
+
+Then modify the script to load `.env` automatically.
+
+### Command-line Override
+
+```bash
+# Collect every 15 minutes
+$env:COLLECTION_INTERVAL=15
+uv run python scripts/overtime_collector_service.py
+
+# Change sports
+$env:COLLECTION_SPORTS="Basketball"
+$env:COLLECTION_SUBTYPES="NBA"
+uv run python scripts/overtime_collector_service.py
+```
+
+## Running the Service
+
+### Option 1: Foreground (Development)
+
+```bash
+uv run python scripts/overtime_collector_service.py
+```
+
+**Pros**: See logs in real-time, easy to debug
+**Cons**: Stops when terminal closes
+
+Stop with: `Ctrl+C`
+
+### Option 2: Background (Windows)
+
+#### Using PowerShell Job
+
+```powershell
+# Start in background
+$job = Start-Job -ScriptBlock {
+ Set-Location "C:\Users\omall\Documents\python_projects\sports-betting-edge"
+ uv run python scripts/overtime_collector_service.py
+}
+
+# Check status
+Get-Job
+
+# View output
+Receive-Job -Id $job.Id -Keep
+
+# Stop
+Stop-Job -Id $job.Id
+Remove-Job -Id $job.Id
+```
+
+#### Using Task Scheduler (Production)
+
+1. Open Task Scheduler
+2. Create Basic Task
+ - Name: "Overtime Odds Collector"
+ - Trigger: "At startup"
+ - Action: Start a program
+ - Program: `uv.exe`
+ - Arguments: `run python scripts/overtime_collector_service.py`
+ - Start in: `C:\Users\omall\Documents\python_projects\sports-betting-edge`
+
+3. Settings:
+ - ✓ Allow task to run on demand
+ - ✓ Run task as soon as possible after a scheduled start is missed
+ - ✓ If task fails, restart every: 1 minute
+
+#### Using NSSM (Windows Service)
+
+```powershell
+# Install NSSM (Windows service wrapper)
+choco install nssm
+
+# Create service
+nssm install OvertimeCollector "C:\Users\omall\.local\bin\uv.exe" "run python scripts/overtime_collector_service.py"
+nssm set OvertimeCollector AppDirectory "C:\Users\omall\Documents\python_projects\sports-betting-edge"
+
+# Start service
+nssm start OvertimeCollector
+
+# Stop service
+nssm stop OvertimeCollector
+
+# Remove service
+nssm remove OvertimeCollector confirm
+```
+
+### Option 3: Docker (Cross-platform)
+
+Create `Dockerfile.collector`:
+```dockerfile
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# Install uv
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
+
+# Copy project
+COPY . .
+
+# Install dependencies
+RUN uv sync --frozen
+
+# Run service
+CMD ["uv", "run", "python", "scripts/overtime_collector_service.py"]
+```
+
+Run:
+```bash
+docker build -f Dockerfile.collector -t overtime-collector .
+docker run -d --name overtime-collector \
+ -e COLLECTION_INTERVAL=30 \
+ -v ./data:/app/data \
+ overtime-collector
+
+# View logs
+docker logs -f overtime-collector
+
+# Stop
+docker stop overtime-collector
+```
+
+## Monitoring
+
+### View Collected Data
+
+```python
+import pandas as pd
+from pathlib import Path
+
+# List all daily partitions
+data_dir = Path("data/overtime/college_basketball_odds")
+files = sorted(data_dir.glob("*/college_basketball_odds.parquet"))
+print(f"Found {len(files)} daily partitions")
+
+# Load most recent partition
+latest = pd.read_parquet(files[-1])
+print(latest.head())
+
+# Combine all partitions
+df_all = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
+print(f"Total rows: {len(df_all)}")
+```
+
+### Track Line Movements
+
+```python
+import pandas as pd
+
+# Load a daily partition and compare two capture times
+df = pd.read_parquet(
+ "data/overtime/college_basketball_odds/2026-02-02/college_basketball_odds.parquet"
+)
+
+df1 = df[df["collected_at"] == "2026-02-02T20:00:00+00:00"]
+df2 = df[df["collected_at"] == "2026-02-02T20:30:00+00:00"]
+
+# Compare spreads
+comparison = pd.merge(
+ df1[["game_num", "team1_name", "spread_points", "collected_at"]],
+ df2[["game_num", "spread_points", "collected_at"]],
+ on="game_num",
+ suffixes=("_before", "_after")
+)
+
+comparison["spread_movement"] = comparison["spread_points_after"] - comparison["spread_points_before"]
+
+# Games with line movement
+movers = comparison[comparison["spread_movement"] != 0]
+print(f"Games with line movement: {len(movers)}")
+print(movers[["team1_name", "spread_movement"]])
+```
+
+## Service Logs
+
+The service logs:
+- **INFO**: Normal operations (collections started/completed)
+- **WARNING**: No games found, unusual conditions
+- **ERROR**: Collection failures, network issues
+
+Example output:
+```
+2026-02-02 20:00:00 [INFO] __main__: Starting collection at 2026-02-02T20:00:00+00:00
+2026-02-02 20:00:01 [INFO] __main__: Fetching Basketball - College Basketball...
+2026-02-02 20:00:02 [INFO] __main__: [OK] Basketball - College Basketball: 6 games saved
+2026-02-02 20:00:02 [INFO] __main__: Collection complete: 6 games in 2.1s
+2026-02-02 20:00:02 [INFO] __main__: Next collection in 30 minutes (at 20:30:00)
+```
+
+## Troubleshooting
+
+### Service won't start
+
+```bash
+# Check dependencies
+uv sync
+
+# Check if port is blocked (if using web interface)
+netstat -ano | findstr :8080
+
+# Run with debug logging
+$env:LOG_LEVEL="DEBUG"
+uv run python scripts/overtime_collector_service.py
+```
+
+### No data being saved
+
+```bash
+# Check data directory permissions
+ls data/overtime/
+
+# Verify environment variables
+$env:DATA_DIR
+```
+
+### Network errors
+
+The service will automatically retry on the next interval. Check:
+- Internet connection
+- overtime.ag website status
+- Firewall/proxy settings
+
+## Graceful Shutdown
+
+The service handles:
+- **Ctrl+C**: Graceful shutdown with statistics
+- **SIGTERM**: Clean shutdown (Docker, systemd)
+- **SIGINT**: Keyboard interrupt
+
+Statistics shown on shutdown:
+```
+================================================================================
+Service Statistics
+================================================================================
+Collections completed: 48
+Collections failed: 0
+Total games collected: 288
+================================================================================
+```
+
+## Performance
+
+Typical resource usage:
+- **Memory**: 50-100 MB
+- **CPU**: <1% (idle), 5-10% (during collection)
+- **Disk**: ~100 KB per collection (Parquet compressed)
+- **Network**: ~500 KB per collection
+
+## Data Retention
+
+Parquet files accumulate over time. Suggested retention policy:
+
+```python
+from pathlib import Path
+from datetime import datetime, timedelta
+
+def cleanup_old_files(data_dir: Path, days: int = 30) -> None:
+ """Delete daily partitions older than X days."""
+ cutoff = datetime.now() - timedelta(days=days)
+ base_dir = data_dir / "college_basketball_odds"
+
+ for partition in base_dir.iterdir():
+ if not partition.is_dir():
+ continue
+
+ try:
+ partition_date = datetime.strptime(partition.name, "%Y-%m-%d")
+ except ValueError:
+ continue
+
+ if partition_date < cutoff:
+ for file in partition.glob("*.parquet"):
+ file.unlink()
+ partition.rmdir()
+ print(f"Deleted: {partition}")
+
+# Run monthly
+cleanup_old_files(Path("data/overtime"), days=30)
+```
+
+## Next Steps
+
+1. **Add implied probability conversion**
+ - Invoke `/normalize-odds` skill
+ - Implement `american_to_implied_prob()` function
+ - Update `_normalize_games()` to calculate probabilities
+
+2. **Create overtime adapter**
+ ```bash
+ /new_adapter overtime_ag
+ ```
+
+3. **Integrate with KenPom**
+ - Cross-reference team names
+ - Combine efficiency metrics with odds
+ - Calculate expected value
+
+4. **Build alerts**
+ - Detect significant line movements
+ - Compare with opening lines
+ - Identify sharp money signals
+
+5. **Web dashboard**
+ - Visualize line movements
+ - Show current odds
+ - Display historical trends
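+For step 1, the conversion follows the standard American-odds formula; a minimal sketch (`american_to_implied_prob` is the name suggested above, the exact signature is an assumption):
+
+```python
+def american_to_implied_prob(odds: int) -> float:
+    """Convert American odds to implied probability (vig included).
+
+    Negative odds (favorites): |odds| / (|odds| + 100)
+    Positive odds (underdogs): 100 / (odds + 100)
+    """
+    if odds < 0:
+        return -odds / (-odds + 100)
+    return 100 / (odds + 100)
+
+assert american_to_implied_prob(-110) == 110 / 210   # standard -110 juice, ~52.4%
+assert american_to_implied_prob(150) == 0.4          # +150 underdog, 40%
+```
+
+Summing both sides' implied probabilities exceeds 1.0 by the bookmaker's vig; normalize by that sum before aggregating across books.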
diff --git a/docs/kenpom/data-structure.md b/docs/kenpom/data-structure.md
new file mode 100644
index 000000000..6810b907c
--- /dev/null
+++ b/docs/kenpom/data-structure.md
@@ -0,0 +1,88 @@
+# KenPom Data Structure
+
+Standard layout for KenPom data in `data/kenpom/`. Uses Parquet for analytics and ML pipelines.
+
+## Directory Layout
+
+```
+data/kenpom/
+├── {dataset-type}/
+│ ├── daily/ # Date-stamped snapshots (when applicable)
+│ │ └── {dataset-type}_{YYYY-MM-DD}.parquet
+│ └── season/ # Season-level data
+│ └── {dataset-type}_{YYYY}.parquet
+```
+
+- **Dataset types**: kebab-case (e.g., `four-factors`, `misc-stats`, `pointdist`)
+- **Daily**: Date-specific collections; filename includes `YYYY-MM-DD`
+- **Season**: Season-level data; filename includes `YYYY` (e.g., 2026 for 2025–26 season)
+
+## File Naming
+
+### Season Data (standard)
+
+| Dataset | Path | Example |
+|---------------|------------------------------------|-----------------------------------|
+| ratings | `ratings/season/ratings_{YYYY}.parquet` | `ratings/season/ratings_2026.parquet` |
+| four-factors | `four-factors/season/four-factors_{YYYY}.parquet` | `four-factors/season/four-factors_2026.parquet` |
+| four-factors (conf only) | `four-factors/season/four-factors_conference_{YYYY}.parquet` | `four-factors/season/four-factors_conference_2026.parquet` |
+| misc-stats | `misc-stats/season/misc-stats_{YYYY}.parquet` | `misc-stats/season/misc-stats_2026.parquet` |
+| misc-stats (conf only) | `misc-stats/season/misc-stats_conference_{YYYY}.parquet` | same pattern |
+| pointdist | `pointdist/season/pointdist_{YYYY}.parquet` | `pointdist/season/pointdist_2026.parquet` |
+| height | `height/season/height_{YYYY}.parquet` | `height/season/height_2026.parquet` |
+| teams | `teams/season/teams_{YYYY}.parquet` | `teams/season/teams_2026.parquet` |
+| conferences | `conferences/season/conferences_{YYYY}.parquet` | `conferences/season/conferences_2026.parquet` |
+| conf-ratings | `conf-ratings/season/conf-ratings_{YYYY}.parquet` | `conf-ratings/season/conf-ratings_2026.parquet` |
+| efficiency | `efficiency/season/efficiency_{YYYY}.parquet` | `efficiency/season/efficiency_2026.parquet` |
+
+### Season Data (special cases)
+
+| Dataset | Path | Example |
+|----------------|-----------------------------------------------|----------------------------------------------|
+| conf-standings | `conf-standings/season/{CONF}_standings_{YYYY}.parquet` | `conf-standings/season/B10_standings_2026.parquet` |
+| game-attribs | `game-attribs/season/game-attribs_{Type}_{YYYY}.parquet` | `game-attribs/season/game-attribs_Excitement_2026.parquet` |
+| player-stats | `player-stats/season/player-stats_{StatType}_{YYYY}.parquet` | `player-stats/season/player-stats_ORtg_2026.parquet` |
+| schedules | `schedules/season/{TeamName}_schedule_{YYYY}.parquet` | `schedules/season/Wisconsin_schedule_2026.parquet` |
+| scouting | `scouting/season/{TeamName}_scouting_{YYYY}.parquet` | `scouting/season/Chicago_St._scouting_2026.parquet` |
+
+### Daily Data
+
+| Dataset | Path | Example |
+|-----------|-------------------------------------------|-------------------------------------------------|
+| ratings | `ratings/daily/ratings_{YYYY-MM-DD}.parquet` | `ratings/daily/ratings_2026-01-31.parquet` |
+| height | `height/daily/height_{YYYY-MM-DD}.parquet` | `height/daily/height_2026-01-31.parquet` |
+| fanmatch | `fanmatch/daily/fanmatch_{YYYY-MM-DD}.parquet` | `fanmatch/daily/fanmatch_2026-01-28.parquet` |
+
+## Current Season
+
+Default season for collection: **2026** (2025–26 NCAA season).
+
+Update `settings.kenpom_default_season` or pass `--season 2026` to the CLI when needed.
+
+## REST API Collection Output
+
+The `kenpom_collection` service writes:
+
+- **Season-level** (ratings, four-factors, misc-stats, pointdist, height): `{type}/season/{type}_{season}.parquet`
+- **Conference-only** variants: `{type}/season/{type}_conference_{season}.parquet`
+- **Daily** (fanmatch): `fanmatch/daily/fanmatch_{date}.parquet`
+
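+The `{type}/season/...` naming scheme above can be reproduced with a small helper. A minimal sketch; the function name and signature are illustrative, not part of the service's actual API:
+
+```python
+from pathlib import Path
+
+def season_output_path(base: Path, dtype: str, season: int, conf_only: bool = False) -> Path:
+    """Build the parquet path for a season-level collection output."""
+    stem = f"{dtype}_conference_{season}" if conf_only else f"{dtype}_{season}"
+    return base / dtype / "season" / f"{stem}.parquet"
+
+# data/kenpom/four-factors/season/four-factors_2026.parquet
+path = season_output_path(Path("data/kenpom"), "four-factors", 2026)
+```
+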
+## Reading Data
+
+```python
+from pathlib import Path
+from sports_betting_edge.adapters.filesystem import read_parquet
+
+kenpom_dir = Path("data/kenpom")
+season = 2026
+date = "2026-01-28"
+
+# Season ratings
+ratings = read_parquet(kenpom_dir / f"ratings/season/ratings_{season}.parquet")
+
+# Daily fanmatch
+games = read_parquet(kenpom_dir / f"fanmatch/daily/fanmatch_{date}.parquet")
+```
+
+## Related Docs
+
+- [endpoints.md](endpoints.md) — API endpoints and field reference
+- [fields.md](fields.md) — Metric definitions and quality thresholds
diff --git a/docs/kenpom/endpoints.md b/docs/kenpom/endpoints.md
new file mode 100644
index 000000000..1ffc2634c
--- /dev/null
+++ b/docs/kenpom/endpoints.md
@@ -0,0 +1,723 @@
+# KenPom Data Endpoints Reference
+
+Complete documentation of all KenPom.com data access methods.
+
+**Data structure:** Output layout and file naming are documented in [data-structure.md](data-structure.md).
+
+## Data Access Methods
+
+KenPom data can be accessed in two ways:
+
+| Method | Auth Type | Use Case |
+| ---------------------- | ------------------ | ------------------------------------- |
+| **kenpompy library** | Subscription login | HTML scraping, full endpoint coverage |
+| **REST API (api.php)** | API key | Direct JSON, cleaner field names |
+
+---
+
+## kenpompy Library Endpoints
+
+### Authentication
+
+```python
+from kenpompy.utils import login
+
+browser = login(email, password)
+# Returns: CloudScraper session with auth cookies
+```
+
+---
+
+### 1. Summary Endpoints (`kenpompy.summary`)
+
+#### get_efficiency(browser, season)
+
+**URL:** `https://kenpom.com/summary.php`
+**Min Year:** 1999
+
+Returns adjusted efficiency, tempo, and average possession length.
+
+| Column | Type | Description |
+| ----------------------------- | ----- | --------------------------------- |
+| Team | str | Team name |
+| Conference | str | Conference name |
+| Tempo-Adj | float | Adjusted tempo (poss/40 min) |
+| Tempo-Adj.Rank | int | Adjusted tempo rank |
+| Tempo-Raw | float | Raw tempo |
+| Tempo-Raw.Rank | int | Raw tempo rank |
+| Avg. Poss Length-Offense | float | Offensive possession length (sec) |
+| Avg. Poss Length-Offense.Rank | int | Off APL rank |
+| Avg. Poss Length-Defense | float | Defensive possession length (sec) |
+| Avg. Poss Length-Defense.Rank | int | Def APL rank |
+| Off. Efficiency-Adj | float | Adjusted offensive efficiency |
+| Off. Efficiency-Adj.Rank | int | Adj OE rank |
+| Off. Efficiency-Raw | float | Raw offensive efficiency |
+| Off. Efficiency-Raw.Rank | int | Raw OE rank |
+| Def. Efficiency-Adj | float | Adjusted defensive efficiency |
+| Def. Efficiency-Adj.Rank | int | Adj DE rank |
+| Def. Efficiency-Raw | float | Raw defensive efficiency |
+| Def. Efficiency-Raw.Rank | int | Raw DE rank |
+
+---
+
+#### get_fourfactors(browser, season)
+
+**URL:** `https://kenpom.com/stats.php`
+**Min Year:** 1999
+
+Returns Dean Oliver's Four Factors for offense and defense.
+
+| Column | Type | Description |
+| --------------- | ----- | ----------------------------- |
+| Team | str | Team name |
+| Conference | str | Conference name |
+| AdjTempo | float | Adjusted tempo |
+| AdjTempo.Rank | int | Tempo rank |
+| AdjOE | float | Adjusted offensive efficiency |
+| AdjOE.Rank | int | AdjOE rank |
+| Off-eFG% | float | Offensive effective FG% |
+| Off-eFG%.Rank | int | Off eFG% rank |
+| Off-TO% | float | Offensive turnover % |
+| Off-TO%.Rank | int | Off TO% rank |
+| Off-OR% | float | Offensive rebound % |
+| Off-OR%.Rank | int | Off OR% rank |
+| Off-FTRate | float | Offensive free throw rate |
+| Off-FTRate.Rank | int | Off FTRate rank |
+| AdjDE | float | Adjusted defensive efficiency |
+| AdjDE.Rank | int | AdjDE rank |
+| Def-eFG% | float | Defensive eFG% allowed |
+| Def-eFG%.Rank | int | Def eFG% rank |
+| Def-TO% | float | Defensive turnover % forced |
+| Def-TO%.Rank | int | Def TO% rank |
+| Def-OR% | float | Defensive rebound % allowed |
+| Def-OR%.Rank | int | Def OR% rank |
+| Def-FTRate | float | Defensive FTRate allowed |
+| Def-FTRate.Rank | int | Def FTRate rank |
+
+---
+
+#### get_teamstats(browser, season, defense=False)
+
+**URL:** `https://kenpom.com/teamstats.php`
+**Min Year:** 1999
+
+Returns detailed shooting and play statistics.
+
+| Column | Type | Description |
+| --------------------- | ----- | ------------------------------------------- |
+| Team | str | Team name |
+| Conference | str | Conference name |
+| 3P% | float | Three-point percentage |
+| 3P%.Rank | int | 3P% rank |
+| 2P% | float | Two-point percentage |
+| 2P%.Rank | int | 2P% rank |
+| FT% | float | Free throw percentage |
+| FT%.Rank | int | FT% rank |
+| Blk% | float | Block percentage |
+| Blk%.Rank | int | Block% rank |
+| Stl% | float | Steal percentage |
+| Stl%.Rank | int | Steal% rank |
+| NST% | float | Non-steal turnover % |
+| NST%.Rank | int | NST% rank |
+| A% | float | Assist percentage |
+| A%.Rank | int | Assist% rank |
+| 3PA% | float | Three-point attempt rate |
+| 3PA%.Rank | int | 3PA% rank |
+| AdjOE/AdjDE | float | Adjusted efficiency (O or D based on param) |
+| AdjOE.Rank/AdjDE.Rank | int | Efficiency rank |
+
+---
+
+#### get_pointdist(browser, season)
+
+**URL:** `https://kenpom.com/pointdist.php`
+**Min Year:** 1999
+
+Returns scoring distribution percentages.
+
+| Column | Type | Description |
+| ----------- | ----- | ----------------------------- |
+| Team | str | Team name |
+| Conference | str | Conference name |
+| Off-FT | float | % of offense from free throws |
+| Off-FT.Rank | int | Off FT% rank |
+| Off-2P | float | % of offense from 2-pointers |
+| Off-2P.Rank | int | Off 2P% rank |
+| Off-3P | float | % of offense from 3-pointers |
+| Off-3P.Rank | int | Off 3P% rank |
+| Def-FT | float | % of defense from FT allowed |
+| Def-FT.Rank | int | Def FT% rank |
+| Def-2P | float | % of defense from 2P allowed |
+| Def-2P.Rank | int | Def 2P% rank |
+| Def-3P | float | % of defense from 3P allowed |
+| Def-3P.Rank | int | Def 3P% rank |
+
+---
+
+#### get_height(browser, season)
+
+**URL:** `https://kenpom.com/height.php`
+**Min Year:** 2007
+
+Returns height, experience, and roster continuity.
+
+| Column | Type | Description |
+| --------------- | ----- | ----------------------------------- |
+| Team | str | Team name |
+| Conference | str | Conference name |
+| AvgHgt | float | Average height (inches) |
+| AvgHgt.Rank | int | Avg height rank |
+| EffHgt | float | Effective height (minutes-weighted) |
+| EffHgt.Rank | int | Eff height rank |
+| C-Hgt | float | Center height |
+| C-Hgt.Rank | int | Center height rank |
+| PF-Hgt | float | Power forward height |
+| PF-Hgt.Rank | int | PF height rank |
+| SF-Hgt | float | Small forward height |
+| SF-Hgt.Rank | int | SF height rank |
+| SG-Hgt | float | Shooting guard height |
+| SG-Hgt.Rank | int | SG height rank |
+| PG-Hgt | float | Point guard height |
+| PG-Hgt.Rank | int | PG height rank |
+| Experience | float | Team experience (years) |
+| Experience.Rank | int | Experience rank |
+| Bench | float | Bench minutes % |
+| Bench.Rank | int | Bench rank |
+| Continuity | float | Continuity % |
+| Continuity.Rank | int | Continuity rank |
+
+---
+
+### 2. Miscellaneous Endpoints (`kenpompy.misc`)
+
+#### get_pomeroy_ratings(browser, season)
+
+**URL:** `https://kenpom.com/index.php`
+**Min Year:** 1999
+
+Main ratings table from the KenPom homepage.
+
+| Column | Type | Description |
+| -------- | ----- | ----------------------------- |
+| Rk | int | Overall rank |
+| Team | str | Team name |
+| Conf | str | Conference |
+| W-L | str | Win-loss record |
+| AdjEM | float | Adjusted efficiency margin |
+| AdjO | float | Adjusted offensive efficiency |
+| AdjO.Rk | int | Adj OE rank |
+| AdjD | float | Adjusted defensive efficiency |
+| AdjD.Rk | int | Adj DE rank |
+| AdjT | float | Adjusted tempo |
+| AdjT.Rk | int | Adj tempo rank |
+| Luck | float | Luck rating |
+| Luck.Rk | int | Luck rank |
+| SOS | float | Strength of schedule |
+| SOS.Rk | int | SOS rank |
+| OppO | float | Opponent offensive efficiency |
+| OppO.Rk | int | OppO rank |
+| OppD | float | Opponent defensive efficiency |
+| OppD.Rk | int | OppD rank |
+| NCSOS | float | Non-conference SOS |
+| NCSOS.Rk | int | NCSOS rank |
+
+---
+
+### 3. FanMatch Endpoint (`kenpompy.FanMatch`)
+
+#### FanMatch(browser, date)
+
+**URL:** `https://kenpom.com/fanmatch.php?d={YYYY-MM-DD}`
+**Min Year:** 2014
+
+Game predictions and results for a specific date.
+
+**Object Attributes:**
+
+| Attribute | Type | Description |
+| ----------------------------- | --------- | ------------------------- |
+| url | str | Full URL of page |
+| date | str | Date scraped |
+| lines_o_night | list | Best games of the night |
+| ppg | float | Average points per game |
+| avg_eff | float | Average efficiency |
+| pos_40 | float | Possessions per 40 min |
+| mean_abs_err_pred_total_score | float | Prediction MAE for total |
+| bias_pred_total_score | float | Prediction bias for total |
+| mean_abs_err_pred_mov | float | Prediction MAE for MOV |
+| record_favs | str | Favorites record |
+| expected_record_favs | str | Expected favorites record |
+| exact_mov | str | Exact MOV predictions |
+| fm_df | DataFrame | Full game data |
+
+**fm_df Columns:**
+
+| Column | Type | Description |
+| -------------------- | ----- | --------------------------------- |
+| Game | str | Matchup description |
+| Location | str | Game location |
+| ThrillScore | float | Pre-game entertainment prediction |
+| Comeback | float | Comeback rating |
+| Excitement | float | Post-game excitement rating |
+| ThrillScoreRank | int | ThrillScore rank |
+| ExcitementRank | int | Excitement rank |
+| ComebackRank | int | Comeback rank |
+| MVP | str | Game MVP |
+| Tournament | str | Tournament designation |
+| Possessions | int | Game possessions |
+| PredictedWinner | str | Predicted winner |
+| PredictedScore | str | Predicted score |
+| WinProbability | float | Win probability |
+| PredictedPossessions | int | Predicted possessions |
+| PredictedMOV | int | Predicted margin of victory |
+| PredictedLoser | str | Predicted loser |
+| OT | str | Overtime indicator |
+| Loser | str | Actual loser |
+| LoserRank | int | Loser's rank |
+| LoserScore | int | Loser's score |
+| Winner | str | Actual winner |
+| WinnerRank | int | Winner's rank |
+| WinnerScore | int | Winner's score |
+| ActualMOV | int | Actual margin of victory |
+
+---
+
+## REST API Endpoints (api.php)
+
+The REST API provides cleaner JSON responses but requires a separate API key.
+
+### API Details
+
+- **Base URL:** `https://kenpom.com`
+- **Format:** JSON
+- **Auth:** Bearer token in `Authorization` header
+
+```
+GET /api.php?endpoint=ratings&y=2025
+Authorization: Bearer YOUR_API_KEY
+```
+
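+The same request can be built with the standard library. A sketch that only constructs the request (sending and JSON decoding are left to the caller), assuming the key is exported as `KENPOM_API_KEY`:
+
+```python
+import os
+from urllib.parse import urlencode
+from urllib.request import Request
+
+api_key = os.environ.get("KENPOM_API_KEY", "YOUR_API_KEY")
+
+url = "https://kenpom.com/api.php?" + urlencode({"endpoint": "ratings", "y": 2025})
+req = Request(url, headers={"Authorization": f"Bearer {api_key}"})
+# To execute: urllib.request.urlopen(req) returns JSON, one object per team.
+```
+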
+### Python Client
+
+```python
+from src.api_client import KenPomAPI
+
+# Set KENPOM_API_KEY environment variable or pass directly
+api = KenPomAPI(api_key="your-api-key")
+```
+
+**Get your API key:** Log into KenPom.com and navigate to Account Settings.
+
+---
+
+### Working Endpoints (Verified January 2025)
+
+| Endpoint | Method | Params | Description |
+| ------------ | -------------------------- | --------------------------- | ---------------------------------- |
+| ratings | `get_ratings()` | y, team_id, c | Team ratings with efficiency |
+| archive | `get_archive()` | d, y, preseason, team_id, c | Historical ratings from past dates |
+| teams | `get_teams()` | y, c | Team lookup with IDs and arenas |
+| conferences | `get_conference_list()` | y | Conference list |
+| conf-ratings | `get_conferences()` | y, c | Conference ratings |
+| four-factors | `get_four_factors()` | y, team_id, c, conf_only | Dean Oliver's Four Factors |
+| pointdist | `get_point_distribution()` | y, team_id, c, conf_only | Point distribution by shot type |
+| height | `get_height()` | y, team_id, c | Height/experience (2007+) |
+| misc-stats | `get_misc_stats()` | y, team_id, c, conf_only | Shooting, blocks, steals, assists |
+| fanmatch | `get_fanmatch()` | d | Game predictions for a date |
+
+---
+
+### Endpoint Field Reference
+
+#### ratings
+
+| Field | Type | Description |
+| ------------ | ----- | ------------------------------- |
+| DataThrough | str | Date of last update |
+| Season | int | Season year |
+| TeamName | str | Team name |
+| Seed | int | NCAA tournament seed (if any) |
+| ConfShort | str | Conference abbreviation |
+| Coach | str | Head coach name |
+| Wins | int | Win count |
+| Losses | int | Loss count |
+| AdjEM | float | Adjusted efficiency margin |
+| RankAdjEM | int | AdjEM rank |
+| Pythag | float | Pythagorean win expectation |
+| RankPythag | int | Pythag rank |
+| AdjOE | float | Adjusted offensive efficiency |
+| RankAdjOE | int | AdjOE rank |
+| OE | float | Raw offensive efficiency |
+| RankOE | int | OE rank |
+| AdjDE | float | Adjusted defensive efficiency |
+| RankAdjDE | int | AdjDE rank |
+| DE | float | Raw defensive efficiency |
+| RankDE | int | DE rank |
+| Tempo | float | Raw tempo |
+| RankTempo | int | Tempo rank |
+| AdjTempo | float | Adjusted tempo |
+| RankAdjTempo | int | AdjTempo rank |
+| Luck | float | Luck rating |
+| RankLuck | int | Luck rank |
+| SOS | float | Strength of schedule |
+| RankSOS | int | SOS rank |
+| SOSO | float | SOS - opponent offense |
+| RankSOSO | int | SOSO rank |
+| SOSD | float | SOS - opponent defense |
+| RankSOSD | int | SOSD rank |
+| NCSOS | float | Non-conference SOS |
+| RankNCSOS | int | NCSOS rank |
+| APL_Off | float | Avg possession length (offense) |
+| RankAPL_Off | int | APL_Off rank |
+| APL_Def | float | Avg possession length (defense) |
+| RankAPL_Def | int | APL_Def rank |
+
+---
+
+#### four-factors
+
+| Field | Type | Description |
+| ------------ | ----- | ----------------------------- |
+| DataThrough | str | Date of last update |
+| ConfOnly | bool | Conference games only flag |
+| TeamName | str | Team name |
+| Season | int | Season year |
+| eFG_Pct | float | Effective FG% (offense) |
+| RankeFG_Pct | int | eFG% rank |
+| TO_Pct | float | Turnover % (offense) |
+| RankTO_Pct | int | TO% rank |
+| OR_Pct | float | Offensive rebound % |
+| RankOR_Pct | int | OR% rank |
+| FT_Rate | float | Free throw rate (offense) |
+| RankFT_Rate | int | FT rate rank |
+| DeFG_Pct | float | Defensive eFG% allowed |
+| RankDeFG_Pct | int | Def eFG% rank |
+| DTO_Pct | float | Defensive TO% forced |
+| RankDTO_Pct | int | Def TO% rank |
+| DOR_Pct | float | Defensive rebound % allowed |
+| RankDOR_Pct | int | Def OR% rank |
+| DFT_Rate | float | Defensive FT rate allowed |
+| RankDFT_Rate | int | Def FT rate rank |
+| AdjOE | float | Adjusted offensive efficiency |
+| RankAdjOE | int | AdjOE rank |
+| AdjDE | float | Adjusted defensive efficiency |
+| RankAdjDE | int | AdjDE rank |
+| AdjTempo | float | Adjusted tempo |
+| RankAdjTempo | int | AdjTempo rank |
+
+---
+
+#### pointdist
+
+| Field | Type | Description |
+| ----------- | ----- | -------------------------- |
+| DataThrough | str | Date of last update |
+| ConfOnly | bool | Conference games only flag |
+| Season | int | Season year |
+| TeamName | str | Team name |
+| ConfShort | str | Conference abbreviation |
+| OffFt | float | % of points from FT (off) |
+| RankOffFt | int | OffFt rank |
+| OffFg2 | float | % of points from 2P (off) |
+| RankOffFg2 | int | OffFg2 rank |
+| OffFg3 | float | % of points from 3P (off) |
+| RankOffFg3 | int | OffFg3 rank |
+| DefFt | float | % of points from FT (def) |
+| RankDefFt | int | DefFt rank |
+| DefFg2 | float | % of points from 2P (def) |
+| RankDefFg2 | int | DefFg2 rank |
+| DefFg3 | float | % of points from 3P (def) |
+| RankDefFg3 | int | DefFg3 rank |
+
+---
+
+#### height
+
+| Field | Type | Description |
+| -------------- | ----- | ----------------------------------- |
+| DataThrough | str | Date of last update |
+| Season | int | Season year |
+| TeamName | str | Team name |
+| ConfShort | str | Conference abbreviation |
+| AvgHgt | float | Average height (inches) |
+| AvgHgtRank | int | AvgHgt rank |
+| HgtEff | float | Effective height (minutes-weighted) |
+| HgtEffRank | int | HgtEff rank |
+| Hgt5 | float | Position 5 (center) height |
+| Hgt5Rank | int | Hgt5 rank |
+| Hgt4 | float | Position 4 (PF) height |
+| Hgt4Rank | int | Hgt4 rank |
+| Hgt3 | float | Position 3 (SF) height |
+| Hgt3Rank | int | Hgt3 rank |
+| Hgt2 | float | Position 2 (SG) height |
+| Hgt2Rank | int | Hgt2 rank |
+| Hgt1 | float | Position 1 (PG) height |
+| Hgt1Rank | int | Hgt1 rank |
+| Exp | float | Team experience (years) |
+| ExpRank | int | Exp rank |
+| Bench | float | Bench minutes % |
+| BenchRank | int | Bench rank |
+| Continuity | float | Roster continuity % |
+| RankContinuity | int | Continuity rank |
+
+---
+
+#### misc-stats
+
+| Field | Type | Description |
+| ----------------- | ----- | ----------------------------- |
+| DataThrough | str | Date of last update |
+| ConfOnly | bool | Conference games only flag |
+| Season | int | Season year |
+| TeamName | str | Team name |
+| ConfShort | str | Conference abbreviation |
+| FG3Pct | float | 3-point percentage (offense) |
+| RankFG3Pct | int | FG3Pct rank |
+| FG2Pct | float | 2-point percentage (offense) |
+| RankFG2Pct | int | FG2Pct rank |
+| FTPct | float | Free throw percentage |
+| RankFTPct | int | FTPct rank |
+| BlockPct | float | Block percentage |
+| RankBlockPct | int | BlockPct rank |
+| StlRate | float | Steal rate |
+| RankStlRate | int | StlRate rank |
+| NSTRate | float | Non-steal turnover rate |
+| RankNSTRate | int | NSTRate rank |
+| ARate | float | Assist rate |
+| RankARate | int | ARate rank |
+| F3GRate | float | 3-point attempt rate |
+| RankF3GRate | int | F3GRate rank |
+| Avg2PADist | float | Average 2-point shot distance |
+| RankAvg2PADist | int | Avg2PADist rank |
+| OppFG3Pct | float | Opponent 3P% allowed |
+| RankOppFG3Pct | int | OppFG3Pct rank |
+| OppFG2Pct | float | Opponent 2P% allowed |
+| RankOppFG2Pct | int | OppFG2Pct rank |
+| OppFTPct | float | Opponent FT% allowed |
+| RankOppFTPct | int | OppFTPct rank |
+| OppBlockPct | float | Opponent block % |
+| RankOppBlockPct | int | OppBlockPct rank |
+| OppStlRate | float | Opponent steal rate |
+| RankOppStlRate | int | OppStlRate rank |
+| OppNSTRate | float | Opponent non-steal TO rate |
+| RankOppNSTRate | int | OppNSTRate rank |
+| OppARate | float | Opponent assist rate |
+| RankOppARate | int | OppARate rank |
+| OppF3GRate | float | Opponent 3PA rate |
+| RankOppF3GRate | int | OppF3GRate rank |
+| OppAvg2PADist | float | Opponent avg 2P distance |
+| RankOppAvg2PADist | int | OppAvg2PADist rank |
+
+---
+
+#### teams
+
+| Field | Type | Description |
+| ---------- | ---- | ----------------------- |
+| Season | int | Season year |
+| TeamName | str | Team name |
+| TeamID | int | Unique team identifier |
+| ConfShort | str | Conference abbreviation |
+| Coach | str | Head coach name |
+| Arena | str | Home arena name |
+| ArenaCity | str | Arena city |
+| ArenaState | str | Arena state |
+
+---
+
+#### conferences
+
+| Field | Type | Description |
+| --------- | ---- | ----------------------- |
+| Season | int | Season year |
+| ConfID | int | Conference ID |
+| ConfShort | str | Conference abbreviation |
+| ConfLong | str | Full conference name |
+
+---
+
+#### conf-ratings
+
+| Field | Type | Description |
+| --------- | ----- | --------------------------------- |
+| Season | int | Season year |
+| ConfShort | str | Conference abbreviation |
+| ConfID | int | Conference ID |
+| ConfLong | str | Full conference name |
+| Rank | int | Conference strength rank |
+| Rating | float | NetRtg of .500 team in conference |
+
+---
+
+#### archive
+
+| Field | Type | Description |
+| ----------------- | ----- | ----------------------------- |
+| ArchiveDate | str | Date of snapshot |
+| Season | int | Season year |
+| Preseason | bool | Is preseason snapshot |
+| TeamName | str | Team name |
+| Seed | int | Tournament seed |
+| Event | str | Tournament event |
+| ConfShort | str | Conference abbreviation |
+| AdjEM | float | AdjEM at snapshot date |
+| RankAdjEM | int | AdjEM rank at snapshot |
+| AdjOE | float | AdjOE at snapshot date |
+| RankAdjOE | int | AdjOE rank at snapshot |
+| AdjDE | float | AdjDE at snapshot date |
+| RankAdjDE | int | AdjDE rank at snapshot |
+| AdjTempo | float | AdjTempo at snapshot date |
+| RankAdjTempo | int | AdjTempo rank at snapshot |
+| AdjEMFinal | float | Final AdjEM for comparison |
+| RankAdjEMFinal | int | Final AdjEM rank |
+| AdjOEFinal | float | Final AdjOE |
+| RankAdjOEFinal | int | Final AdjOE rank |
+| AdjDEFinal | float | Final AdjDE |
+| RankAdjDEFinal | int | Final AdjDE rank |
+| AdjTempoFinal | float | Final AdjTempo |
+| RankAdjTempoFinal | int | Final AdjTempo rank |
+| RankChg | int | Rank change from snapshot |
+| AdjEMChg | float | AdjEM change from snapshot |
+| AdjTChg | float | AdjTempo change from snapshot |
+
+---
+
+#### fanmatch
+
+| Field | Type | Description |
+| --------------- | ----- | ---------------------------- |
+| GameId | int | Unique game identifier |
+| Date | str | Game date |
+| Time | str | Game time |
+| Location | str | Game location type |
+| Arena | str | Arena name |
+| Team1 | str | First team name |
+| Team1Rank | int | First team rank |
+| Team2 | str | Second team name |
+| Team2Rank | int | Second team rank |
+| ThrillScore | float | Pre-game entertainment score |
+| PredictedWinner | str | Predicted winner |
+| PredictedScore | str | Predicted final score |
+| PredictedMOV | float | Predicted margin of victory |
+| WinProbability | float | Win probability for favorite |
+| ActualWinner | str | Actual winner (post-game) |
+| ActualScore | str | Actual final score |
+| ActualMOV | int | Actual margin of victory |
+| Excitement | float | Post-game excitement rating |
+| Tension | float | Post-game tension rating |
+| Comeback | float | Comeback rating |
+| MVP | str | Game MVP |
+| OT | str | Overtime indicator |
+
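+For totals work, `PredictedScore` must be split into its two components. A hedged sketch; the `"NN-NN"` string format is an assumption and should be verified against live responses before use in the pipeline:
+
+```python
+def predicted_total(predicted_score: str | None) -> int | None:
+    """Parse a predicted score string (assumed 'winner-loser', e.g. '78-71')
+    into a predicted game total; return None on missing/unexpected input."""
+    try:
+        winner, loser = predicted_score.split("-")
+        return int(winner) + int(loser)
+    except (ValueError, AttributeError):
+        return None
+
+# predicted_total("78-71") == 149
+```
+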
+---
+
+### Unavailable Endpoints (Not in REST API)
+
+The following endpoints are documented in older references but **do not work** via the REST API. Use kenpompy scraping instead if needed:
+
+| Endpoint | Alternative |
+| ----------- | --------------------------------------------- |
+| efficiency | Use `ratings` (contains AdjOE, AdjDE, APL) |
+| hca | Use default 3.5 HCA or scrape via kenpompy |
+| schedule | Scrape via kenpompy |
+| playerstats | Scrape via kenpompy |
+| refs | Scrape via kenpompy |
+| arenas | Use `teams` (contains Arena, ArenaCity, etc.) |
+
+---
+
+## Conference Codes
+
+| Code | Conference |
+| ----- | --------------------- |
+| A10 | Atlantic 10 |
+| ACC | Atlantic Coast |
+| AE | America East |
+| Amer | American Athletic |
+| ASun | Atlantic Sun |
+| B10 | Big Ten |
+| B12 | Big 12 |
+| BE | Big East |
+| BSky | Big Sky |
+| BSth | Big South |
+| BW | Big West |
+| CAA | Colonial Athletic |
+| CUSA | Conference USA |
+| Horz | Horizon |
+| Ivy | Ivy League |
+| MAAC | Metro Atlantic |
+| MAC | Mid-American |
+| MEast | Mid-Eastern Athletic |
+| MVC | Missouri Valley |
+| MWC | Mountain West |
+| NEC | Northeast |
+| OVC | Ohio Valley |
+| Pac | Pac-12 |
+| Pat | Patriot |
+| SB | Sun Belt |
+| SC | Southern |
+| SEC | Southeastern |
+| Slnd | Southland |
+| Sum | Summit |
+| SWAC | Southwestern Athletic |
+| WAC | Western Athletic |
+| WCC | West Coast |
+
+---
+
+## Season Availability
+
+| Data Type | REST API | kenpompy |
+| ------------------ | ----------- | --------- |
+| Ratings | 2002+ | 1999+ |
+| Four Factors | 2002+ | 1999+ |
+| Point Distribution | 2002+ | 1999+ |
+| Height/Experience | 2007+ | 2007+ |
+| FanMatch | 2014+ | 2014+ |
+| Misc Stats | 2002+ | 1999+ |
+| Archive | 2002+ | N/A |
+| Player Stats | N/A | 2004+ |
+| HCA | N/A | All years |
+| Refs | N/A | 2016+ |
+| Arenas | N/A (teams) | 2010+ |
+
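+The minimum seasons above can be enforced as a fail-fast guard before any request is issued. A sketch covering the REST API floors only; the mapping is transcribed from the table above:
+
+```python
+MIN_SEASON_REST = {
+    "ratings": 2002, "four-factors": 2002, "pointdist": 2002,
+    "misc-stats": 2002, "archive": 2002, "height": 2007, "fanmatch": 2014,
+}
+
+def check_season(endpoint: str, season: int) -> None:
+    """Raise early instead of sending a request that cannot succeed."""
+    floor = MIN_SEASON_REST.get(endpoint)
+    if floor is not None and season < floor:
+        raise ValueError(f"{endpoint}: REST API data starts in {floor}, got {season}")
+
+check_season("height", 2026)  # ok; check_season("height", 2005) raises ValueError
+```
+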
+---
+
+## Usage Examples
+
+### REST API
+
+```python
+from src.api_client import KenPomAPI
+
+api = KenPomAPI()
+
+# Get current ratings
+ratings = api.get_ratings(season=2025)
+
+# Get four factors
+ff = api.get_four_factors(season=2025)
+
+# Get historical snapshot
+archive = api.get_archive(season=2024, archive_date="2024-03-15")
+
+# Get game predictions for a date
+games = api.get_fanmatch(game_date="2025-01-24")
+
+api.close()
+```
+
+### kenpompy Scraping
+
+```python
+from kenpompy.utils import login
+import kenpompy.summary as kp
+
+browser = login(email, password)
+
+# Get efficiency stats
+eff = kp.get_efficiency(browser, season="2025")
+
+# Get four factors
+ff = kp.get_fourfactors(browser, season="2025")
+```
diff --git a/docs/kenpom/fields.md b/docs/kenpom/fields.md
new file mode 100644
index 000000000..035338907
--- /dev/null
+++ b/docs/kenpom/fields.md
@@ -0,0 +1,607 @@
+# KenPom Field Definitions Reference
+
+Complete glossary of all KenPom metrics and their meanings.
+
+---
+
+## Core Efficiency Metrics
+
+### AdjEM (Adjusted Efficiency Margin)
+
+**Formula:** AdjO - AdjD
+
+The difference between a team's adjusted offensive and defensive efficiency. Represents expected point differential per 100 possessions against an average D1 team on a neutral court.
+
+| Range | Quality |
+|-------|---------|
+| > 25 | Elite |
+| 15-25 | Good |
+| 0-15 | Average |
+| < 0 | Poor |
+
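+AdjEM is the building block for margin estimates: scale the per-100-possession gap to the game's expected possessions and add home-court advantage. A back-of-envelope sketch (a simplification, not KenPom's exact prediction model), using a 3.5-point average HCA:
+
+```python
+def predicted_margin(adj_em_home: float, adj_em_away: float,
+                     adj_tempo_home: float, adj_tempo_away: float,
+                     hca: float = 3.5) -> float:
+    """Rough expected margin for the home team."""
+    expected_poss = (adj_tempo_home + adj_tempo_away) / 2
+    return (adj_em_home - adj_em_away) * expected_poss / 100 + hca
+
+# e.g. a +25 team hosting a +10 team at ~67 possessions: 15 * 0.67 + 3.5 = 13.55
+```
+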
+---
+
+### AdjO / AdjOE (Adjusted Offensive Efficiency)
+
+Points scored per 100 possessions, adjusted for opponent and location.
+
+| Range | Quality |
+|-------|---------|
+| > 120 | Elite |
+| 110-120 | Good |
+| 100-110 | Average |
+| < 100 | Poor |
+
+---
+
+### AdjD / AdjDE (Adjusted Defensive Efficiency)
+
+Points allowed per 100 possessions, adjusted for opponent and location. **Lower is better.**
+
+| Range | Quality |
+|-------|---------|
+| < 90 | Elite |
+| 90-98 | Good |
+| 98-105 | Average |
+| > 105 | Poor |
+
+---
+
+### AdjT / AdjTempo (Adjusted Tempo)
+
+Possessions per 40 minutes, adjusted for opponent tempo.
+
+| Range | Pace |
+|-------|------|
+| > 72 | Very Fast |
+| 68-72 | Fast |
+| 65-68 | Average |
+| < 65 | Slow |
+
+---
+
+### Pythag (Pythagorean Expectation)
+
+Expected win percentage based on points scored vs. allowed, adjusted for schedule. Adapted from the Pythagorean expectation originally developed for baseball.
+
+**Formula:** Points Scored^11.5 / (Points Scored^11.5 + Points Allowed^11.5)
+
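+A direct transcription of the formula. KenPom applies it to schedule-adjusted scoring rates; raw per-game averages work for illustration:
+
+```python
+def pythag(points_scored: float, points_allowed: float, exponent: float = 11.5) -> float:
+    """Pythagorean win expectation with the exponent from the formula above."""
+    ps, pa = points_scored ** exponent, points_allowed ** exponent
+    return ps / (ps + pa)
+
+# Equal scoring gives 0.5; outscoring opponents pushes the value toward 1.0.
+```
+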
+---
+
+## Tempo & Possession Metrics
+
+### APL / APLO / APLD (Average Possession Length)
+
+Average time in seconds per possession. APLO = offensive, APLD = defensive.
+
+| Range | Style |
+|-------|-------|
+| > 18 | Very deliberate |
+| 16-18 | Deliberate |
+| 14-16 | Average |
+| < 14 | Fast |
+
+---
+
+### Raw Tempo vs Adjusted Tempo
+
+- **Raw Tempo:** Actual possessions per 40 minutes
+- **Adjusted Tempo:** Tempo adjusted for opponent's pace; better for comparing teams
+
+---
+
+## Four Factors Metrics
+
+Dean Oliver's Four Factors together explain roughly 90% of the variance in winning. Approximate importance weights for NCAA basketball:
+
+| Factor | Importance |
+|--------|------------|
+| eFG% | ~40% |
+| TO% | ~25% |
+| OR% | ~20% |
+| FTRate | ~15% |
+
+---
+
+### eFG% (Effective Field Goal Percentage)
+
+**Formula:** (FGM + 0.5 × 3PM) / FGA
+
+Field goal percentage weighted to give 50% extra credit for made threes.
+
+| Side | Elite | Poor |
+|------|-------|------|
+| Offense | > 55% | < 48% |
+| Defense | < 45% | > 52% |
+
+---
+
+### TO% (Turnover Percentage)
+
+**Formula:** Turnovers / Possessions × 100
+
+Percentage of possessions ending in turnovers.
+
+| Side | Elite | Poor |
+|------|-------|------|
+| Offense | < 15% | > 20% |
+| Defense | > 22% | < 17% |
+
+---
+
+### OR% (Offensive Rebound Percentage)
+
+**Formula:** Offensive Rebounds / (Offensive Rebounds + Opponent Defensive Rebounds) × 100
+
+Percentage of available offensive rebounds grabbed.
+
+| Side | Elite | Poor |
+|------|-------|------|
+| Offense | > 35% | < 25% |
+| Defense | < 25% | > 32% |
+
+---
+
+### FTRate (Free Throw Rate)
+
+**Formula:** FTA / FGA
+
+Free throw attempts relative to field goal attempts. Measures ability to get to the line.
+
+| Side | Elite | Poor |
+|------|-------|------|
+| Offense | > 0.40 | < 0.28 |
+| Defense | < 0.25 | > 0.35 |
+
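+All four offensive factors can be computed from box-score totals using the formulas above. A minimal sketch; possessions must be estimated or supplied separately:
+
+```python
+def four_factors_offense(fgm: int, fga: int, fg3m: int, fta: int,
+                         tov: int, orb: int, opp_drb: int,
+                         poss: float) -> dict[str, float]:
+    """Offensive Four Factors from box-score totals."""
+    return {
+        "eFG%": 100 * (fgm + 0.5 * fg3m) / fga,
+        "TO%": 100 * tov / poss,
+        "OR%": 100 * orb / (orb + opp_drb),
+        "FTRate": fta / fga,
+    }
+
+ff = four_factors_offense(fgm=27, fga=60, fg3m=9, fta=20,
+                          tov=12, orb=12, opp_drb=24, poss=70)
+# ff["eFG%"] == 52.5
+```
+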
+---
+
+## Shooting Metrics
+
+### 3P% (Three-Point Percentage)
+
+Made threes / attempted threes.
+
+- **Elite:** > 38%
+- **Average:** 33-35%
+- **Poor:** < 30%
+
+---
+
+### 2P% (Two-Point Percentage)
+
+Made twos / attempted twos.
+
+- **Elite:** > 54%
+- **Average:** 48-50%
+- **Poor:** < 46%
+
+---
+
+### FT% (Free Throw Percentage)
+
+Made free throws / attempted free throws.
+
+- **Elite:** > 76%
+- **Average:** 70-72%
+- **Poor:** < 68%
+
+---
+
+### 3PA% (Three-Point Attempt Rate)
+
+**Formula:** 3PA / FGA
+
+Percentage of field goal attempts that are three-pointers. Measures offensive style.
+
+- **Three-heavy:** > 45%
+- **Average:** 35-40%
+- **Inside-oriented:** < 30%
+
+---
+
+### TS% (True Shooting Percentage)
+
+**Formula:** Points / (2 × (FGA + 0.44 × FTA))
+
+Overall shooting efficiency accounting for free throws and three-pointers.
+
+- **Elite:** > 62%
+- **Good:** 56-62%
+- **Average:** 52-56%
+- **Poor:** < 52%
+
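+Per the formula, with the result expressed as a percentage to match the thresholds above:
+
+```python
+def true_shooting(points: int, fga: int, fta: int) -> float:
+    """TS% = points / (2 * (FGA + 0.44 * FTA)), as a percentage."""
+    return 100 * points / (2 * (fga + 0.44 * fta))
+
+# round(true_shooting(80, 60, 20), 1) == 58.1
+```
+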
+---
+
+## Turnover Metrics
+
+### Stl% (Steal Percentage)
+
+**Formula:** Steals / Opponent Possessions × 100
+
+Percentage of opponent possessions ending in a steal.
+
+- **Elite Defense:** > 12%
+- **Average:** 9-10%
+
+---
+
+### NST% (Non-Steal Turnover Percentage)
+
+**Formula:** (Turnovers - Steals) / Possessions × 100
+
+Turnovers not caused by steals (travels, bad passes, shot clock violations, etc.). Measures ball-handling discipline.
+
+- **Disciplined:** < 10%
+- **Careless:** > 14%
+
+---
+
+## Assist & Block Metrics
+
+### A% (Assist Percentage)
+
+**Formula:** Assists / Field Goals Made × 100
+
+Percentage of made field goals that were assisted. Measures ball movement.
+
+- **Good ball movement:** > 60%
+- **Isolation-heavy:** < 50%
+
+---
+
+### Blk% (Block Percentage)
+
+**Formula:** Blocks / Opponent 2PA × 100
+
+Percentage of opponent two-point attempts blocked.
+
+- **Elite shot-blocking:** > 12%
+- **Average:** 8-10%
+- **Poor:** < 6%
+
+---
+
+## Height & Experience Metrics
+
+### AvgHgt (Average Height)
+
+Team's average height across all players in inches.
+
+- **Tall:** > 77" (6'5")
+- **Average:** 75-77" (6'3"-6'5")
+- **Short:** < 75" (6'3")
+
+---
+
+### EffHgt (Effective Height)
+
+Minutes-weighted average height. More representative of actual lineup height since it weights starters more heavily.
+
+---
+
+### C-Hgt, PF-Hgt, SF-Hgt, SG-Hgt, PG-Hgt
+
+Height by position (Center, Power Forward, Small Forward, Shooting Guard, Point Guard).
+
+---
+
+### Experience
+
+Average years of college experience (0-4 scale).
+
+- **Very Experienced:** > 2.5
+- **Experienced:** 2.0-2.5
+- **Average:** 1.5-2.0
+- **Young:** < 1.5
+
+---
+
+### Bench
+
+Percentage of minutes played by non-starters.
+
+- **Deep bench:** > 35%
+- **Average depth:** 28-32%
+- **Shallow bench:** < 25%
+
+---
+
+### Continuity
+
+Percentage of minutes played by returning players from previous season.
+
+- **High continuity:** > 70%
+- **Average:** 50-70%
+- **Low continuity:** < 40%
+
+---
+
+## Strength of Schedule Metrics
+
+### SOS-AdjEM
+
+Overall strength of schedule measured by average opponent AdjEM.
+
+- **Elite schedule:** > 10
+- **Strong schedule:** 5-10
+- **Average:** 0-5
+- **Weak schedule:** < 0
+
+---
+
+### SOS-OppO / SOSO
+
+Average opponent adjusted offensive efficiency faced.
+
+---
+
+### SOS-OppD / SOSD
+
+Average opponent adjusted defensive efficiency faced.
+
+---
+
+### NCSOS-AdjEM / NCSOS
+
+Non-conference strength of schedule. Important for committee evaluation.
+
+---
+
+## Luck Metrics
+
+### Luck
+
+Deviation from expected win-loss record based on game-by-game efficiency.
+
+- **Positive:** Team overperforming in close games
+- **Negative:** Team underperforming in close games
+- **Range:** Typically -0.10 to +0.10
+
+A team with +0.05 luck has won about 1-2 more close games than expected over a ~30-game season (0.05 × 30 ≈ 1.5 wins).
+
+---
+
+## Home Court Advantage Metrics
+
+### HCA (Home Court Advantage)
+
+Team-specific expected point swing for the home team. The average is about 3.5 points.
+
+| Range | HCA Strength |
+|-------|--------------|
+| > 4.5 | Strong HCA |
+| 3.0-4.5 | Average HCA |
+| < 3.0 | Weak HCA |
+
+---
+
+### Elev (Elevation)
+
+Arena elevation in feet above sea level. High altitude venues (> 5,000 ft) can provide additional home advantage due to visitor fatigue.
+
+Notable high-elevation arenas:
+- Colorado (5,430 ft)
+- Air Force (7,258 ft)
+- BYU (4,551 ft)
+
+---
+
+### PF (Personal Fouls Factor)
+
+Component of HCA related to foul differential at home vs away.
+
+---
+
+### Pts (Points Factor)
+
+Component of HCA related to points scored/allowed at home vs away.
+
+---
+
+## Player Statistics
+
+### ORtg (Offensive Rating)
+
+Points produced per 100 possessions used by the player.
+
+- **Elite:** > 125
+- **Good:** 115-125
+- **Average:** 105-115
+- **Poor:** < 100
+
+---
+
+### Poss% (Possessions Used)
+
+Percentage of team possessions used by player while on court.
+
+- **Primary option:** > 28%
+- **Secondary option:** 22-28%
+- **Role player:** < 20%
+
+---
+
+### ARate (Assist Rate)
+
+Percentage of teammate field goals assisted by player while on court.
+
+- **Elite passer:** > 30%
+- **Good passer:** 20-30%
+- **Non-playmaker:** < 15%
+
+---
+
+### FC40 (Fouls Committed per 40 min)
+
+Personal fouls committed per 40 minutes played.
+
+- **Foul prone:** > 4.5
+- **Average:** 3.0-4.0
+- **Disciplined:** < 2.5
+
+---
+
+### FD40 (Fouls Drawn per 40 min)
+
+Personal fouls drawn per 40 minutes played.
+
+- **Elite at drawing fouls:** > 6.0
+- **Average:** 3.5-5.0
+- **Rarely draws fouls:** < 2.5
+
+---
+
+## FanMatch Game Metrics
+
+### ThrillScore
+
+Pre-game prediction of how entertaining the game will be (0-100).
+
+| Score | Entertainment Value |
+|-------|-------------------|
+| > 85 | Must-watch |
+| 70-85 | Good game |
+| 50-70 | Average |
+| < 50 | Skip it |
+
+---
+
+### Excitement
+
+Post-game measure of how exciting the game actually was. Based on lead changes, close scoring, and drama.
+
+---
+
+### Tension
+
+How close and tense the game remained throughout; a high value indicates a game that stayed competitive from start to finish.
+
+---
+
+### Comeback
+
+Magnitude of any comeback that occurred. High value indicates a significant deficit was overcome.
+
+---
+
+### Dominance
+
+How one-sided the game was. High = blowout.
+
+---
+
+### MVP
+
+The player who had the biggest impact on the game outcome.
+
+---
+
+### WinProbability
+
+Pre-game probability of the favored team winning based on efficiency metrics.
+
+---
+
+### PredictedMOV / ActualMOV
+
+Predicted and actual margin of victory.
+
+---
+
+## Referee Metrics
+
+### Ref Rating
+
+Overall referee quality rating based on game management.
+
+---
+
+### Game Score
+
+Average entertainment value of games officiated by this referee.
+
+---
+
+## Scouting Report Metrics
+
+### ShotDist / DShotDist
+
+Average shot distance in feet. Higher = more perimeter-oriented. Lower = more paint touches.
+
+- **Perimeter-heavy:** > 14 ft
+- **Balanced:** 11-14 ft
+- **Paint-oriented:** < 11 ft
+
+---
+
+### PD1, PD2, PD3 (Point Distribution)
+
+Percentage of points from free throws (PD1), two-pointers (PD2), and three-pointers (PD3).
+
+**Example balanced distribution:** PD1=20%, PD2=50%, PD3=30%
+
+---
+
+## Interpreting Rankings
+
+All rank columns are out of ~360+ D1 teams:
+
+| Rank Range | Percentile | Quality |
+|------------|------------|---------|
+| 1-36 | Top 10% | Elite |
+| 37-72 | Top 20% | Very Good |
+| 73-108 | Top 30% | Good |
+| 109-180 | 30-50% | Average |
+| 181-252 | 50-70% | Below Average |
+| 253-324 | 70-90% | Poor |
+| 325-363 | Bottom 10% | Very Poor |
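+
+As a convenience, the tiers above map mechanically from rank. A sketch (assuming a 363-team field; the function name is illustrative):
+
+```python
+def rank_quality(rank: int, n_teams: int = 363) -> str:
+    """Map a national rank to the quality tiers in the table above."""
+    pct = rank / n_teams
+    if pct <= 0.10:
+        return "Elite"
+    if pct <= 0.20:
+        return "Very Good"
+    if pct <= 0.30:
+        return "Good"
+    if pct <= 0.50:
+        return "Average"
+    if pct <= 0.70:
+        return "Below Average"
+    if pct <= 0.90:
+        return "Poor"
+    return "Very Poor"
+```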
+
+---
+
+## Key Metric Combinations
+
+### Power Rating
+
+AdjEM is the primary power rating and the single best predictor of team quality.
+
+### Predicted Margin
+
+**Formula:** (AdjEM_A - AdjEM_B) × (Expected Possessions / 100) + HCA
+
+Where A is the home team, B is the away team, and AdjEM = AdjO - AdjD. Defense enters with a negative sign because a lower AdjD is better.
+
+### Efficiency Margin Prediction
+
+To predict point differential:
+1. Calculate each team's expected efficiency vs the other's defense
+2. Multiply by expected possessions
+3. Add home court advantage
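+
+A minimal sketch of these steps (function and default values are illustrative, with HCA defaulting to the ~3.5-point average):
+
+```python
+def predicted_margin(adj_em_home: float, adj_em_away: float,
+                     tempo_home: float, tempo_away: float,
+                     hca: float = 3.5, avg_tempo: float = 68.0) -> float:
+    """Predicted home-team point margin from adjusted efficiency margins."""
+    # Steps 1-2: expected possessions for this matchup (simple tempo interaction)
+    expected_poss = tempo_home * tempo_away / avg_tempo
+    # AdjEM is points per 100 possessions, so scale to expected possessions
+    margin = (adj_em_home - adj_em_away) * expected_poss / 100
+    # Step 3: add home-court advantage
+    return margin + hca
+```
+
+For two average-tempo teams with a 15-point AdjEM gap, this predicts roughly a 13.7-point home win.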
+
+### Luck-Adjusted Record
+
+Actual winning percentage minus the luck factor approximates a "true" record. Useful for identifying overperforming and underperforming teams.
+
+---
+
+## Data Quality Notes
+
+1. **Early season:** Ratings are less stable with fewer games played (< 10 games)
+2. **Preseason:** Archive ratings available for preseason projections
+3. **Historical:** Data back to 1999 but some metrics start later
+4. **Conference play:** Some stats available for conference-only games via `conference_only` parameter
+5. **Neutral site:** Games classified as Home/Away/Neutral for adjustments
+6. **Values as strings:** `get_pomeroy_ratings` returns values as strings; convert to numeric as needed
+
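+Note 6 in practice: a short pandas sketch (column names are illustrative) for converting the string-valued ratings to numeric:
+
+```python
+import pandas as pd
+
+# get_pomeroy_ratings returns numeric columns as strings (note 6 above);
+# this toy frame mimics that shape with illustrative column names.
+ratings = pd.DataFrame({
+    "TeamName": ["Duke", "Kansas"],
+    "AdjEM": ["28.5", "24.1"],
+    "AdjTempo": ["67.2", "69.8"],
+})
+
+# Convert every non-identifier column; errors="coerce" maps malformed
+# values to NaN instead of raising.
+numeric_cols = [c for c in ratings.columns if c != "TeamName"]
+ratings[numeric_cols] = ratings[numeric_cols].apply(pd.to_numeric, errors="coerce")
+```
+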
+---
+
+## Metric Aliases
+
+Some metrics have different names depending on the endpoint:
+
+| Concept | get_pomeroy_ratings | get_fourfactors | get_efficiency | Scouting Report |
+|---------|--------------------|-----------------|--------------------|-----------------|
+| Offensive Efficiency | AdjO | AdjOE | Off. Efficiency-Adj | OE |
+| Defensive Efficiency | AdjD | AdjDE | Def. Efficiency-Adj | DE |
+| Tempo | AdjT | AdjTempo | Tempo-Adj | Tempo |
+| Experience | - | - | - | Experience |
+| Possession Length | - | - | Avg. Poss Length | APLO/APLD |
diff --git a/docs/reference/ODDS_API_COVERAGE_INVESTIGATION.md b/docs/reference/ODDS_API_COVERAGE_INVESTIGATION.md
new file mode 100644
index 000000000..0c1bf1f75
--- /dev/null
+++ b/docs/reference/ODDS_API_COVERAGE_INVESTIGATION.md
@@ -0,0 +1,203 @@
+# Odds API Coverage Investigation
+
+**Date**: 2026-02-01
+**Issue**: Missing scores for 1,144 past games (36% of events)
+
+## Investigation Summary
+
+### Root Cause
+**The Odds API does not provide comprehensive coverage of NCAA Division I basketball games.** We're only getting odds for games where US bookmakers actively offer betting lines.
+
+### Data Collection Status
+
+**Collection Script**: ✅ Working correctly
+- Fetching from correct endpoint: `/v4/sports/basketball_ncaab/odds/`
+- Using proper parameters: `regions=us,us2`, `markets=h2h,spreads,totals`
+- 15 US bookmakers tracked (DraftKings, FanDuel, BetMGM, Caesars, etc.)
+
+**Current Coverage**:
+```
+API returned today: 20 events
+Average per day (14-day): 41.8 events
+Range: 1-128 events per day
+```
+
+**Missing Games Example** (January 27, 2026):
+- ❌ UConn vs Providence
+- ❌ Indiana vs Purdue
+- ❌ Kentucky vs Vanderbilt
+- ❌ Alabama vs Missouri
+- ❌ Oklahoma vs Arkansas
+- ❌ Michigan vs Nebraska
+- ❌ Notre Dame vs Virginia
+- ✅ Alabama A&M vs Prairie View (we have this)
+- ✅ Bethune-Cookman vs Alcorn St (we have this)
+
+**Pattern**: The Odds API primarily covers:
+1. Lower-tier conferences (SWAC, Southland, Big South, MEAC)
+2. Select major conference games (inconsistent)
+3. Games where lines are currently open
+
+### Why High-Profile Games Are Missing
+
+**Timing Issues**:
+- Some major games don't have lines posted yet when we collect
+- Lines may close early for marquee matchups
+- Conference tournament games may have different coverage
+
+**Bookmaker Selectivity**:
+- Not all NCAA games get betting lines
+- Smaller conferences more likely to have active lines
+- Major conference games may have different release schedules
+
+### Coverage Statistics
+
+| Metric | Value | Notes |
+|--------|-------|-------|
+| Past events in DB | 1,745 | Total games we collected |
+| Events with scores | 1,117 (64%) | Games completed + scored |
+| Events missing scores | 1,144 (36%) | **Never in Odds API** |
+| ESPN match failures | 219 | ESPN has scores, we don't have events |
+
+### TeamMapper Fix Impact
+
+**Problem**: ESPN team names didn't match our mapping
+- "Miami (OH) RedHawks" vs "Miami Oh Redhawks"
+- "TCU Horned Frogs" vs "Tcu Horned Frogs"
+
+**Solution**: Added normalization to handle case/punctuation differences
+
+**Result**:
+- ✅ Added 7 new scores from ESPN
+- ✅ Reduced match failures from 227 → 219
+- ✅ Improved mapping robustness
+
+**Remaining 219 failures**: These are games ESPN has that **aren't in our Odds API events table** (we can't backfill scores for games we never collected)
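+
+The normalization can be sketched roughly as follows (illustrative only; the actual TeamMapper fix may differ in detail):
+
+```python
+import re
+
+def normalize_team_name(name: str) -> str:
+    """Case/punctuation-insensitive team-name key (illustrative sketch).
+
+    Makes "Miami (OH) RedHawks" and "Miami Oh Redhawks" compare equal;
+    the actual TeamMapper implementation may differ in detail.
+    """
+    name = name.lower()
+    name = re.sub(r"[^a-z0-9 ]+", " ", name)  # drop parentheses, periods, etc.
+    return re.sub(r"\s+", " ", name).strip()  # collapse runs of whitespace
+```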
+
+## Recommendations
+
+### Option 1: Accept Limited Coverage (Current State)
+**Pros**:
+- No changes needed
+- Focus on games with actual betting market
+- 64% coverage may be sufficient for CLV tracking
+
+**Cons**:
+- Missing most high-profile games
+- Smaller training dataset
+- Can't track lines for major matchups
+
+### Option 2: Multi-Source Event Collection
+**Strategy**: Use ESPN as primary event source, Odds API for odds only
+
+**Implementation**:
+```
+# Daily collection flow:
+1. Fetch ALL games from ESPN scoreboard (comprehensive)
+2. Fetch odds from Odds API (limited to games with lines)
+3. Join: ESPN events + Odds API odds where available
+4. Result: Complete event coverage, partial odds coverage
+```
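+
+Sketched with toy frames (IDs and columns are hypothetical), the join step looks like:
+
+```python
+import pandas as pd
+
+# Toy frames: ESPN is the comprehensive event source; the Odds API
+# only covers games with posted lines.
+espn_events = pd.DataFrame({
+    "game_id": ["g1", "g2", "g3"],
+    "home_team": ["UConn", "Indiana", "Alabama A&M"],
+})
+odds = pd.DataFrame({
+    "game_id": ["g3"],
+    "spread_points": [6.5],
+})
+
+# Left join keeps every ESPN event; games without lines get NaN odds.
+merged = espn_events.merge(odds, on="game_id", how="left")
+merged["has_odds"] = merged["spread_points"].notna()
+```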
+
+**Pros**:
+- 100% event coverage
+- All scores available
+- Better for model training
+- Can identify which games don't get betting lines
+
+**Cons**:
+- More complex collection logic
+- Two data sources to maintain
+- ESPN doesn't provide odds, only scores
+
+### Option 3: Premium Odds Provider
+**Alternatives to investigate**:
+- **Pinnacle API**: Known for sharp lines, comprehensive coverage
+- **Action Network**: Detailed line movement data
+- **SportsDataIO**: Professional sports data aggregator
+- **RapidAPI Sports**: Various NCAA basketball feeds
+
+**Pros**:
+- Potentially better coverage
+- More reliable data
+- Professional support
+
+**Cons**:
+- Cost (Odds API is free tier)
+- API integration work
+- May still have coverage gaps
+
+### Option 4: Hybrid Approach (Recommended)
+**Strategy**: ESPN events + Odds API odds + selective premium data
+
+**Phase 1** (Immediate):
+1. ✅ Keep current Odds API collection
+2. ✅ Use ESPN for comprehensive event list
+3. ✅ Match odds to events where available
+4. ✅ Flag games without odds for analysis
+
+**Phase 2** (Future):
+1. Analyze which games consistently lack odds
+2. Evaluate if premium provider fills gaps
+3. Consider cost/benefit of additional source
+
+**Implementation**:
+```
+# Modified collect_daily.py:
+1. Fetch ESPN games (free, comprehensive)
+2. Fetch Odds API odds (free, limited)
+3. Store all events, mark odds availability
+4. Use KenPom data regardless of odds status
+```
+
+## Next Steps
+
+### Immediate Actions
+1. ✅ **Fixed**: TeamMapper normalization for case-insensitive matching
+2. ✅ **Complete**: Investigation of Odds API coverage
+3. **Decide**: Accept current coverage OR implement multi-source approach
+
+### If Implementing Multi-Source (Option 4)
+1. Modify `collect_daily.py` to fetch ESPN scoreboard first
+2. Create event records for ALL games (not just those with odds)
+3. Update schema to track `has_odds` flag
+4. Adjust validation to expect partial odds coverage
+
+### Data Quality Impact
+Current state:
+- 64% of past events have scores ✅
+- 36% missing (not in Odds API) ❌
+
+With ESPN-first approach:
+- 100% of past events have scores ✅
+- ~40-60% have odds (Odds API coverage)
+- Clear distinction between "no odds" vs "missing data"
+
+## Technical Details
+
+### Odds API Constraints
+- **Sport**: `basketball_ncaab` (correct, only NCAA option)
+- **Regions**: `us,us2` (all US bookmakers)
+- **Markets**: `h2h,spreads,totals` (all markets)
+- **Historical scores**: 3-day limit
+
+### Bookmakers Tracked (15 total)
+- DraftKings, FanDuel, BetMGM, Caesars (major)
+- BetRivers, PointsBet, Barstool, WynnBET (regional)
+- Bovada, BetOnline, BetUS (offshore)
+- Fliff, Fanatics, Hard Rock (newer)
+
+### Collection Performance
+- API quota: 4.9M remaining (plenty)
+- Collection time: ~10-15 seconds
+- Storage: ~1,250 observations per collection
+- Frequency: Can run multiple times daily
+
+## Conclusion
+
+The 1,144 missing scores are **not a bug** - they're a fundamental limitation of The Odds API's coverage. The API only provides data for games where US bookmakers offer lines, which excludes:
+- Games with lines not yet posted
+- Games with early line closures
+- Some conferences/matchups bookmakers don't cover
+
+**Recommended path forward**: Implement **Option 4 (Hybrid)** to get comprehensive event coverage via ESPN while maintaining free Odds API for line data. This provides 100% score coverage while clearly separating "no odds available" from "missing data".
diff --git a/docs/reference/ODDS_API_DB_INSPECTION.md b/docs/reference/ODDS_API_DB_INSPECTION.md
new file mode 100644
index 000000000..86327bee2
--- /dev/null
+++ b/docs/reference/ODDS_API_DB_INSPECTION.md
@@ -0,0 +1,65 @@
+# Odds API Database Inspection Report
+
+## Summary
+
+| Item | Status | Notes |
+| -------------------------------------------------- | ----------------------------- | -------------------------------------------------------------- |
+| **Database file** `data/odds_api/odds_api.sqlite3` | Gitignored, may exist locally | `.gitignore` excludes `*.sqlite3` |
+| **`_ensure_views_exist()`** | Exists | In `odds_api_db.py`, creates normalized views on first connect |
+| **`_ensure_indexes_exist()`** | Does NOT exist | `create_indexes.sql` is never applied automatically |
+| **Schema initialization** | Missing | No `CREATE TABLE` for events/observations/scores in repo |
+| **Indexes** | Never applied | Script must be run manually |
+
+---
+
+## Upstream (Data producers)
+
+| Component | Role | Impact |
+| ------------------------------ | ------------------------------------ | --------------------------------------------- |
+| **`scripts/collect_daily.py`** | Inserts events, observations, scores | Assumes tables exist; fails if schema missing |
+| **Odds API HTTP** | Source of odds/scores data | N/A (external) |
+
+**Gap**: `collect_daily.py` does `INSERT INTO observations` / `INSERT OR REPLACE INTO events` but never creates the schema. First run fails with "no such table".
+
+---
+
+## Downstream (Data consumers)
+
+| Component | Role | Impact if DB/schema missing |
+| ------------------------------------------ | --------------------------------- | --------------------------------------------- |
+| **`OddsAPIDatabase`** | Adapter, creates views on connect | Raises `FileNotFoundError` if DB path missing |
+| **`FeatureEngineer`** | Builds ML datasets | Depends on OddsAPIDatabase |
+| **`scripts/build_training_datasets.py`** | CLI for training data | Fails if DB missing |
+| **`scripts/build_datasets_espn_odds.py`** | ESPN + odds merge | Fails if DB missing |
+| **`scripts/create_team_mapping.py`** | Team mapping from events | Fails if DB missing |
+| **`scripts/force_update_views.py`** | Recreate views | Fails if DB missing |
+| **`scripts/test_odds_api_integration.py`** | Integration test | Fails if DB missing |
+| **`scripts/train_walkforward.py`** | Walk-forward training | Fails if DB missing |
+
+---
+
+## Current Bootstrap Flow
+
+1. **Database file**: Must exist before any use. Adapter raises if path missing.
+2. **Schema (tables)**: Not created by any script. Assumed to exist (or created manually).
+3. **Views**: Created by `_ensure_views_exist()` on first `OddsAPIDatabase` connect.
+4. **Indexes**: Never applied. `sql/create_indexes.sql` exists but is not wired.
+
+---
+
+## Implemented Fixes
+
+1. **`sql/create_odds_api_schema.sql`** – Added `CREATE TABLE IF NOT EXISTS` for events, observations, scores.
+2. **`_ensure_schema_exists()`** – Runs on first connect; creates tables if missing (idempotent).
+3. **`_ensure_indexes_exist()`** – Runs on first connect; applies `create_indexes.sql` (idempotent).
+4. **Bootstrap on connect** – Parent directory and DB file created if missing; schema → indexes → views applied in order.
+5. **Bootstrap order** – `conn` property: `_ensure_schema_exists()` → `_ensure_indexes_exist()` → `_ensure_views_exist()`.
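+
+A simplified sketch of that connect-time bootstrap (the real adapter reads its SQL from the files under `sql/`; the table, index, and view names here are placeholders):
+
+```python
+import sqlite3
+from pathlib import Path
+
+def bootstrap_connection(db_path: str) -> sqlite3.Connection:
+    """Create parent dir and DB file if missing, then apply
+    schema -> indexes -> views, each idempotent."""
+    path = Path(db_path)
+    path.parent.mkdir(parents=True, exist_ok=True)  # parent dir if missing
+    conn = sqlite3.connect(path)                    # creates file if missing
+    # Order matters: tables first, then indexes, then views.
+    conn.executescript("CREATE TABLE IF NOT EXISTS events (event_id TEXT PRIMARY KEY);")
+    conn.executescript("CREATE INDEX IF NOT EXISTS idx_events_id ON events(event_id);")
+    conn.executescript("CREATE VIEW IF NOT EXISTS v_events AS SELECT event_id FROM events;")
+    return conn
+```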
+
+## Manual Application (if needed)
+
+To apply schema or indexes without using the adapter:
+
+```bash
+sqlite3 data/odds_api/odds_api.sqlite3 < sql/create_odds_api_schema.sql
+sqlite3 data/odds_api/odds_api.sqlite3 < sql/create_indexes.sql
+```
diff --git a/docs/reference/architecture.md b/docs/reference/architecture.md
new file mode 100644
index 000000000..cf24b3a47
--- /dev/null
+++ b/docs/reference/architecture.md
@@ -0,0 +1,75 @@
+# Architecture
+
+System design and layering for the sports-betting-edge toolkit.
+
+## Domain
+
+- **Focus:** NCAA Men's Basketball betting edge research.
+- **Success metric:** Closing Line Value (CLV), not win percentage.
+- **Data storage:** Parquet preferred for analytical datasets.
+
+## Layering
+
+Dependency direction: **core → services → adapters**. No I/O in core.
+
+| Layer | Location | Responsibility |
+|-------|----------|----------------|
+| **Core** | `src/sports_betting_edge/core/` | Domain models, types, pure functions, exceptions. No network, no file I/O, no DB. |
+| **Services** | `src/sports_betting_edge/services/` | Workflow orchestration and business rules. Call adapters; return domain objects. |
+| **Adapters** | `src/sports_betting_edge/adapters/` | All external I/O: HTTP clients, DB access, filesystem, browser automation. |
+| **Config** | `src/sports_betting_edge/config/` | Settings from env, logging configuration only. |
+
+## Data Sources
+
+| Source | Purpose | Adapter / Integration |
+|--------|---------|------------------------|
+| **KenPom** | Efficiency, Four Factors, tempo, team ratings | `adapters/kenpom.py`; services: `kenpom_collection.py`. See `docs/kenpom/`. |
+| **The Odds API** | Odds, line movement, scores | `adapters/odds_api.py`; services: `odds_collection.py`. |
+| **ESPN** | Schedule, scoreboard, teams, logos (API + browser) | `adapters/espn.py` (scoreboard, teams, CDN logos); Puppeteer script `puppeteer/capture_espn_schedule.js`. See **ESPN data model:** `docs/reference/espn-data-model.md`. |
+| **Overtime / Action Network** | (Planned) | Not yet implemented. |
+
+## Components
+
+- **CLI** (`cli.py`): Typer app; commands for KenPom, Odds, ESPN schedule collection. Entry: `python -m sports_betting_edge` or `sports_betting_edge`.
+- **API** (`api/`): FastAPI scaffold (routers, health). Run when implemented: `uv run uvicorn sports_betting_edge.api.main:app --reload`.
+- **Scraper** (`scraper/`): Scrapy-based crawl scaffold (items, spiders, pipelines). For generic crawls.
+- **Scraper prod** (`scraper_prod/`): Production scraper scaffold; can drive Puppeteer/Playwright for ESPN schedule capture.
+- **Puppeteer** (`puppeteer/`): Node scripts for browser automation (e.g. `capture_espn_schedule.js`). Run with Node: `node puppeteer/capture_espn_schedule.js [--date YYYYMMDD]`.
+
+## Data Layout
+
+| Directory | Purpose |
+|-----------|---------|
+| `data/raw/` | Raw scraped/ingested data |
+| `data/processed/` | Processed/transformed datasets |
+| `data/analysis/` | Analysis outputs (e.g. KenPom vs odds edge analysis) |
+| `data/kenpom/` | KenPom ratings, FanMatch data |
+| `data/overtime/` | Overtime.ag outputs (parquet + JSON), partitioned by date. Tracker snapshots live under `data/overtime/tracker/`. |
+| `data/espn/` | ESPN schedule, teams, team logos |
+| `data/odds_api/` | The Odds API snapshots and DB: `odds//`, `scores//YYYY-MM-DD/`, `stream//`, plus `odds_api.sqlite3`. |
+
+Analysis CSV outputs are written to `data/analysis/` by default (e.g. `analysis_2026-01-31.csv`). Use `--output` to override, or `--output-dir` to change the base directory.
+
+## ML / Feature Importance
+
+XGBoost is used to discover which KenPom stats best predict FanMatch win probability:
+
+- **Script:** `scripts/xgboost_feature_importance.py`
+- **Features:** KenPom ratings (AdjEM, AdjOE, AdjDE, AdjTempo, Luck, SOS), Four Factors, misc stats
+- **Target:** FanMatch HomeWP (KenPom's published win probability)
+- **Output:** Feature importance by gain (which stats matter most)
+- **Requires:** `uv sync --extra ml` (xgboost, scikit-learn, pandas)
+
+Use feature importance to guide feature engineering for walk-forward ML models (spreads, totals).
+
+## Configuration
+
+- Env-based settings via `config/settings.py` (Pydantic Settings).
+- Logging: `config/logging.py`; level via `LOG_LEVEL`, optional JSON via `LOG_FORMAT=json`.
+- Optional OpenTelemetry: `OTEL_*` env vars; see `utils/otel.py`.
+
+## References
+
+- **KenPom:** `docs/kenpom/endpoints.md`, `docs/kenpom/fields.md`
+- **Odds normalization:** CLAUDE.md (normalize-odds skill); no signed spreads in models.
+- **Decisions:** `docs/reference/decisions.md`
diff --git a/docs/reference/decisions.md b/docs/reference/decisions.md
new file mode 100644
index 000000000..65ac68bdc
--- /dev/null
+++ b/docs/reference/decisions.md
@@ -0,0 +1,45 @@
+# Decisions
+
+Architecture decision records (ADRs) for this project.
+
+---
+
+## ADR-1: Strict src layout and layering
+
+**Status:** Accepted
+
+**Context:** We need a clear place for domain logic, orchestration, and I/O so that tests and refactors stay predictable and core logic stays free of external dependencies.
+
+**Decision:**
+
+- All importable Python code lives under `src/sports_betting_edge/` (src layout). No importable modules at repo root.
+- Layering: **core** (pure, no I/O) → **services** (orchestration, call adapters) → **adapters** (HTTP, DB, FS, browser).
+- Core must not depend on services or adapters. Services may depend on core and adapters. Adapters perform all I/O.
+
+**Consequences:**
+
+- Tests can unit-test core and services with mocked adapters. Integration tests target adapters.
+- New features follow: add/update types in core, orchestration in services, I/O in adapters.
+
+---
+
+## ADR-2: Odds and spreads decomposed (no signed values in models)
+
+**Status:** Accepted
+
+**Context:** Sports odds use signs to encode meaning (favorite/underdog, over/under). Storing signed numbers in models leads to sign-convention bugs and ambiguous deltas.
+
+**Decision:**
+
+- Numeric fields represent **one concept only**. We decompose:
+ - Magnitude (always positive), e.g. `spread_points = 6.5`
+ - Role via explicit flags/enums, e.g. `is_favorite`, `SideRole.OVER`
+- We do **not** store signed spreads (e.g. `-6.5`) or use `+1/-1` as role. American odds are converted to implied probability where needed.
+
+**Consequences:**
+
+- All new code involving spreads, totals, or moneylines must follow the normalize-odds skill/spec. See CLAUDE.md.
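+
+A minimal sketch of the decomposition (field names follow the examples above; `SpreadLine` itself is illustrative, not the project's actual model):
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+
+class SideRole(Enum):
+    OVER = "over"
+    UNDER = "under"
+
+@dataclass(frozen=True)
+class SpreadLine:
+    """Decomposed spread: positive magnitude plus an explicit role field."""
+    spread_points: float   # magnitude, always positive (e.g. 6.5)
+    favorite_team: str     # role carried by an explicit field, not a sign
+
+def american_to_implied_prob(odds: int) -> float:
+    """American odds -> implied probability (still includes the vig)."""
+    if odds < 0:
+        return -odds / (-odds + 100)
+    return 100 / (odds + 100)
+```
+
+With this shape, a delta between two spreads is always a difference of magnitudes, and favorite/underdog can never be confused by a sign flip.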
+
+---
+
+*Add new ADRs above with a short title, status, context, decision, and consequences.*
diff --git a/docs/reference/espn-data-model.md b/docs/reference/espn-data-model.md
new file mode 100644
index 000000000..df7b29e81
--- /dev/null
+++ b/docs/reference/espn-data-model.md
@@ -0,0 +1,56 @@
+# ESPN Data Model and Linking
+
+How ESPN data is linked and related in this project. **Use ESPN team ID as the single canonical key** for all team-scoped data.
+
+## Canonical key: ESPN team ID
+
+- **Type:** string (e.g. `"150"`, `"2710"`).
+- **Stable:** Same ID is used across scoreboard, teams API, CDN logos, and team detail/roster endpoints.
+- **Source of truth:** The [ESPN teams API](https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/teams) returns the full list; `id` on each team is the team ID.
+- **In this repo:** `ESPNTeam.team_id` and `ESPNGame.home_team_id` / `away_team_id` all use this value.
+
+## How each asset links to team ID
+
+| Asset | Link to team | Where it lives | Join / usage |
+|-------|----------------|----------------|--------------|
+| **Team metadata** | Primary key | `data/espn/teams/espn_team_names_*.parquet` | One row per team: `team_id`, `display_name`, `abbreviation`, `slug`. **Use this as the team index** for “which teams exist” and for display names. |
+| **Schedule / games** | Foreign key | `data/espn/schedule/*.parquet` | Each game has `home_team_id`, `away_team_id`. Filter by team: “all games where `home_team_id = id` or `away_team_id = id`.” |
+| **Logos** | Filename (display name) | `data/espn/team_logos/{slug}.png` | One file per team. Saved as **slug** (e.g. `boston-college-eagles.png`) for clarity and future use; image is still fetched via team_id. Collision: `{slug}_{team_id}.png`. |
+| **Roster** (future) | URL/param | Team detail API | Endpoint pattern: `.../teams/{team_id}` or `.../teams/{team_id}/roster`; same `team_id`. |
+| **Team stats** (future) | URL/param | Team or summary API | Same `team_id` in request; store with `team_id` column for join. |
+
+## Relating everything in practice
+
+1. **Team index**
+ Treat the teams Parquet file as the **master list of teams**: each row has `team_id` and names. All other ESPN data joins or references this via `team_id`.
+
+2. **Schedule → teams**
+ - To get “all games for team X”: filter schedule rows where `home_team_id == X` or `away_team_id == X`.
+ - To attach display names: left-join schedule to the teams table twice (on `home_team_id` and `away_team_id`) and take `display_name` for home and away.
+
+3. **Logos**
+ - Stored by **display name (slug)** for clarity: `data/espn/team_logos/{slug}.png` (e.g. `boston-college-eagles.png`).
+ - To get “team + logo path”: from the teams table, derive filename from `slug` or slugify(`display_name`); same logic as `services/espn_team_logos._logo_filename`.
+
+4. **Future roster / stats**
+ - Ingest with a `team_id` column. Join to the same teams table and to schedule (e.g. by `game_id` for roster-by-game) as needed.
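+
+For example, step 2's double join can be sketched in pandas (team IDs match the examples above; the display names are placeholders):
+
+```python
+import pandas as pd
+
+# Toy frames mirroring the parquet layouts (names are placeholders).
+teams = pd.DataFrame({
+    "team_id": ["150", "2710"],
+    "display_name": ["Sample Home Team", "Sample Away Team"],
+})
+schedule = pd.DataFrame({
+    "game_id": ["401"],
+    "home_team_id": ["150"],
+    "away_team_id": ["2710"],
+})
+
+# Join the teams table twice to attach home and away display names.
+games = (
+    schedule
+    .merge(teams.rename(columns={"team_id": "home_team_id",
+                                 "display_name": "home_name"}),
+           on="home_team_id")
+    .merge(teams.rename(columns={"team_id": "away_team_id",
+                                 "display_name": "away_name"}),
+           on="away_team_id")
+)
+```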
+
+## File layout (current)
+
+```
+data/espn/
+├── teams/
+│ └── espn_team_names_2026.parquet # team_id, display_name, abbreviation, slug ← team index
+├── schedule/
+│ └── 2026-01-31.parquet # game_id, home_team_id, away_team_id, ...
+└── team_logos/
+ └── {slug}.png # one image per team (e.g. boston-college-eagles.png)
+```
+
+## Cross-source linking (ESPN ↔ KenPom, Odds, etc.)
+
+- **Within ESPN:** always use **team ID**; no need for name matching.
+- **Across sources:** ESPN has no universal external ID; we use **display names** (and variants) plus fuzzy matching.
+ See `src/sports_betting_edge/utils/team_matching.py`: match Odds API / Overtime names to KenPom names. To link ESPN → KenPom, use `ESPNTeam.display_name` (or `short_display_name`) as input to the same matching pipeline; store the result as an optional `kenpom_name` or use it only at analysis time.
+
+**Summary:** Use **ESPN team ID** for all ESPN-internal linking (schedule, logos, future roster/stats). Keep the **teams Parquet** as the team index. For cross-source (e.g. ESPN ↔ KenPom), use team names and the existing `team_matching` utilities.
diff --git a/docs/reference/github-actions-pipeline.md b/docs/reference/github-actions-pipeline.md
new file mode 100644
index 000000000..7336d6ed9
--- /dev/null
+++ b/docs/reference/github-actions-pipeline.md
@@ -0,0 +1,400 @@
+# GitHub Actions Daily Data Pipeline
+
+## Overview
+
+Fully automated daily data collection and staging consolidation workflow running on GitHub Actions. It eliminates the need for local scheduling and ensures consistent daily updates.
+
+**What it does:**
+1. Collects raw data from ESPN and The Odds API (via `collect_hybrid.py`)
+2. Consolidates raw data into ML-ready staging files (via `consolidate_staging.py`)
+3. Commits updated staging files back to repository
+4. Creates GitHub issue on failure for alerting
+
+**Benefits:**
+- No local machine required - runs in cloud
+- Consistent execution schedule
+- Automatic error notifications
+- Full audit trail via git commits
+- Free for public repositories
+
+## Setup
+
+### 1. Configure GitHub Secrets
+
+Add required API credentials to your repository secrets:
+
+1. Navigate to: `Settings` → `Secrets and variables` → `Actions`
+2. Click **New repository secret**
+3. Add the following secret:
+ - Name: `ODDS_API_KEY`
+ - Value: Your Odds API key from https://the-odds-api.com/
+
+### 2. Enable Actions
+
+If this is your first GitHub Action in the repository:
+
+1. Navigate to: `Actions` tab in GitHub
+2. Click **I understand my workflows, go ahead and enable them**
+
+The workflow file is already in `.github/workflows/daily-data-pipeline.yml`.
+
+### 3. Verify Workflow is Scheduled
+
+1. Go to `Actions` tab
+2. Click on **Daily Data Pipeline** workflow
+3. Confirm the cron schedule shows: `At 14:00 UTC daily` (6 AM Pacific)
+
+## Workflow Details
+
+### Schedule
+
+**Default**: Daily at 6:00 AM Pacific (14:00 UTC)
+
+To change schedule, edit `.github/workflows/daily-data-pipeline.yml`:
+
+```yaml
+on:
+ schedule:
+ # Run daily at 6 AM Pacific (2 PM UTC)
+ - cron: '0 14 * * *'
+```
+
+**Cron syntax:**
+- `0 14 * * *` = 14:00 UTC (6:00 AM Pacific) daily
+- `0 10,14,18,22 * * *` = 10:00, 14:00, 18:00, 22:00 UTC (4x daily)
+- `0 6 * * 1-5` = 06:00 UTC Monday-Friday only
+
+### Jobs
+
+The workflow runs three sequential jobs:
+
+#### 1. Collect (`collect`)
+- Runs `scripts/collect_hybrid.py`
+- Collects events from ESPN (comprehensive coverage)
+- Collects odds from The Odds API (betting lines)
+- Stores in SQLite database
+- Uploads artifacts for consolidation
+
+#### 2. Consolidate (`consolidate`)
+- Depends on `collect` job completion
+- Runs `scripts/consolidate_staging.py`
+- Consolidates raw data into staging layer:
+ - `events.parquet` - Unified event catalog
+ - `line_features.parquet` - Pre-computed line movements
+ - `team_ratings.parquet` - KenPom ratings with name mapping
+ - `metadata.json` - Build timestamp and coverage stats
+- Commits staging updates to repository (scheduled runs only)
+- Uploads staging artifacts
+
+#### 3. Notify (`notify`)
+- Only runs on failure
+- Creates GitHub issue with error details
+- Labels: `automation`, `bug`
+- Includes workflow run URL for debugging
+
+## Manual Triggers
+
+You can manually trigger the workflow with custom options:
+
+1. Go to `Actions` tab
+2. Click **Daily Data Pipeline**
+3. Click **Run workflow** dropdown
+4. Configure options:
+ - **Skip odds collection**: Run ESPN collection only (saves API credits)
+ - **Force staging rebuild**: Rebuild even if staging is recent
+5. Click **Run workflow**
+
+### Common Manual Scenarios
+
+**Test collection without using API credits:**
+```
+✓ Skip odds collection (ESPN only)
+✗ Force staging rebuild
+```
+
+**Force full rebuild:**
+```
+✗ Skip odds collection
+✓ Force staging rebuild
+```
+
+## Monitoring
+
+### View Workflow Runs
+
+1. Navigate to `Actions` tab
+2. Click **Daily Data Pipeline**
+3. View run history with status indicators:
+ - ✓ Green = Success
+ - ✗ Red = Failed
+ - ◷ Yellow = Running
+
+### Check Logs
+
+Click on any workflow run to see detailed logs for each job:
+
+1. **Collect** - Raw data collection logs
+2. **Consolidate** - Staging consolidation logs
+3. **Notify** - Error notification (if failed)
+
+### Download Artifacts
+
+Each workflow run uploads artifacts for inspection:
+
+- **raw-data** (7 days retention)
+ - `odds_api.sqlite3` - Raw SQLite database
+ - `hybrid_collection.log` - Collection logs
+
+- **staging-data** (30 days retention)
+ - `*.parquet` - Staging layer files
+ - `metadata.json` - Build metadata
+
+To download:
+1. Click on workflow run
+2. Scroll to **Artifacts** section
+3. Click artifact name to download ZIP
+
+### Git Commits
+
+Successful runs commit staging updates:
+
+```
+commit abc123...
+Author: github-actions[bot]
+
+chore: update staging data
+
+Automated daily update from data pipeline
+
+Co-Authored-By: Claude Sonnet 4.5
+```
+
+View commit history: `git log --grep="staging data"`
+
+## Monitoring Staging Data
+
+### Check Latest Build
+
+```bash
+# View staging metadata
+cat data/staging/metadata.json
+
+# Output:
+{
+ "built_at": "2026-02-05T14:05:23.456789",
+ "coverage": {
+ "events": 555,
+ "with_scores": 555,
+ "with_line_features": 551,
+ "teams_with_ratings": 362
+ },
+ "feature_coverage_pct": 99.3,
+ "score_coverage_pct": 100.0
+}
+```
+
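+The repo's 5-day freshness SLO can be checked directly against this metadata. A minimal guard, assuming `built_at` is an ISO-8601 timestamp as in the example above (the helper name is illustrative, not part of the codebase):
+
+```python
+from datetime import datetime, timedelta, timezone
+
+def staging_is_fresh(metadata: dict, max_age_days: int = 5) -> bool:
+    """Return True if the staging build falls inside the rolling freshness window."""
+    built_at = datetime.fromisoformat(metadata["built_at"])
+    if built_at.tzinfo is None:
+        built_at = built_at.replace(tzinfo=timezone.utc)  # no offset in the string; assume UTC
+    return datetime.now(timezone.utc) - built_at <= timedelta(days=max_age_days)
+```
+
+Pair this with `json.load(open("data/staging/metadata.json"))` in a pre-training guard and fail fast when it returns `False`.
+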
+### Verify Staging Files
+
+```bash
+# Check file sizes
+ls -lh data/staging/
+
+# Count rows
+uv run python -c "
+import pandas as pd
+
+events = pd.read_parquet('data/staging/events.parquet')
+features = pd.read_parquet('data/staging/line_features.parquet')
+ratings = pd.read_parquet('data/staging/team_ratings.parquet')
+
+print(f'Events: {len(events)} games')
+print(f'Line features: {len(features)} games ({len(features)/len(events)*100:.1f}% coverage)')
+print(f'Team ratings: {len(ratings)} teams')
+"
+```
+
+## Error Handling
+
+### Automatic Issue Creation
+
+On workflow failure, an issue is automatically created:
+
+**Title:** `Daily Data Pipeline Failed - 2026-02-05`
+
+**Body:**
+```
+The daily data pipeline workflow has failed.
+
+**Workflow Run:** https://github.com/omalleyandy/sports-betting-edge/actions/runs/12345
+
+**Date:** 2026-02-05T14:05:23.456Z
+
+Please check the logs and investigate the failure.
+```
+
+**Labels:** `automation`, `bug`
+
+### Common Failures
+
+#### ODDS_API_KEY not set
+**Symptom:** Collect job fails with "ODDS_API_KEY environment variable not set"
+
+**Fix:**
+1. Go to `Settings` → `Secrets and variables` → `Actions`
+2. Verify `ODDS_API_KEY` secret exists and is correct
+3. Re-run workflow
+
+#### API Quota Exceeded
+**Symptom:** Collect job fails with HTTP 429 or quota error
+
+**Fix:**
+1. Check API quota at https://the-odds-api.com/
+2. Upgrade API plan or reduce collection frequency
+3. Use manual trigger with "Skip odds collection" enabled
+
+#### KenPom Data Missing
+**Symptom:** Consolidate job fails with "KenPom ratings not found"
+
+**Fix:**
+1. Ensure `data/kenpom/ratings/season/ratings_2026.parquet` exists
+2. Run KenPom collection locally: `uv run python scripts/collect_kenpom_ratings.py`
+3. Commit and push KenPom data
+4. Re-run workflow
+
+#### Staging Validation Failed
+**Symptom:** Consolidate job completes but validation fails
+
+**Fix:**
+1. Check consolidation logs in workflow run
+2. Verify raw database has data: Download raw-data artifact
+3. Investigate specific validation error in logs
+4. Fix issue locally and test with `uv run python scripts/consolidate_staging.py`
+
+## Cost Analysis
+
+### GitHub Actions Minutes
+
+**Free tier (public repositories):**
+- Unlimited minutes for public repos
+- This workflow uses ~5-10 minutes per run
+- **Cost: $0**
+
+**Private repositories:**
+- 2,000 minutes/month free (Free plan)
+- 3,000 minutes/month (Pro plan)
+- This workflow: ~10 min/day = 300 min/month
+- **Cost: Free within tier limits**
+
+### API Credits
+
+**The Odds API usage per run:**
+- 1 credit for odds (if not skipped)
+- Minimal credits for ESPN (free API)
+- **Daily: ~1 credit per run**
+- **Monthly: ~30 credits (well within the 500-credit/month free tier)**
+
+## Workflow File Location
+
+`.github/workflows/daily-data-pipeline.yml`
+
+To modify workflow:
+1. Edit the YAML file locally
+2. Commit and push changes
+3. GitHub Actions automatically picks up changes
+
+## Integration with Training Pipeline
+
+The staging layer is automatically updated daily. Training scripts use these files:
+
+```python
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+# FeatureEngineer automatically loads from staging
+engineer = FeatureEngineer(staging_path="data/staging/")
+
+# Build features for training
+features = engineer.build_training_features()
+```
+
+**Recommended training workflow:**
+1. **Daily:** Staging layer auto-updates (GitHub Actions)
+2. **Weekly:** Retrain models locally using fresh staging data
+3. **Monthly:** Evaluate model performance and adjust
+
+## Troubleshooting
+
+### Workflow Not Running
+
+**Check:**
+1. Workflow enabled in Actions tab?
+2. Repository has `ODDS_API_KEY` secret?
+3. Cron schedule correct? (14:00 UTC = 6 AM Pacific)
+4. GitHub Actions minutes available? (check usage in Settings)
+
+### Manual Run Succeeds, Scheduled Run Fails
+
+**Possible causes:**
+- Scheduled runs use `schedule` trigger, manual uses `workflow_dispatch`
+- Different permissions or secrets configuration
+- Check if commit step is causing issues (only runs on schedule)
+
+**Debug:**
+1. Review workflow logs for differences
+2. Test locally: `uv run python scripts/collect_hybrid.py`
+3. Verify secrets are available to scheduled runs
+
+### Staging Files Not Updating
+
+**Check:**
+1. Consolidate job succeeded?
+2. Commit step succeeded? (check logs)
+3. Are there merge conflicts? (check git status)
+4. Is staging directory in `.gitignore`? (should NOT be)
+
+## Advanced Configuration
+
+### Change Commit Behavior
+
+Edit workflow to **not** commit staging files:
+
+```yaml
+- name: Commit staging updates
+ if: false # Disable commits
+ run: |
+ ...
+```
+
+### Add Notifications
+
+Add Slack/Discord notification on success:
+
+```yaml
+- name: Notify success
+ if: success()
+ uses: slackapi/slack-github-action@v1
+ with:
+ webhook-url: ${{ secrets.SLACK_WEBHOOK }}
+ payload: |
+ {
+ "text": "Daily data pipeline succeeded"
+ }
+```
+
+### Run on Multiple Schedules
+
+Add multiple cron expressions:
+
+```yaml
+on:
+ schedule:
+ - cron: '0 14 * * *' # 6 AM PT daily
+ - cron: '0 22 * * *' # 2 PM PT daily (game days)
+```
+
+## References
+
+- GitHub Actions docs: https://docs.github.com/en/actions
+- Cron syntax: https://crontab.guru/
+- The Odds API: https://the-odds-api.com/
+- Staging layer: `scripts/consolidate_staging.py`
+- Collection script: `scripts/collect_hybrid.py`
diff --git a/docs/reference/overtime-ag-adapter.md b/docs/reference/overtime-ag-adapter.md
new file mode 100644
index 000000000..6df42416d
--- /dev/null
+++ b/docs/reference/overtime-ag-adapter.md
@@ -0,0 +1,260 @@
+# Overtime.ag SignalR Adapter
+
+Production-ready adapter for collecting real-time odds from Overtime.ag via SignalR WebSocket interception.
+
+## Architecture
+
+```
+Chrome Browser (--remote-debugging-port=9222)
+ |
+ | WebSocket (wss://ws.ticosports.com/signalr)
+ |
+ V
+Chrome DevTools Protocol (CDP)
+ |
+ | Network.webSocketFrameReceived events
+ |
+ V
+OvertimeSignalRClient
+ |
+ | Parse SignalR broadcastMessage
+ | Normalize line changes (per /normalize-odds skill)
+ |
+ V
+OvertimeSignalRLineChange (domain model)
+```
+
+## Components
+
+### Core Domain Model
+
+**`OvertimeSignalRLineChange`** (in `src/sports_betting_edge/core/models.py`)
+
+Normalized line change following `/normalize-odds` patterns:
+- `line_points`: Always positive magnitude (e.g., 6.5 not -6.5)
+- `side_role`: FAVORITE/UNDERDOG for spreads, OVER/UNDER for totals
+- `is_steam`: True when `ChangedBy="AutoMover"` (sharp action indicator)
+- `market_type`: SPREAD, TOTAL, or MONEYLINE enum
+- Full pricing: `money1`, `money2`, `decimal1`, `decimal2`
+
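+The magnitude-plus-role convention can be illustrated with a small helper (a sketch only; the function name and return shape are illustrative, not the adapter's actual code):
+
+```python
+def normalize_spread(signed_points: float, team: str, opponent: str) -> dict:
+    """Convert a signed spread for `team` into positive magnitude + explicit side role."""
+    favored = signed_points < 0  # a negative spread means the team is laying points
+    return {
+        "line_points": abs(signed_points),  # always positive magnitude (6.5, not -6.5)
+        "side_role": "FAVORITE" if favored else "UNDERDOG",
+        "favorite_team": team if favored else opponent,
+    }
+```
+
+Storing the magnitude with an explicit role avoids the ±spread averaging bug the project rules forbid.
+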
+### Adapter Layer
+
+**`OvertimeSignalRClient`** (in `src/sports_betting_edge/adapters/overtime_ag/signalr_client.py`)
+
+Async WebSocket client using Chrome DevTools Protocol:
+
+```python
+async with OvertimeSignalRClient() as client:
+ async for line_change in client.stream_line_changes(duration_seconds=3600):
+ print(f"{line_change.team} {line_change.line_points} [STEAM: {line_change.is_steam}]")
+```
+
+**Features**:
+- Automatic connection to Chrome CDP
+- SignalR message parsing (`gbsHub.broadcastMessage`)
+- Line normalization (magnitude-only, explicit roles)
+- Steam move detection
+- Structured logging
+- Type-safe with full mypy compliance
+
+### Operational Script
+
+**`scripts/collect_overtime_realtime.py`**
+
+Production collection script with Parquet output:
+
+```bash
+# Collect for 1 hour
+uv run python scripts/collect_overtime_realtime.py --duration 3600
+
+# Collect for full game window (3 hours)
+uv run python scripts/collect_overtime_realtime.py --duration 10800
+
+# Custom output location
+uv run python scripts/collect_overtime_realtime.py --output data/raw/overtime_lines.parquet
+```
+
+**Output Schema** (Parquet):
+- `timestamp`: When line changed (UTC)
+- `game_num`: Overtime.ag game number
+- `market_type`: SPREAD, TOTAL, MONEYLINE
+- `line_points`: Magnitude only (positive)
+- `side_role`: FAVORITE/UNDERDOG or OVER/UNDER
+- `team`: Team name (if available)
+- `money1`, `money2`: American odds both sides
+- `is_steam`: True if AutoMover
+- `captured_at`: When we captured it
+
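+Since `money1`/`money2` are American odds, downstream aggregation typically converts them to implied probability per the repo's normalization rules; a minimal converter:
+
+```python
+def american_to_implied_prob(odds: int) -> float:
+    """Implied win probability from American odds (vig left in)."""
+    if odds > 0:
+        return 100 / (odds + 100)
+    return -odds / (-odds + 100)
+
+print(american_to_implied_prob(-110))  # ~0.5238
+print(american_to_implied_prob(150))   # 0.4
+```
+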
+## Prerequisites
+
+1. **Chrome with Remote Debugging**:
+ ```powershell
+   chrome.exe --remote-debugging-port=9222 `
+       --user-data-dir="$env:USERPROFILE\.chrome-profiles\overtime-ag"
+ ```
+
+2. **overtime.ag Tab**:
+ - Open https://www.overtime.ag/sports#/
+ - Log in (session persists in profile)
+ - Navigate to Basketball -> College Basketball
+
+3. **Dependencies**:
+ ```bash
+ uv add websockets httpx
+   uv add polars  # optional: only needed for Parquet export
+ ```
+
+## Usage Examples
+
+### Basic Collection
+
+```python
+from sports_betting_edge.adapters.overtime_ag import OvertimeSignalRClient
+
+async def collect_lines():
+ async with OvertimeSignalRClient() as client:
+ async for line_change in client.stream_line_changes(duration_seconds=600):
+ if line_change.is_steam:
+ print(f"[STEAM] {line_change.team} moved to {line_change.line_points}")
+```
+
+### With Pattern Detection
+
+```python
+from sports_betting_edge.adapters.overtime_ag import OvertimeSignalRClient
+from sports_betting_edge.core.types import MarketType
+
+async def detect_steam_moves():
+ steam_moves = []
+
+ async with OvertimeSignalRClient() as client:
+ async for change in client.stream_line_changes(duration_seconds=3600):
+ if change.is_steam and change.market_type == MarketType.SPREAD:
+ steam_moves.append({
+ "team": change.team,
+ "line": change.line_points,
+ "timestamp": change.timestamp,
+ })
+
+ return steam_moves
+```
+
+### Save to Parquet
+
+```python
+import polars as pl
+from sports_betting_edge.adapters.overtime_ag import OvertimeSignalRClient
+
+async def collect_to_parquet():
+ line_changes = []
+
+ async with OvertimeSignalRClient() as client:
+ async for change in client.stream_line_changes(duration_seconds=3600):
+ line_changes.append(change.model_dump())
+
+ df = pl.DataFrame(line_changes)
+ df.write_parquet("data/overtime_lines.parquet")
+```
+
+## Integration with Services
+
+### Service Layer Pattern
+
+```python
+# src/sports_betting_edge/services/odds_collector.py
+
+from sports_betting_edge.adapters.overtime_ag import OvertimeSignalRClient
+
+class OddsCollector:
+ async def collect_realtime_lines(self, duration_seconds: int = 3600):
+ async with OvertimeSignalRClient() as client:
+ async for line_change in client.stream_line_changes(duration_seconds):
+ await self._process_line_change(line_change)
+```
+
+## Testing
+
+### Unit Tests
+
+Standard pytest async tests work:
+
+```python
+import pytest
+from sports_betting_edge.adapters.overtime_ag import OvertimeSignalRClient
+
+@pytest.mark.asyncio
+async def test_client_connection():
+ async with OvertimeSignalRClient() as client:
+ assert client._ws is not None
+```
+
+### Integration Tests
+
+Requires Chrome running:
+
+```bash
+# Skip Chrome-dependent tests
+uv run pytest tests/integration/adapters/test_overtime_signalr.py -m "not requires_chrome"
+
+# Run all tests (Chrome must be running)
+uv run pytest tests/integration/adapters/test_overtime_signalr.py -v
+```
+
+## Troubleshooting
+
+### Chrome Not Running
+
+**Error**: `ConfigurationError: Chrome not running with remote debugging`
+
+**Fix**:
+```powershell
+chrome.exe --remote-debugging-port=9222 `
+    --user-data-dir="$env:USERPROFILE\.chrome-profiles\overtime-ag"
+```
+
+### No overtime.ag Tab
+
+**Error**: `ConfigurationError: No overtime.ag tab found`
+
+**Fix**: Open https://www.overtime.ag/sports#/ in Chrome
+
+### Session Expired
+
+**Symptom**: Login screen appears
+
+**Fix**: Log in manually; the session persists in the dedicated Chrome profile
+
+### No LineChanges Captured
+
+**Symptom**: No line changes in stream
+
+**Fix**: Navigate to Basketball -> College Basketball to trigger SignalR subscription
+
+## Performance
+
+- **Message Volume**: 100-300 line changes/hour during peak windows
+- **WebSocket Overhead**: <1 MB/hour for line data
+- **Storage**: ~2 KB/change (normalized Parquet)
+- **Latency**: Sub-second from Overtime servers
+
+## Security & Ethics
+
+1. **Terms of Service**: Review Overtime.ag TOS before automated collection
+2. **Rate Limiting**: Real-time stream has natural rate limiting
+3. **Personal Use**: Collection for personal betting research is generally acceptable
+4. **No Redistribution**: Do not resell Overtime.ag data
+
+## Related Documentation
+
+- `/overtime-collecting` skill: Comprehensive collection methodology
+- `/normalize-odds` skill: Mandatory normalization patterns
+- [SignalR Protocol](https://github.com/dotnet/aspnetcore/tree/main/src/SignalR/docs/specs)
+- [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/)
+
+## Version History
+
+- **1.0.0** (2024-02-02): Initial production release
+ - OvertimeSignalRClient adapter
+ - OvertimeSignalRLineChange domain model
+ - Integration tests
+ - Collection script with Parquet export
diff --git a/docs/reference/overtime_line_feed_api.md b/docs/reference/overtime_line_feed_api.md
new file mode 100644
index 000000000..5f2f6f804
--- /dev/null
+++ b/docs/reference/overtime_line_feed_api.md
@@ -0,0 +1,301 @@
+# Overtime Line Feed API
+
+Real-time NCAA Men's Basketball line movement tracking API built with FastAPI.
+
+## Quick Start
+
+```bash
+# Start the API server
+uv run python scripts/start_line_feed_api.py
+
+# Visit the dashboard
+open http://127.0.0.1:8000
+
+# View API documentation
+open http://127.0.0.1:8000/docs
+```
+
+## Features
+
+### [OK] Real-Time Line Tracking
+- Compare opening odds to current lines
+- Track spread and total movements
+- Monitor juice changes (vig movement)
+- Filter by minimum movement thresholds
+
+### [OK] Dashboard
+- Auto-refreshing web interface (30-second intervals)
+- Color-coded movement indicators
+- Filterable by category (College Basketball / College Extra)
+- Snapshot count tracking
+- Responsive design
+
+### [OK] REST API
+- `/api/games` - List all tracked games
+- `/api/movements` - Get line movements with filters
+- `/api/game/{game_id}` - Full snapshot history for a game
+- `/api/stats` - Database statistics
+
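+A filtered request against these endpoints can be built like so (the filter values here are illustrative):
+
+```python
+from urllib.parse import urlencode
+
+BASE = "http://127.0.0.1:8000"
+params = {"category": "College Basketball", "min_spread_movement": 1.0}
+url = f"{BASE}/api/movements?{urlencode(params)}"
+print(url)
+# then e.g. httpx.get(url).json() once the server is running
+```
+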
+## Data Source
+
+The API reads from the SQLite database at:
+```
+data/overtime/overtime_lines.db
+```
+
+This database is populated by the `overtime_line_tracker.py` script running periodic captures.
+
+## Architecture
+
+```
+FastAPI App (src/sports_betting_edge/api/)
+│
+├── app.py Main FastAPI application
+├── models.py Pydantic response models
+├── routers/
+│ ├── health.py Health check endpoint
+│ └── line_movements.py Line tracking endpoints
+└── static/
+ └── index.html Dashboard UI
+```
+
+## API Endpoints
+
+### GET /api/games
+List all games with opening lines.
+
+**Query Parameters:**
+- `category` (optional): Filter by category (e.g., "College Basketball")
+
+**Response:**
+```json
+{
+ "games": [
+ {
+ "game_id": "college_basketball_601-602",
+ "category": "College Basketball",
+ "away_team": "Tennessee State",
+ "home_team": "UT Martin",
+ "game_time_str": "7:00 PM",
+ "game_date_str": "Feb 5",
+ "away_rotation": "601",
+ "home_rotation": "602"
+ }
+ ],
+ "total": 245
+}
+```
+
+### GET /api/movements
+Get line movements comparing opening to current lines.
+
+**Query Parameters:**
+- `category` (optional): Filter by category
+- `min_spread_movement` (optional): Minimum spread movement (absolute value)
+- `min_total_movement` (optional): Minimum total movement (absolute value)
+
+**Response:**
+```json
+[
+ {
+ "game": {
+ "game_id": "college_basketball_601-602",
+ "category": "College Basketball",
+ "away_team": "Tennessee State",
+ "home_team": "UT Martin",
+ ...
+ },
+ "opening": {
+ "spread_magnitude": 6.5,
+ "favorite_team": "UT Martin",
+ "spread_favorite_price": -110,
+ "spread_underdog_price": -110,
+ "total_points": 145.5,
+ "total_over_price": -110,
+ "total_under_price": -110,
+ "captured_at": "2026-02-05T12:00:00Z"
+ },
+ "current": {
+ "spread_magnitude": 7.0,
+ "favorite_team": "UT Martin",
+ "spread_favorite_price": -115,
+ "spread_underdog_price": -105,
+ "total_points": 146.0,
+ "total_over_price": -108,
+ "total_under_price": -112,
+ "captured_at": "2026-02-05T18:30:00Z"
+ },
+ "movement": {
+ "spread_movement": 0.5,
+ "total_movement": 0.5,
+ "spread_fav_juice_movement": -5,
+ "spread_dog_juice_movement": 5,
+ "total_over_juice_movement": 2,
+ "total_under_juice_movement": -2
+ },
+ "snapshots_count": 192
+ }
+]
+```
+
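+The `movement` block above is current-minus-opening for each field; a sketch of that calculation, assuming simple differences (consistent with the example values shown, but not necessarily the API's exact code):
+
+```python
+def compute_movement(opening: dict, current: dict) -> dict:
+    """Line movement as reported by /api/movements: current value minus opening value."""
+    return {
+        "spread_movement": current["spread_magnitude"] - opening["spread_magnitude"],
+        "total_movement": current["total_points"] - opening["total_points"],
+        "spread_fav_juice_movement": current["spread_favorite_price"] - opening["spread_favorite_price"],
+        "spread_dog_juice_movement": current["spread_underdog_price"] - opening["spread_underdog_price"],
+    }
+```
+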
+### GET /api/game/{game_id}
+Get full snapshot history for a specific game.
+
+**Response:**
+```json
+{
+ "game": { ... },
+ "opening": { ... },
+ "snapshots": [
+ { "captured_at": "2026-02-05T12:00:00Z", ... },
+ { "captured_at": "2026-02-05T12:30:00Z", ... },
+ { "captured_at": "2026-02-05T13:00:00Z", ... }
+ ],
+ "total_snapshots": 192
+}
+```
+
+### GET /api/stats
+Get database statistics.
+
+**Response:**
+```json
+{
+ "total_games": 245,
+ "total_snapshots": 47085,
+ "games_with_movement": 245,
+ "last_update": "2026-02-05T18:30:00Z",
+ "database_url": "sqlite:///./data/overtime/overtime_lines.db"
+}
+```
+
+## Dashboard Interface
+
+The web dashboard provides:
+
+1. **Stats Overview**
+ - Total games tracked
+ - Total snapshots captured
+ - Games with movement
+ - Last update timestamp
+
+2. **Filtering**
+ - Category filter (College Basketball / College Extra)
+ - Minimum spread movement filter
+ - Minimum total movement filter
+
+3. **Game Cards**
+ - Matchup display (away @ home)
+ - Opening vs current odds comparison
+ - Movement indicators (▲ up, ▼ down)
+ - Color-coded significant movements (yellow border)
+ - Snapshot count
+ - Timestamps for opening and current lines
+
+4. **Auto-Refresh**
+ - Updates every 30 seconds automatically
+ - Live indicator shows refresh status
+
+## Integration with Line Tracker
+
+The API consumes data from the line tracking service. To populate data:
+
+```bash
+# One-shot capture
+uv run python scripts/overtime_line_tracker.py --once
+
+# Continuous capture (every 30 minutes)
+uv run python scripts/overtime_line_tracker.py --interval 30
+```
+
+**Recommended Setup:**
+1. Run line tracker continuously: `--interval 30`
+2. Start API server in separate terminal
+3. Access dashboard at http://127.0.0.1:8000
+
+## Deployment
+
+### Development
+```bash
+uv run python scripts/start_line_feed_api.py
+```
+
+### Production
+```bash
+# With multiple workers
+uv run python scripts/start_line_feed_api.py --prod --workers 4 --host 0.0.0.0 --port 8000
+```
+
+### Docker (Optional)
+```dockerfile
+FROM python:3.12-slim
+WORKDIR /app
+COPY . .
+RUN pip install uv && uv sync --no-dev
+EXPOSE 8000
+CMD ["uv", "run", "python", "scripts/start_line_feed_api.py", "--prod", "--host", "0.0.0.0"]
+```
+
+## Performance
+
+- **Database**: SQLite with indexed queries (game_id, captured_at)
+- **Response time**: < 100ms for most endpoints
+- **Memory**: ~50MB base + ~1MB per 1000 snapshots
+- **Concurrency**: Supports multiple concurrent requests
+
+## Security
+
+### Development
+- Runs on localhost (127.0.0.1) by default
+- CORS enabled for local development
+
+### Production
+- Bind deliberately: `--host 0.0.0.0` listens on **all** interfaces, so only use it behind a firewall or proxy
+- Use reverse proxy (nginx/Caddy) for HTTPS
+- Restrict CORS origins in `app.py`
+- Consider authentication for sensitive deployments
+
+## Troubleshooting
+
+### Database Not Found
+```
+[ERROR] Database not found
+```
+**Solution**: Run the line tracker to create and populate the database:
+```bash
+uv run python scripts/overtime_line_tracker.py --once
+```
+
+### No Games Found
+```
+No games found with current filters
+```
+**Solution**: Clear filters or adjust thresholds. Database may be empty - run line tracker.
+
+### Port Already in Use
+```
+[ERROR] error while attempting to bind on address ('127.0.0.1', 8000)
+```
+**Solution**: Use a different port:
+```bash
+uv run python scripts/start_line_feed_api.py --port 8001
+```
+
+## Future Enhancements
+
+Potential additions:
+- [ ] WebSocket for real-time push updates
+- [ ] Historical line charts (Chart.js)
+- [ ] Steam move detection and alerts
+- [ ] Export to CSV functionality
+- [ ] Comparison with other sportsbooks (Odds API integration)
+- [ ] Line movement velocity calculations
+- [ ] Sharp action indicators
+- [ ] Email/SMS alerts for significant movements
+
+## See Also
+
+- `scripts/overtime_line_tracker.py` - Line tracking scheduler
+- `src/sports_betting_edge/services/overtime_line_tracking.py` - Line tracking service
+- `scripts/README_overtime.md` - Overtime.ag integration guide
+- `/normalize-odds` skill - Odds data normalization patterns
diff --git a/docs/reference/team_mapping.md b/docs/reference/team_mapping.md
new file mode 100644
index 000000000..63c0b9df5
--- /dev/null
+++ b/docs/reference/team_mapping.md
@@ -0,0 +1,274 @@
+# Team Mapping System
+
+## Overview
+
+The team mapping system provides a canonical identifier for all NCAA Division I basketball teams, enabling accurate data joins across multiple data sources with different naming conventions.
+
+## Canonical Team Mapping Table
+
+**Location**: `data/processed/team_mapping.parquet`
+
+**Base Source**: KenPom (365 D-I teams, 2026 season)
+
+### Schema
+
+| Field | Type | Description | Source |
+|-------|------|-------------|--------|
+| `canonical_team_id` | int64 | Unique team identifier (KenPom ID) | KenPom |
+| `canonical_name` | string | Official canonical team name | KenPom |
+| `conference` | string | Conference abbreviation | KenPom |
+| `division` | string | Division (always "D1") | KenPom |
+| `kenpom_id` | int64 | KenPom team ID | KenPom |
+| `kenpom_name` | string | KenPom team name | KenPom |
+| `espn_id` | Int64 | ESPN team ID (nullable) | ESPN |
+| `espn_display_name` | string | ESPN display name (nullable) | ESPN |
+| `espn_abbreviation` | string | ESPN abbreviation (nullable) | ESPN |
+| `espn_slug` | string | ESPN URL slug (nullable) | ESPN |
+| `overtime_name` | string | Overtime.ag team name (nullable) | Overtime.ag |
+| `odds_api_name` | string | The Odds API team name (nullable) | The Odds API |
+
+## Data Source Coverage
+
+| Source | Teams Mapped | Coverage |
+|--------|-------------|----------|
+| **KenPom** | 365 / 365 | 100.0% (BASE) |
+| **ESPN** | 48 / 365 | 13.2% |
+| **Overtime.ag** | 193 / 365 | 52.9% |
+| **The Odds API** | 47 / 365 | 12.9% |
+
+## Common Name Variations
+
+Different data sources use different naming conventions. Examples:
+
+| Canonical (KenPom) | ESPN | Overtime.ag | The Odds API |
+|-------------------|------|-------------|--------------|
+| Arizona St. | Arizona State Sun Devils | Arizona State | - |
+| Connecticut | UConn Huskies | Connecticut | - |
+| UCF | Central Florida Knights | Central Florida | - |
+| N.C. State | NC State Wolfpack | NC State | NC State Wolfpack |
+| Louisiana St. | LSU Tigers | Louisiana State | Louisiana State Tigers |
+| Miami FL | Miami Hurricanes | Miami (FL) | Miami (FL) Hurricanes |
+| Alabama St. | - | Alabama State | Alabama St Hornets |
+| Cal Poly | Cal Poly Mustangs | Cal Poly SLO | - |
+| East Tennessee St. | - | East Tenn State | East Tennessee St Buccaneers |
+
+## Building the Mapping
+
+### 1. Initialize from KenPom (Base)
+
+```bash
+uv run python scripts/build_team_mapping.py
+```
+
+Creates the base mapping with all 365 D-I teams from KenPom.
+
+### 2. Map ESPN Teams
+
+```bash
+uv run python scripts/map_espn_teams.py
+```
+
+Maps ESPN team data using:
+- Manual mappings for common variations
+- Fuzzy string matching (85% threshold)
+
+**Result**: 48 teams mapped (100% match rate on ESPN data)
+
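+The fuzzy-matching step can be sketched with the stdlib's `difflib` (the mapping script may use a different matcher; the 0.85 threshold mirrors the one above):
+
+```python
+from difflib import SequenceMatcher
+
+def best_match(name: str, candidates: list[str], threshold: float = 0.85) -> str | None:
+    """Return the closest candidate if its similarity ratio clears the threshold."""
+    score, match = max(
+        (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c) for c in candidates
+    )
+    return match if score >= threshold else None
+```
+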
+### 3. Map Overtime.ag Teams
+
+```bash
+uv run python scripts/map_overtime_teams.py
+```
+
+Maps Overtime.ag team names using:
+- Manual mappings for abbreviations and variations
+- Exact string matching
+- Fuzzy matching as fallback (90% threshold)
+
+**Result**: 193 teams mapped (100% match rate on Overtime data)
+
+### 4. Collect Odds API Sample Data
+
+```bash
+uv run python scripts/collect_odds_api_sample.py
+```
+
+Fetches current NCAAB odds and extracts unique team names.
+
+**Cost**: 1-2 credits (depends on regions)
+
+### 5. Map The Odds API Teams
+
+```bash
+uv run python scripts/map_odds_api_teams.py
+```
+
+Maps Odds API team names using:
+- Manual mappings for full names with mascots
+- Mascot stripping for fuzzy matching
+- Core team name extraction
+
+**Result**: 47 teams mapped (100% match rate on Odds API data)
+
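+Mascot stripping can be approximated by peeling trailing words off the full name until the remaining prefix matches a known name (an illustrative sketch, not the script's exact logic):
+
+```python
+def strip_mascot(full_name: str, known_names: set[str]) -> str | None:
+    """Drop trailing words (the mascot) until the remaining prefix is a known name."""
+    words = full_name.split()
+    for end in range(len(words), 0, -1):
+        prefix = " ".join(words[:end])
+        if prefix in known_names:
+            return prefix
+    return None
+
+print(strip_mascot("NC State Wolfpack", {"NC State", "Kansas"}))  # NC State
+```
+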
+## Usage Examples
+
+### Join KenPom metrics with Overtime.ag odds
+
+```python
+import pandas as pd
+
+# Load team mapping
+mapping = pd.read_parquet('data/processed/team_mapping.parquet')
+
+# Load KenPom efficiency data
+kenpom_df = pd.read_parquet('data/kenpom/teams/season/teams_2026.parquet')
+
+# Load Overtime odds data
+overtime_df = pd.read_parquet('data/overtime/2026-01-31.parquet')
+
+# Join Overtime home team to canonical mapping
+home_joined = overtime_df.merge(
+ mapping[['overtime_name', 'canonical_team_id', 'kenpom_id']],
+ left_on='home_team',
+ right_on='overtime_name',
+ how='left'
+)
+
+# Now join to KenPom metrics
+final_df = home_joined.merge(
+ kenpom_df,
+ left_on='kenpom_id',
+ right_on='TeamID',
+ how='left'
+)
+```
+
+### Validate matchup consistency
+
+```python
+import pandas as pd
+
+mapping = pd.read_parquet('data/processed/team_mapping.parquet')
+overtime_df = pd.read_parquet('data/overtime/2026-01-31.parquet')
+
+# Check for unmatched teams
+home_unmatched = overtime_df[
+ ~overtime_df['home_team'].isin(mapping['overtime_name'])
+]
+away_unmatched = overtime_df[
+ ~overtime_df['away_team'].isin(mapping['overtime_name'])
+]
+
+if len(home_unmatched) > 0 or len(away_unmatched) > 0:
+ print("WARNING: Unmatched teams found!")
+ print(f"Home teams: {home_unmatched['home_team'].unique()}")
+ print(f"Away teams: {away_unmatched['away_team'].unique()}")
+else:
+ print("All teams matched successfully!")
+```
+
+## Timezone Handling
+
+Different data sources use different timezone conventions:
+
+| Source | Format | Example | Notes |
+|--------|--------|---------|-------|
+| KenPom | - | - | No timestamps in team data |
+| ESPN | Various | - | Check specific endpoint |
+| Overtime.ag | ISO 8601 UTC | `2026-01-31T13:55:20.225134Z` | 'Z' suffix = UTC |
+| The Odds API | ISO 8601 UTC | `2026-01-31T20:00:00Z` | 'Z' suffix = UTC |
+
+### Converting Overtime.ag timestamps
+
+```python
+import pandas as pd
+
+df = pd.read_parquet('data/overtime/2026-01-31.parquet')
+
+# Parse UTC timestamp
+df['captured_at_utc'] = pd.to_datetime(df['captured_at'], utc=True)
+
+# Convert to local timezone (e.g., Pacific)
+df['captured_at_local'] = df['captured_at_utc'].dt.tz_convert('America/Los_Angeles')
+
+# For game times, combine date and time strings (e.g., "Feb 5" + "7:00 PM")
+# The strings carry no year, so prepend the season year explicitly
+df['game_datetime'] = pd.to_datetime(
+    '2026 ' + df['game_date_str'] + ' ' + df['game_time_str'],
+    format='%Y %b %d %I:%M %p'
+)
+```
+
+## Maintenance
+
+### Adding New Manual Mappings
+
+If new teams appear in data sources with unmatched names:
+
+1. Identify the canonical KenPom name
+2. Add to `MANUAL_*_MAPPINGS` dict in the appropriate script
+3. Re-run the mapping script
+
+Example for Overtime.ag:
+
+```python
+# In scripts/map_overtime_teams.py
+MANUAL_OVERTIME_MAPPINGS = {
+ # ... existing mappings ...
+ "New Team Name": "Canonical KenPom Name",
+}
+```
+
+### Updating for New Season
+
+When a new season starts:
+
+1. Collect new KenPom team data for the season
+2. Run `build_team_mapping.py` with new season year
+3. Re-run all mapping scripts to update coverage
+
+```python
+# In scripts/build_team_mapping.py, update:
+kenpom_df = load_kenpom_teams(season=2027) # New season
+```
+
+## Quality Checks
+
+### Validate Complete Coverage
+
+```python
+import pandas as pd
+
+mapping = pd.read_parquet('data/processed/team_mapping.parquet')
+
+# Check for teams without any external mappings
+no_external = mapping[
+ mapping['espn_id'].isna() &
+ mapping['overtime_name'].isna() &
+ mapping['odds_api_name'].isna()
+]
+
+print(f"Teams with no external data sources: {len(no_external)}")
+print(f"Conferences affected: {no_external['conference'].value_counts()}")
+```
+
+### Detect Duplicates
+
+```python
+import pandas as pd
+
+mapping = pd.read_parquet('data/processed/team_mapping.parquet')
+
+# Check for duplicate KenPom IDs
+dupes = mapping[mapping.duplicated(['kenpom_id'], keep=False)]
+if len(dupes) > 0:
+ print(f"WARNING: {len(dupes)} duplicate KenPom IDs found")
+ print(dupes[['canonical_team_id', 'kenpom_id', 'canonical_name']])
+```
+
+## Future Enhancements
+
+1. **Historical Team Names**: Track team name changes over seasons
+2. **Alternate Names**: Store common abbreviations and variations
+3. **Conference Changes**: Track conference realignment over time
+4. **Validation Suite**: Automated tests for mapping integrity
+5. **API Integration**: Provide REST endpoint for team lookups
diff --git a/docs/reference/walk-forward-validation.md b/docs/reference/walk-forward-validation.md
new file mode 100644
index 000000000..80ebfd9e9
--- /dev/null
+++ b/docs/reference/walk-forward-validation.md
@@ -0,0 +1,218 @@
+# Walk-Forward Validation for Sports Betting Models
+
+## Overview
+
+Walk-forward validation is critical for sports betting models because it:
+1. **Respects temporal ordering** - No lookahead bias
+2. **Simulates production** - Train on past, predict future
+3. **Reveals overfitting** - Tests on truly unseen data
+
+## Why Random Train/Test Split Fails
+
+Random splitting violates temporal causality:
+```python
+# WRONG: Random split
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+# Problem: Test set may contain games from BEFORE training games
+```
+
+This creates lookahead bias where the model "sees the future" during training.
+
+## Walk-Forward Approach
+
+Train on chronologically earlier data, test on chronologically later data:
+
+```
+Timeline: [----Training Period----][----Test Period----]
+ Nov - Dec - Jan 15 Jan 16 - Jan 31
+
+Train: Games from Nov-Jan 15
+Test: Games from Jan 16-31 (strictly AFTER training period)
+```
+
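+The split itself reduces to a pair of date-range filters plus a no-overlap guard; a minimal sketch using ISO date strings, which sort chronologically (names and record shape are illustrative):
+
+```python
+def walk_forward_split(games, train_start, train_end, test_start, test_end):
+    """Temporal split: the training period strictly precedes the test period."""
+    if test_start <= train_end:
+        raise ValueError("test period must start after the training period ends")
+    train = [g for g in games if train_start <= g["date"] <= train_end]
+    test = [g for g in games if test_start <= g["date"] <= test_end]
+    return train, test
+```
+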
+## Current Implementation
+
+### Training Script
+
+```bash
+uv run python scripts/train_walkforward.py \
+    --model-type spreads \
+    --train-start 2026-01-24 \
+    --train-end 2026-01-27 \
+    --test-start 2026-01-28 \
+    --test-end 2026-01-31 \
+    --output models/spreads_walkforward.json
+```
+
+### Validation Rules
+
+The script enforces:
+1. `test_start` must be > `train_end` (no overlap)
+2. Features are aligned between train/test sets
+3. Datasets built separately for each period
+4. Temporal ordering strictly maintained
+
+## Results (Jan 2026 Data)
+
+### Spreads Model
+```
+Training: 214 games (Jan 24-27)
+Test: 201 games (Jan 28-31)
+
+Train Accuracy: 100.00% <- Severe overfitting!
+Test Accuracy: 56.72%
+Train AUC: 1.0000
+Test AUC: 0.4636 <- Worse than random (0.50)
+```
+
+**Top Features**: fav_adj_em, fav_o_vs_dog_d, fav_adj_o
+
+### Totals Model
+```
+Training: 154 games (Jan 24-27)
+Test: 234 games (Jan 28-31)
+
+Train Accuracy: 100.00% <- Severe overfitting!
+Test Accuracy: 55.13%
+Train AUC: 1.0000
+Test AUC: 0.5661 <- Slightly better than random
+```
+
+**Top Features**: home_adj_em, away_dto_pct, away_defg_pct, total_defense
+
+## Analysis
+
+### Severe Overfitting Detected
+
+Both models show:
+- **100% train accuracy** - Memorizing training data
+- **Test AUC ≈ 0.5** - No better than coin flip
+- **Limited data** - Only 154-214 training games
+
+This is expected because:
+1. Small dataset (need 1,000+ games minimum)
+2. High feature count (30-31 features)
+3. Default hyperparameters (no regularization)
+4. Limited temporal coverage (only 4 days training)
+
+### Walk-Forward Reveals Truth
+
+A random split would have shown better, but misleading, results due to lookahead bias.
+Walk-forward correctly reveals that the model is not learning generalizable patterns.
+
+## Data Availability
+
+Current Odds API database:
+- **Odds data**: 2025-12-28 onwards (observations table)
+- **Scores**: 2026-01-24 to 2026-01-31 only (369 games)
+
+### Limitation
+
+Can only use data where we have BOTH odds and scores:
+- Training period: Jan 24-27 (first 4 days with scores)
+- Test period: Jan 28-31 (next 4 days)
+
+This is insufficient for production models; full-season coverage is needed.
+
+## Next Steps
+
+### 1. Collect More Data (Critical)
+
+Continue collecting Odds API data through March 2026:
+```bash
+# Daily collection
+uv run python scripts/collect_odds_stream.py \
+    --sport basketball_ncaab \
+    --interval 30 \
+    --regions us,us2 \
+    --markets h2h,spreads,totals
+```
+
+Target dataset size: 3,000+ games (full season Nov-Mar)
+
+### 2. Implement Rolling Walk-Forward
+
+As more data accumulates, implement rolling window:
+
+```
+Window 1: Train Nov-Dec, Test Jan Week 1
+Window 2: Train Nov-Jan, Test Jan Week 2
+Window 3: Train Nov-Feb, Test Feb Week 1
+...
+```
+
+This gives multiple test periods for robust evaluation.
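The rolling schedule can be generated with an expanding-train / fixed-test iterator (a sketch under assumed window boundaries; names like `expanding_windows` are illustrative):

```python
from datetime import date, timedelta

def expanding_windows(season_start, season_end, test_days=7):
    # Expanding training window; fixed-length test window stepped forward each round.
    test_start = season_start + timedelta(days=test_days)
    while test_start + timedelta(days=test_days) <= season_end:
        test_end = test_start + timedelta(days=test_days)
        yield (season_start, test_start - timedelta(days=1)), (test_start, test_end)
        test_start = test_end

windows = list(expanding_windows(date(2025, 11, 1), date(2026, 3, 1)))
```

Each yielded pair is `(train_range, test_range)`, so every test week is scored by a model trained only on strictly earlier games.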
+
+### 3. Add Regularization
+
+Current model overfits due to no regularization. Try:
+```python
+XGBClassifier(
+ n_estimators=50, # Reduce from 100
+ max_depth=3, # Reduce from 5
+ min_child_weight=5, # Add minimum samples per leaf
+ learning_rate=0.05, # Slower learning
+ subsample=0.8, # Row sampling
+ colsample_bytree=0.8, # Feature sampling
+ reg_alpha=1.0, # L1 regularization
+ reg_lambda=1.0, # L2 regularization
+)
+```
+
+### 4. Feature Selection
+
+Drop low-importance features (< 0.02 importance):
+- Currently using 30-31 features
+- Target: 15-20 most important features
+- Reduces overfitting risk
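The pruning rule above reduces to a threshold filter over per-feature importances (a sketch; the importance values shown are made up, and in practice they would come from a fitted model's `feature_importances_`):

```python
def select_features(importances: dict[str, float], threshold: float = 0.02) -> list[str]:
    # Keep only features whose importance clears the threshold.
    return [name for name, imp in importances.items() if imp >= threshold]

importances = {"fav_adj_em": 0.18, "fav_adj_o": 0.05, "noise_col": 0.004}
kept = select_features(importances)
```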
+
+### 5. Evaluate CLV, Not Just Accuracy
+
+Accuracy is misleading for betting. Use:
+- **Closing Line Value (CLV)**: Beat closing lines consistently
+- **ROI**: Return on investment vs flat betting
+- **Calibration**: Do predicted probabilities match reality?
+
+Example:
+```python
+def american_to_prob(odds: int) -> float:
+    # American odds -> implied probability (vig not removed)
+    return 100 / (odds + 100) if odds > 0 else -odds / (-odds + 100)
+
+# Model: favorite covers 60%; closing -122 implies ~55% -> CLV ~ +5% (value bet if correct)
+clv = 0.60 - american_to_prob(-122)
+```
+
+## Production Workflow (Future)
+
+Once sufficient data is collected:
+
+```bash
+# 1. Build datasets for new period
+uv run python scripts/build_training_datasets.py \
+    --start 2026-02-01 --end 2026-02-15
+
+# 2. Train with walk-forward
+uv run python scripts/train_walkforward.py \
+    --model-type spreads \
+    --train-start 2025-11-01 \
+    --train-end 2026-01-31 \
+    --test-start 2026-02-01 \
+    --test-end 2026-02-15
+
+# 3. Evaluate CLV on test period
+# 4. If CLV positive, deploy for next period
+# 5. Retrain weekly with new data
+```
+
+## Key Takeaways
+
+1. ✅ **Walk-forward validation implemented** - No lookahead bias
+2. ⚠️ **Models currently overfit** - 100% train accuracy, ~56% test accuracy
+3. ⚠️ **Limited data** - Only 4 days training, need full season
+4. ✅ **Framework ready** - Can retrain as more data comes in
+5. 🎯 **Focus on CLV** - Accuracy is secondary to beating closing lines
+
+## References
+
+- Training script: `scripts/train_walkforward.py`
+- Feature engineering: `src/sports_betting_edge/services/feature_engineering.py`
+- Odds collection: `scripts/collect_odds_stream.py`
+- Current models: `models/spreads_walkforward.json`, `models/totals_walkforward.json`
diff --git a/odds/README.md b/odds/README.md
new file mode 100644
index 000000000..5b70f054d
--- /dev/null
+++ b/odds/README.md
@@ -0,0 +1,41 @@
+# Odds pipeline (Postgres + GitHub Actions)
+
+This folder contains a minimal, repo-local odds data pipeline designed for:
+- **rolling 5-day freshness windows**
+- strict **canonical normalization** (spreads/totals/moneylines)
+- scheduled orchestration via **GitHub Actions**
+
+## Required secrets / env vars
+
+- `DATABASE_URL` (required): Postgres connection string (persistent store).
+- `ODDS_API_KEY` (collector key; required once collectors are enabled).
+- `KENPOM_EMAIL` / `KENPOM_PASSWORD` (required for KenPom scraping via kenpompy).
+
+Optional:
+- `WINDOW_DAYS` (default 5)
+- `ODDS_STALE_MINUTES` (default 180)
+- `SCORES_STALE_HOURS` (default 24)
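These optional knobs can be read with their documented defaults applied (a sketch of the expected pattern; the actual loader lives in `odds_pipeline.config` and the `freshness_settings` name here is hypothetical):

```python
import os

def freshness_settings() -> dict[str, int]:
    # Fall back to the documented defaults when the env vars are unset.
    return {
        "window_days": int(os.getenv("WINDOW_DAYS", "5")),
        "odds_stale_minutes": int(os.getenv("ODDS_STALE_MINUTES", "180")),
        "scores_stale_hours": int(os.getenv("SCORES_STALE_HOURS", "24")),
    }

settings = freshness_settings()
```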
+
+## Additional sources
+
+- **ESPN**: schedules + historical scores via the public scoreboard endpoint (stored in `raw_games_snapshots`).
+- **Action Network**: best-effort HTML scrape from `actionnetwork.com` to supplement game IDs/links (stored in `raw_games_snapshots`).
+- **KenPom**: team metrics scraped via `kenpompy` (stored in `raw_kenpom_team_metrics`).
+
+## Initialize schema
+
+From repo root, using **pip**:
+
+```bash
+python -m pip install -e odds
+python -m odds_pipeline.schema
+```
+
+Or with **uv**:
+
+```bash
+cd odds
+uv sync
+uv run python -m odds_pipeline.schema
+```
+
diff --git a/odds/pyproject.toml b/odds/pyproject.toml
new file mode 100644
index 000000000..b206167a4
--- /dev/null
+++ b/odds/pyproject.toml
@@ -0,0 +1,23 @@
+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "odds-pipeline"
+version = "0.1.0"
+description = "Odds collection/normalization pipeline for rolling windows"
+requires-python = ">=3.11"
+dependencies = [
+ "psycopg[binary]>=3.2.0",
+ "requests>=2.32.0",
+ "pydantic>=2.10.0",
+ "kenpompy>=0.3.5",
+ "pandas>=2.2.0",
+]
+
+[tool.setuptools]
+package-dir = {"" = "src"}
+
+[tool.setuptools.packages.find]
+where = ["src"]
+
diff --git a/odds/requirements.txt b/odds/requirements.txt
new file mode 100644
index 000000000..3d1444335
--- /dev/null
+++ b/odds/requirements.txt
@@ -0,0 +1,6 @@
+psycopg[binary]>=3.2.0
+requests>=2.32.0
+pydantic>=2.10.0
+kenpompy>=0.3.5
+pandas>=2.2.0
+
diff --git a/odds/sql/schema.sql b/odds/sql/schema.sql
new file mode 100644
index 000000000..58fe2392e
--- /dev/null
+++ b/odds/sql/schema.sql
@@ -0,0 +1,164 @@
+-- Odds pipeline persistent schema (Postgres)
+-- All timestamps are stored as timestamptz (UTC).
+
+CREATE TABLE IF NOT EXISTS raw_odds_snapshots (
+ id BIGSERIAL PRIMARY KEY,
+ source TEXT NOT NULL,
+ sport TEXT NOT NULL,
+ event_id TEXT NOT NULL,
+ commence_time TIMESTAMPTZ,
+ home_team TEXT,
+ away_team TEXT,
+ bookmaker_key TEXT,
+ market_key TEXT NOT NULL,
+ outcome_name TEXT NOT NULL,
+ price INTEGER,
+ point NUMERIC,
+ collected_at TIMESTAMPTZ NOT NULL,
+ raw JSONB,
+ UNIQUE (source, event_id, bookmaker_key, market_key, outcome_name, collected_at)
+);
+
+CREATE INDEX IF NOT EXISTS raw_odds_snapshots_collected_at_idx
+ ON raw_odds_snapshots (collected_at DESC);
+
+CREATE INDEX IF NOT EXISTS raw_odds_snapshots_event_idx
+ ON raw_odds_snapshots (sport, event_id);
+
+CREATE TABLE IF NOT EXISTS raw_scores_snapshots (
+ id BIGSERIAL PRIMARY KEY,
+ source TEXT NOT NULL,
+ sport TEXT NOT NULL,
+ event_id TEXT NOT NULL,
+ commence_time TIMESTAMPTZ,
+ home_team TEXT,
+ away_team TEXT,
+ completed BOOLEAN,
+ home_score INTEGER,
+ away_score INTEGER,
+ last_update TIMESTAMPTZ,
+ collected_at TIMESTAMPTZ NOT NULL,
+ raw JSONB,
+ UNIQUE (source, event_id, collected_at)
+);
+
+CREATE INDEX IF NOT EXISTS raw_scores_snapshots_collected_at_idx
+ ON raw_scores_snapshots (collected_at DESC);
+
+CREATE INDEX IF NOT EXISTS raw_scores_snapshots_event_idx
+ ON raw_scores_snapshots (sport, event_id);
+
+-- Generic external games feed (schedules + scores) from public endpoints like ESPN
+-- and HTML-scraped sources like Action Network. This is intentionally flexible.
+CREATE TABLE IF NOT EXISTS raw_games_snapshots (
+ id BIGSERIAL PRIMARY KEY,
+ source TEXT NOT NULL, -- e.g. 'espn', 'actionnetwork'
+ sport TEXT NOT NULL,
+ external_event_id TEXT NOT NULL,
+ commence_time TIMESTAMPTZ,
+ home_team TEXT,
+ away_team TEXT,
+ status TEXT,
+ home_score INTEGER,
+ away_score INTEGER,
+ collected_at TIMESTAMPTZ NOT NULL,
+ raw JSONB,
+ UNIQUE (source, sport, external_event_id, collected_at)
+);
+
+CREATE INDEX IF NOT EXISTS raw_games_snapshots_collected_at_idx
+ ON raw_games_snapshots (collected_at DESC);
+
+CREATE INDEX IF NOT EXISTS raw_games_snapshots_event_idx
+ ON raw_games_snapshots (sport, external_event_id);
+
+-- KenPom metrics snapshots (scraped via kenpompy; one row per team per scrape timestamp).
+CREATE TABLE IF NOT EXISTS raw_kenpom_team_metrics (
+ id BIGSERIAL PRIMARY KEY,
+ season INTEGER NOT NULL,
+ team TEXT NOT NULL,
+ metric_type TEXT NOT NULL, -- 'pomeroy_ratings', 'efficiency', 'four_factors', 'fanmatch'
+ collected_at TIMESTAMPTZ NOT NULL,
+ raw JSONB NOT NULL,
+ UNIQUE (season, team, metric_type, collected_at)
+);
+
+CREATE INDEX IF NOT EXISTS raw_kenpom_team_metrics_collected_at_idx
+ ON raw_kenpom_team_metrics (collected_at DESC);
+
+-- Canonical spreads: one row per event/book/collected_at.
+-- spread_magnitude is always positive; favorite/underdog teams are explicit.
+CREATE TABLE IF NOT EXISTS canonical_spreads (
+ id BIGSERIAL PRIMARY KEY,
+ sport TEXT NOT NULL,
+ event_id TEXT NOT NULL,
+ commence_time TIMESTAMPTZ,
+ bookmaker_key TEXT NOT NULL,
+ favorite_team TEXT NOT NULL,
+ underdog_team TEXT NOT NULL,
+ spread_magnitude NUMERIC NOT NULL,
+ favorite_price INTEGER,
+ underdog_price INTEGER,
+ collected_at TIMESTAMPTZ NOT NULL,
+ UNIQUE (event_id, bookmaker_key, collected_at, spread_magnitude)
+);
+
+CREATE INDEX IF NOT EXISTS canonical_spreads_collected_at_idx
+ ON canonical_spreads (collected_at DESC);
+
+CREATE INDEX IF NOT EXISTS canonical_spreads_event_idx
+ ON canonical_spreads (sport, event_id);
+
+-- Canonical totals: one row per event/book/collected_at.
+CREATE TABLE IF NOT EXISTS canonical_totals (
+ id BIGSERIAL PRIMARY KEY,
+ sport TEXT NOT NULL,
+ event_id TEXT NOT NULL,
+ commence_time TIMESTAMPTZ,
+ bookmaker_key TEXT NOT NULL,
+ total NUMERIC NOT NULL,
+ over_price INTEGER,
+ under_price INTEGER,
+ collected_at TIMESTAMPTZ NOT NULL,
+ UNIQUE (event_id, bookmaker_key, collected_at, total)
+);
+
+CREATE INDEX IF NOT EXISTS canonical_totals_collected_at_idx
+ ON canonical_totals (collected_at DESC);
+
+CREATE INDEX IF NOT EXISTS canonical_totals_event_idx
+ ON canonical_totals (sport, event_id);
+
+-- Canonical moneylines: store prices + implied probabilities.
+CREATE TABLE IF NOT EXISTS canonical_moneylines (
+ id BIGSERIAL PRIMARY KEY,
+ sport TEXT NOT NULL,
+ event_id TEXT NOT NULL,
+ commence_time TIMESTAMPTZ,
+ bookmaker_key TEXT NOT NULL,
+ home_team TEXT NOT NULL,
+ away_team TEXT NOT NULL,
+ home_price INTEGER,
+ away_price INTEGER,
+ home_implied_prob DOUBLE PRECISION,
+ away_implied_prob DOUBLE PRECISION,
+ collected_at TIMESTAMPTZ NOT NULL,
+ UNIQUE (event_id, bookmaker_key, collected_at)
+);
+
+CREATE INDEX IF NOT EXISTS canonical_moneylines_collected_at_idx
+ ON canonical_moneylines (collected_at DESC);
+
+CREATE INDEX IF NOT EXISTS canonical_moneylines_event_idx
+ ON canonical_moneylines (sport, event_id);
+
+-- Pipeline run log for auditing and freshness debugging.
+CREATE TABLE IF NOT EXISTS pipeline_runs (
+ id BIGSERIAL PRIMARY KEY,
+ job_name TEXT NOT NULL,
+ started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+ finished_at TIMESTAMPTZ,
+ status TEXT NOT NULL,
+ details JSONB
+);
+
diff --git a/odds/src/odds_pipeline/__init__.py b/odds/src/odds_pipeline/__init__.py
new file mode 100644
index 000000000..fe16459e0
--- /dev/null
+++ b/odds/src/odds_pipeline/__init__.py
@@ -0,0 +1,2 @@
+__all__ = []
+
diff --git a/odds/src/odds_pipeline/__main__.py b/odds/src/odds_pipeline/__main__.py
new file mode 100644
index 000000000..9a16077bd
--- /dev/null
+++ b/odds/src/odds_pipeline/__main__.py
@@ -0,0 +1,7 @@
+from __future__ import annotations
+
+from odds_pipeline.cli import main
+
+if __name__ == "__main__":
+ raise SystemExit(main())
+
diff --git a/odds/src/odds_pipeline/backfill.py b/odds/src/odds_pipeline/backfill.py
new file mode 100644
index 000000000..e34fd6dbb
--- /dev/null
+++ b/odds/src/odds_pipeline/backfill.py
@@ -0,0 +1,29 @@
+from __future__ import annotations
+
+from odds_pipeline.collect_odds import collect_odds
+from odds_pipeline.collect_scores import collect_scores
+from odds_pipeline.normalize import normalize_window
+
+
+def backfill(
+ *,
+ sport: str,
+ lookback_days: int = 5,
+ regions: str = "us",
+ markets: str = "h2h,spreads,totals",
+) -> dict[str, int]:
+ """
+ Bounded backfill for the rolling window.
+
+ Notes:
+ - Scores support a historical lookback via daysFrom.
+ - Odds endpoints are typically current/upcoming; backfill here re-collects the latest
+ and then re-normalizes to restore canonical tables.
+ """
+ out: dict[str, int] = {}
+ out["scores_inserted"] = collect_scores(sport=sport, days_from=lookback_days)
+ out["odds_inserted"] = collect_odds(sport=sport, regions=regions, markets=markets)
+ counts = normalize_window(window_days=lookback_days)
+ out.update({f"canonical_{k}": v for k, v in counts.items()})
+ return out
+
diff --git a/odds/src/odds_pipeline/cli.py b/odds/src/odds_pipeline/cli.py
new file mode 100644
index 000000000..a17982a63
--- /dev/null
+++ b/odds/src/odds_pipeline/cli.py
@@ -0,0 +1,174 @@
+from __future__ import annotations
+
+import argparse
+import json
+
+from odds_pipeline.collect_odds import collect_odds
+from odds_pipeline.collect_scores import collect_scores
+from odds_pipeline.backfill import backfill
+from odds_pipeline.collect_espn import collect_espn_last_days
+from odds_pipeline.collect_action_network import collect_action_network_odds_page
+from odds_pipeline.collect_kenpom import collect_kenpom_team_metrics
+from odds_pipeline.freshness import check_freshness
+from odds_pipeline.normalize import normalize_window
+from odds_pipeline.predict import predict
+from odds_pipeline.schema import init_schema
+from odds_pipeline.train import train
+from odds_pipeline.validate import main as validate_main
+
+
+def _cmd_init_schema(_: argparse.Namespace) -> int:
+ init_schema()
+ return 0
+
+
+def _cmd_collect_odds(args: argparse.Namespace) -> int:
+ inserted = collect_odds(
+ sport=args.sport,
+ regions=args.regions,
+ markets=args.markets,
+ odds_format=args.odds_format,
+ date_format=args.date_format,
+ )
+ print(json.dumps({"inserted": inserted}))
+ return 0
+
+
+def _cmd_collect_scores(args: argparse.Namespace) -> int:
+ inserted = collect_scores(sport=args.sport, days_from=args.days_from, date_format=args.date_format)
+ print(json.dumps({"inserted": inserted}))
+ return 0
+
+
+def _cmd_normalize(args: argparse.Namespace) -> int:
+ counts = normalize_window(window_days=args.window_days)
+ print(json.dumps(counts))
+ return 0
+
+
+def _cmd_freshness_guard(args: argparse.Namespace) -> int:
+ result = check_freshness(window_days=args.window_days)
+ print(json.dumps({"ok": result.ok, "details": result.details}))
+ return 0 if result.ok else 2
+
+
+def _cmd_train(args: argparse.Namespace) -> int:
+ result = train(window_days=args.window_days)
+ print(json.dumps(result.__dict__))
+ return 0
+
+
+def _cmd_predict(args: argparse.Namespace) -> int:
+ artifact = predict(model_version=args.model_version, window_days=args.window_days, limit=args.limit)
+ print(json.dumps({"model_version": artifact.model_version, "generated_at": artifact.generated_at, "rows": len(artifact.sample)}))
+ return 0
+
+
+def _cmd_validate(_: argparse.Namespace) -> int:
+ validate_main()
+ print(json.dumps({"ok": True}))
+ return 0
+
+
+def _cmd_backfill(args: argparse.Namespace) -> int:
+ result = backfill(
+ sport=args.sport,
+ lookback_days=args.lookback_days,
+ regions=args.regions,
+ markets=args.markets,
+ )
+ print(json.dumps(result))
+ return 0
+
+
+def _cmd_collect_espn(args: argparse.Namespace) -> int:
+ inserted = collect_espn_last_days(lookback_days=args.lookback_days)
+ print(json.dumps({"inserted": inserted}))
+ return 0
+
+
+def _cmd_collect_action(args: argparse.Namespace) -> int:
+ inserted = collect_action_network_odds_page()
+ print(json.dumps({"inserted": inserted}))
+ return 0
+
+
+def _cmd_collect_kenpom(args: argparse.Namespace) -> int:
+ inserted = collect_kenpom_team_metrics(season=args.season, metric_type=args.metric_type)
+ print(json.dumps({"inserted": inserted}))
+ return 0
+
+
+def build_parser() -> argparse.ArgumentParser:
+ p = argparse.ArgumentParser(prog="odds-pipeline")
+ sub = p.add_subparsers(required=True)
+
+ s = sub.add_parser("init-schema", help="Initialize Postgres schema")
+ s.set_defaults(func=_cmd_init_schema)
+
+ s = sub.add_parser("collect-odds", help="Collect odds snapshots")
+ s.add_argument("--sport", required=True)
+ s.add_argument("--regions", default="us")
+ s.add_argument("--markets", default="h2h,spreads,totals")
+ s.add_argument("--odds-format", default="american")
+ s.add_argument("--date-format", default="iso")
+ s.set_defaults(func=_cmd_collect_odds)
+
+ s = sub.add_parser("collect-scores", help="Collect rolling scores")
+ s.add_argument("--sport", required=True)
+ s.add_argument("--days-from", type=int, default=5)
+ s.add_argument("--date-format", default="iso")
+ s.set_defaults(func=_cmd_collect_scores)
+
+ s = sub.add_parser("normalize", help="Normalize raw -> canonical for window")
+ s.add_argument("--window-days", type=int, default=None)
+ s.set_defaults(func=_cmd_normalize)
+
+ s = sub.add_parser("freshness-guard", help="Fail if odds/scores are stale for window")
+ s.add_argument("--window-days", type=int, default=None)
+ s.set_defaults(func=_cmd_freshness_guard)
+
+ s = sub.add_parser("train", help="Train model artifacts (placeholder baseline)")
+ s.add_argument("--window-days", type=int, default=None)
+ s.set_defaults(func=_cmd_train)
+
+ s = sub.add_parser("predict", help="Generate prediction artifacts (placeholder baseline)")
+ s.add_argument("--model-version", required=True)
+ s.add_argument("--window-days", type=int, default=None)
+ s.add_argument("--limit", type=int, default=50)
+ s.set_defaults(func=_cmd_predict)
+
+ s = sub.add_parser("validate", help="Run fast invariants/normalization checks")
+ s.set_defaults(func=_cmd_validate)
+
+ s = sub.add_parser("backfill", help="Bounded rolling-window backfill and re-normalize")
+ s.add_argument("--sport", required=True)
+ s.add_argument("--lookback-days", type=int, default=5)
+ s.add_argument("--regions", default="us")
+ s.add_argument("--markets", default="h2h,spreads,totals")
+ s.set_defaults(func=_cmd_backfill)
+
+ s = sub.add_parser("collect-espn", help="Collect ESPN schedules/scores for last N days")
+ s.add_argument("--lookback-days", type=int, default=5)
+ s.set_defaults(func=_cmd_collect_espn)
+
+ s = sub.add_parser("collect-action-network", help="Collect Action Network NCAAB odds page (best-effort)")
+ s.set_defaults(func=_cmd_collect_action)
+
+ s = sub.add_parser("collect-kenpom", help="Collect KenPom team metrics via kenpompy")
+ s.add_argument("--season", type=int, default=2026)
+ s.add_argument("--metric-type", choices=["pomeroy_ratings", "efficiency", "four_factors"], default="pomeroy_ratings")
+ s.set_defaults(func=_cmd_collect_kenpom)
+
+ return p
+
+
+def main(argv: list[str] | None = None) -> int:
+ parser = build_parser()
+ args = parser.parse_args(argv)
+ return int(args.func(args))
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
+
diff --git a/odds/src/odds_pipeline/collect_action_network.py b/odds/src/odds_pipeline/collect_action_network.py
new file mode 100644
index 000000000..af6b8ffd9
--- /dev/null
+++ b/odds/src/odds_pipeline/collect_action_network.py
@@ -0,0 +1,69 @@
+from __future__ import annotations
+
+import json
+import re
+from dataclasses import dataclass
+
+import requests
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class ActionNetworkConfig:
+ # This is a public web page; we treat it as a best-effort HTML feed.
+ odds_url: str = "https://www.actionnetwork.com/ncaab/odds"
+
+
+_GAME_URL_RE = re.compile(r"https://www\.actionnetwork\.com/ncaab-game/[^)\s]+/(\d+)")
+
+
+def collect_action_network_odds_page() -> int:
+ """
+ Best-effort collection from Action Network's NCAAB odds page.
+
+ This is **not** a stable public API; we store raw HTML and extracted game IDs/URLs
+ as a schedules/scores supplement.
+ """
+ settings = load_settings()
+ cfg = ActionNetworkConfig()
+
+ resp = requests.get(cfg.odds_url, timeout=60, headers={"User-Agent": "Mozilla/5.0"})
+ resp.raise_for_status()
+ html = resp.text
+
+ collected_at = now_utc()
+ matches = list(_GAME_URL_RE.finditer(html))
+ inserted = 0
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ for m in matches:
+ game_id = m.group(1)
+ url = m.group(0)
+ raw = {"url": url, "html_len": len(html)}
+ cur.execute(
+ """
+ INSERT INTO raw_games_snapshots (
+ source, sport, external_event_id, collected_at, raw
+ ) VALUES (
+ %(source)s, %(sport)s, %(external_event_id)s, %(collected_at)s, %(raw)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "source": "actionnetwork",
+ "sport": "basketball_ncaab",
+ "external_event_id": str(game_id),
+ "collected_at": collected_at,
+ "raw": json.dumps(raw),
+ },
+ )
+ inserted += cur.rowcount
+ conn.commit()
+
+ return inserted
+
diff --git a/odds/src/odds_pipeline/collect_espn.py b/odds/src/odds_pipeline/collect_espn.py
new file mode 100644
index 000000000..917b93025
--- /dev/null
+++ b/odds/src/odds_pipeline/collect_espn.py
@@ -0,0 +1,104 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from datetime import date, datetime, timedelta
+from typing import Any
+
+import requests
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class EspnConfig:
+ base_url: str = "https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard"
+
+
+def _parse_dt(value: str | None) -> datetime | None:
+ if value is None:
+ return None
+ return datetime.fromisoformat(value.replace("Z", "+00:00"))
+
+
+def _ymd(d: date) -> str:
+ return d.strftime("%Y%m%d")
+
+
+def collect_espn_scoreboard(*, start_date: date, end_date: date) -> int:
+ """
+ Collect ESPN scoreboard events between start_date and end_date inclusive.
+
+ ESPN supports `dates=YYYYMMDD` and returns schedules + scores in `events`.
+ """
+ settings = load_settings()
+ cfg = EspnConfig()
+
+ inserted = 0
+ collected_at = now_utc()
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ d = start_date
+ while d <= end_date:
+ resp = requests.get(cfg.base_url, params={"dates": _ymd(d)}, timeout=60)
+ resp.raise_for_status()
+ payload: dict[str, Any] = resp.json()
+ for event in payload.get("events", []) or []:
+ external_event_id = str(event.get("id"))
+ commence_time = _parse_dt(event.get("date"))
+ competitions = event.get("competitions", []) or []
+ comp = competitions[0] if competitions else {}
+ status = (
+ (comp.get("status") or {}).get("type") or {}
+ ).get("name")
+ competitors = comp.get("competitors", []) or []
+ home = next((c for c in competitors if c.get("homeAway") == "home"), None)
+ away = next((c for c in competitors if c.get("homeAway") == "away"), None)
+ home_team = (((home or {}).get("team") or {}).get("displayName"))
+ away_team = (((away or {}).get("team") or {}).get("displayName"))
+ home_score = (home or {}).get("score")
+ away_score = (away or {}).get("score")
+
+ cur.execute(
+ """
+ INSERT INTO raw_games_snapshots (
+ source, sport, external_event_id, commence_time,
+ home_team, away_team, status, home_score, away_score,
+ collected_at, raw
+ ) VALUES (
+ %(source)s, %(sport)s, %(external_event_id)s, %(commence_time)s,
+ %(home_team)s, %(away_team)s, %(status)s, %(home_score)s, %(away_score)s,
+ %(collected_at)s, %(raw)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "source": "espn",
+ "sport": "basketball_ncaab",
+ "external_event_id": external_event_id,
+ "commence_time": commence_time,
+ "home_team": home_team,
+ "away_team": away_team,
+ "status": status,
+ "home_score": int(home_score) if home_score not in (None, "") else None,
+ "away_score": int(away_score) if away_score not in (None, "") else None,
+ "collected_at": collected_at,
+ "raw": json.dumps(event),
+ },
+ )
+ inserted += cur.rowcount
+ d = d + timedelta(days=1)
+
+ conn.commit()
+
+ return inserted
+
+
+def collect_espn_last_days(*, lookback_days: int = 5) -> int:
+ end = now_utc().date()
+ start = end - timedelta(days=int(lookback_days))
+ return collect_espn_scoreboard(start_date=start, end_date=end)
+
diff --git a/odds/src/odds_pipeline/collect_kenpom.py b/odds/src/odds_pipeline/collect_kenpom.py
new file mode 100644
index 000000000..19427d1be
--- /dev/null
+++ b/odds/src/odds_pipeline/collect_kenpom.py
@@ -0,0 +1,161 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from datetime import date
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class KenPomAuth:
+ email: str
+ password: str
+
+
+def _require_auth() -> KenPomAuth:
+ import os
+
+ email = os.getenv("KENPOM_EMAIL")
+ password = os.getenv("KENPOM_PASSWORD")
+ if not email or not password:
+ raise RuntimeError("KENPOM_EMAIL and KENPOM_PASSWORD are required for kenpompy scraping")
+ return KenPomAuth(email=email, password=password)
+
+
+def collect_kenpom_team_metrics(*, season: int, metric_type: str) -> int:
+ """
+ Collect KenPom metrics via kenpompy scraping.
+
+ metric_type supported:
+ - pomeroy_ratings
+ - efficiency
+ - four_factors
+ """
+ settings = load_settings()
+ auth = _require_auth()
+
+ # Import locally so dependency is optional outside this pipeline.
+ from kenpompy.utils import login
+ from kenpompy import misc, summary
+
+ browser = login(auth.email, auth.password)
+
+ if metric_type == "pomeroy_ratings":
+ df = misc.get_pomeroy_ratings(browser, season=str(season))
+ elif metric_type == "efficiency":
+ df = summary.get_efficiency(browser, season=str(season))
+ elif metric_type == "four_factors":
+ df = summary.get_fourfactors(browser, season=str(season))
+ else:
+ raise ValueError(f"Unsupported metric_type: {metric_type}")
+
+ collected_at = now_utc()
+ inserted = 0
+
+ # Store one JSON row per team (best-effort: locate a likely team column).
+ records = df.to_dict(orient="records")
+ team_key = None
+ for candidate in ("Team", "team", "TEAM"):
+ if records and candidate in records[0]:
+ team_key = candidate
+ break
+ if team_key is None:
+ # Fall back: store as a single blob under team='__all__'
+ records = [{"__all__": True, "rows": records}]
+ team_key = "__all__"
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ for r in records:
+ if team_key == "__all__":
+ # Fallback blob: always use literal team label "__all__"
+ team = "__all__"
+ else:
+ team = str(r.get(team_key) or "__unknown__")
+ cur.execute(
+ """
+ INSERT INTO raw_kenpom_team_metrics (
+ season, team, metric_type, collected_at, raw
+ ) VALUES (
+ %(season)s, %(team)s, %(metric_type)s, %(collected_at)s, %(raw)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "season": int(season),
+ "team": team,
+ "metric_type": metric_type,
+ "collected_at": collected_at,
+ "raw": json.dumps(r),
+ },
+ )
+ inserted += cur.rowcount
+ conn.commit()
+
+ return inserted
+
+
+def collect_kenpom_fanmatch(*, game_date: date) -> int:
+ """
+ Collect KenPom FanMatch predictions for a given date via kenpompy.
+ """
+ settings = load_settings()
+ auth = _require_auth()
+
+ from kenpompy.utils import login
+ from kenpompy.FanMatch import FanMatch
+
+ browser = login(auth.email, auth.password)
+ fm = FanMatch(browser, date=game_date.isoformat())
+ df = fm.fm_df
+ if df is None:
+ return 0
+
+ collected_at = now_utc()
+ inserted = 0
+ records = df.to_dict(orient="records")
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ for idx, r in enumerate(records):
+ # KenPom seasons use the end-year convention for college hoops.
+ # Example: games in Nov/Dec 2025 belong to the 2026 season.
+ season = game_date.year + 1 if game_date.month >= 7 else game_date.year
+
+ # FanMatch is conceptually game-level data, but we store it in the
+ # team-metrics table. Use a per-row key so the UNIQUE(season, team,
+ # metric_type, collected_at) constraint does not collapse rows.
+ # Prefer a matchup-like column if present; otherwise fall back to
+ # a synthetic stable row key.
+ matchup_key = None
+ for candidate in ("Matchup", "matchup", "Game", "game"):
+ if candidate in r and r[candidate]:
+ matchup_key = str(r[candidate])
+ break
+ team_key = matchup_key or f"__fanmatch_row_{idx}"
+
+ cur.execute(
+ """
+ INSERT INTO raw_kenpom_team_metrics (
+ season, team, metric_type, collected_at, raw
+ ) VALUES (
+ %(season)s, %(team)s, %(metric_type)s, %(collected_at)s, %(raw)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "season": int(season),
+ "team": team_key,
+ "metric_type": "fanmatch",
+ "collected_at": collected_at,
+ "raw": json.dumps(r),
+ },
+ )
+ inserted += cur.rowcount
+ conn.commit()
+
+ return inserted
+
diff --git a/odds/src/odds_pipeline/collect_odds.py b/odds/src/odds_pipeline/collect_odds.py
new file mode 100644
index 000000000..1c28e25f1
--- /dev/null
+++ b/odds/src/odds_pipeline/collect_odds.py
@@ -0,0 +1,122 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Any
+
+import requests
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class OddsApiConfig:
+ base_url: str = "https://api.the-odds-api.com/v4"
+
+
+def _parse_dt(value: str | None) -> datetime | None:
+ if value is None:
+ return None
+ # Odds API uses ISO timestamps with Z.
+ return datetime.fromisoformat(value.replace("Z", "+00:00"))
+
+
+def collect_odds(
+ *,
+ sport: str,
+ regions: str,
+ markets: str,
+ odds_format: str = "american",
+ date_format: str = "iso",
+) -> int:
+ """
+ Collect odds snapshots and append to raw_odds_snapshots.
+
+ Notes:
+ - This collector is intentionally simple and idempotent by UNIQUE constraint.
+ - Requires ODDS_API_KEY and DATABASE_URL.
+ """
+ settings = load_settings()
+ if not settings.odds_api_key:
+ raise RuntimeError("ODDS_API_KEY is required for collect_odds")
+
+ cfg = OddsApiConfig()
+ url = f"{cfg.base_url}/sports/{sport}/odds"
+ params = {
+ "apiKey": settings.odds_api_key,
+ "regions": regions,
+ "markets": markets,
+ "oddsFormat": odds_format,
+ "dateFormat": date_format,
+ }
+
+ resp = requests.get(url, params=params, timeout=60)
+ resp.raise_for_status()
+ payload: list[dict[str, Any]] = resp.json()
+
+ collected_at = now_utc()
+ inserted = 0
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ for event in payload:
+ event_id = str(event.get("id"))
+ commence_time = _parse_dt(event.get("commence_time"))
+ home_team = event.get("home_team")
+ away_team = event.get("away_team")
+ for bookmaker in event.get("bookmakers", []) or []:
+ bookmaker_key = bookmaker.get("key")
+ for market in bookmaker.get("markets", []) or []:
+ market_key = market.get("key")
+ for outcome in market.get("outcomes", []) or []:
+ outcome_name = outcome.get("name")
+ price = outcome.get("price")
+ point = outcome.get("point")
+ raw = {
+ "event": event,
+ "bookmaker": bookmaker_key,
+ "market": market_key,
+ "outcome": outcome,
+ }
+ cur.execute(
+ """
+ INSERT INTO raw_odds_snapshots (
+ source, sport, event_id, commence_time, home_team, away_team,
+ bookmaker_key, market_key, outcome_name, price, point, collected_at, raw
+ ) VALUES (
+ %(source)s, %(sport)s, %(event_id)s, %(commence_time)s, %(home_team)s, %(away_team)s,
+ %(bookmaker_key)s, %(market_key)s, %(outcome_name)s, %(price)s, %(point)s, %(collected_at)s, %(raw)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "source": "the_odds_api",
+ "sport": sport,
+ "event_id": event_id,
+ "commence_time": commence_time,
+ "home_team": home_team,
+ "away_team": away_team,
+ "bookmaker_key": bookmaker_key,
+ "market_key": market_key,
+ "outcome_name": outcome_name,
+ "price": int(price) if price is not None else None,
+ "point": float(point) if point is not None else None,
+ "collected_at": collected_at,
+ "raw": json.dumps(raw),
+ },
+ )
+ inserted += cur.rowcount
+ conn.commit()
+
+ return inserted
+
+
+if __name__ == "__main__":
+ # Not a standalone entry point; run via the CLI so settings and workflows apply.
+ raise SystemExit(
+ "Run via `python -m odds_pipeline.cli collect-odds --sport ...`"
+ )
+
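The collector above leans on a UNIQUE constraint plus `ON CONFLICT DO NOTHING` for idempotency. A minimal sketch of that pattern, using SQLite purely for illustration (the pipeline itself targets Postgres, and the column list here is abbreviated):

```python
import sqlite3

# A UNIQUE constraint plus ON CONFLICT DO NOTHING makes re-running a collector
# safe: replayed snapshots collapse into a single row instead of duplicating.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_odds_snapshots (
        event_id TEXT,
        bookmaker_key TEXT,
        market_key TEXT,
        outcome_name TEXT,
        collected_at TEXT,
        price INTEGER,
        UNIQUE (event_id, bookmaker_key, market_key, outcome_name, collected_at)
    )
    """
)

row = ("evt1", "fanduel", "h2h", "Home", "2026-02-05T12:00:00+00:00", -110)
for _ in range(2):  # simulate a retried/backfilled run replaying the same snapshot
    conn.execute(
        "INSERT INTO raw_odds_snapshots VALUES (?, ?, ?, ?, ?, ?) ON CONFLICT DO NOTHING",
        row,
    )
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM raw_odds_snapshots").fetchone()[0]
print(count)
```

Because duplicate inserts are silently dropped, `cur.rowcount` in the real collector counts only genuinely new rows.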
diff --git a/odds/src/odds_pipeline/collect_scores.py b/odds/src/odds_pipeline/collect_scores.py
new file mode 100644
index 000000000..c89f4db61
--- /dev/null
+++ b/odds/src/odds_pipeline/collect_scores.py
@@ -0,0 +1,107 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Any
+
+import requests
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class OddsApiConfig:
+ base_url: str = "https://api.the-odds-api.com/v4"
+
+
+def _parse_dt(value: str | None) -> datetime | None:
+ if value is None:
+ return None
+ return datetime.fromisoformat(value.replace("Z", "+00:00"))
+
+
+def collect_scores(*, sport: str, days_from: int = 5, date_format: str = "iso") -> int:
+ """
+ Collect completed/updated scores and append to raw_scores_snapshots.
+
+ Note: The Odds API caps the scores endpoint's daysFrom parameter (historically
+ at 3 days), so values above the cap may be rejected; daily runs accumulate
+ snapshots that together cover the full 5-day window.
+ """
+ settings = load_settings()
+ if not settings.odds_api_key:
+ raise RuntimeError("ODDS_API_KEY is required for collect_scores")
+
+ cfg = OddsApiConfig()
+ url = f"{cfg.base_url}/sports/{sport}/scores"
+ params = {
+ "apiKey": settings.odds_api_key,
+ "daysFrom": int(days_from),
+ "dateFormat": date_format,
+ }
+
+ resp = requests.get(url, params=params, timeout=60)
+ resp.raise_for_status()
+ payload: list[dict[str, Any]] = resp.json()
+
+ collected_at = now_utc()
+ inserted = 0
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ for event in payload:
+ event_id = str(event.get("id"))
+ commence_time = _parse_dt(event.get("commence_time"))
+ home_team = event.get("home_team")
+ away_team = event.get("away_team")
+ completed = event.get("completed")
+ last_update = _parse_dt(event.get("last_update"))
+
+ home_score = None
+ away_score = None
+ for score in event.get("scores", []) or []:
+ name = score.get("name")
+ value = score.get("score")
+ if value is None:
+ continue
+ if name == home_team:
+ home_score = int(value)
+ elif name == away_team:
+ away_score = int(value)
+
+ cur.execute(
+ """
+ INSERT INTO raw_scores_snapshots (
+ source, sport, event_id, commence_time, home_team, away_team,
+ completed, home_score, away_score, last_update, collected_at, raw
+ ) VALUES (
+ %(source)s, %(sport)s, %(event_id)s, %(commence_time)s, %(home_team)s, %(away_team)s,
+ %(completed)s, %(home_score)s, %(away_score)s, %(last_update)s, %(collected_at)s, %(raw)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "source": "the_odds_api",
+ "sport": sport,
+ "event_id": event_id,
+ "commence_time": commence_time,
+ "home_team": home_team,
+ "away_team": away_team,
+ "completed": bool(completed) if completed is not None else None,
+ "home_score": home_score,
+ "away_score": away_score,
+ "last_update": last_update,
+ "collected_at": collected_at,
+ "raw": json.dumps(event),
+ },
+ )
+ inserted += cur.rowcount
+ conn.commit()
+
+ return inserted
+
+
+if __name__ == "__main__":
+ raise SystemExit("Run via `python -m odds_pipeline.cli collect-scores --sport ...`")
+
diff --git a/odds/src/odds_pipeline/config.py b/odds/src/odds_pipeline/config.py
new file mode 100644
index 000000000..734888ee5
--- /dev/null
+++ b/odds/src/odds_pipeline/config.py
@@ -0,0 +1,43 @@
+from __future__ import annotations
+
+from datetime import timedelta
+from os import getenv
+
+from pydantic import BaseModel, Field
+
+
+class Settings(BaseModel):
+ database_url: str = Field(..., description="Postgres connection string", alias="DATABASE_URL")
+ odds_api_key: str | None = Field(default=None, description="API key for collectors", alias="ODDS_API_KEY")
+
+ # Defaults for orchestration. Override via env vars if needed.
+ window_days: int = Field(default=5, alias="WINDOW_DAYS")
+ odds_stale_minutes: int = Field(default=180, alias="ODDS_STALE_MINUTES")
+ scores_stale_hours: int = Field(default=24, alias="SCORES_STALE_HOURS")
+
+ @property
+ def window(self) -> timedelta:
+ return timedelta(days=int(self.window_days))
+
+ @property
+ def odds_stale_for(self) -> timedelta:
+ return timedelta(minutes=int(self.odds_stale_minutes))
+
+ @property
+ def scores_stale_for(self) -> timedelta:
+ return timedelta(hours=int(self.scores_stale_hours))
+
+
+def load_settings() -> Settings:
+ # Read env vars explicitly and validate via model_validate; field aliases map the env names.
+ env = {
+ "DATABASE_URL": getenv("DATABASE_URL"),
+ "ODDS_API_KEY": getenv("ODDS_API_KEY"),
+ "WINDOW_DAYS": getenv("WINDOW_DAYS"),
+ "ODDS_STALE_MINUTES": getenv("ODDS_STALE_MINUTES"),
+ "SCORES_STALE_HOURS": getenv("SCORES_STALE_HOURS"),
+ }
+ # Remove None keys so defaults apply.
+ env = {k: v for k, v in env.items() if v is not None}
+ return Settings.model_validate(env)
+
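The "drop unset env vars so model defaults apply" trick in `load_settings` generalizes. A hedged sketch with plain dicts (names here are illustrative; the real module delegates defaulting to pydantic field definitions):

```python
# Merge environment overrides onto defaults: only keys actually present in the
# environment replace a default; unset keys fall through untouched.
DEFAULTS = {"WINDOW_DAYS": 5, "ODDS_STALE_MINUTES": 180, "SCORES_STALE_HOURS": 24}

def load_overrides(environ: dict[str, str]) -> dict[str, int]:
    merged = dict(DEFAULTS)
    for key in DEFAULTS:
        raw = environ.get(key)
        if raw is not None:  # absent keys keep their defaults
            merged[key] = int(raw)
    return merged

print(load_overrides({"WINDOW_DAYS": "7"}))
```

Filtering out `None` values before validation is what lets pydantic's own defaults win when a variable is unset.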
diff --git a/odds/src/odds_pipeline/db.py b/odds/src/odds_pipeline/db.py
new file mode 100644
index 000000000..96ecf6a56
--- /dev/null
+++ b/odds/src/odds_pipeline/db.py
@@ -0,0 +1,22 @@
+from __future__ import annotations
+
+from contextlib import contextmanager
+from typing import Iterator
+
+import psycopg
+
+
+@contextmanager
+def connect(database_url: str) -> Iterator[psycopg.Connection]:
+ # autocommit=False so callers explicitly commit or rollback.
+ with psycopg.connect(database_url, autocommit=False) as conn:
+ yield conn
+
+
+def execute_sql_file(conn: psycopg.Connection, sql_path: str) -> None:
+ with open(sql_path, "r", encoding="utf-8") as f:
+ sql = f.read()
+ with conn.cursor() as cur:
+ cur.execute(sql)
+ conn.commit()
+
diff --git a/odds/src/odds_pipeline/freshness.py b/odds/src/odds_pipeline/freshness.py
new file mode 100644
index 000000000..ba5b93828
--- /dev/null
+++ b/odds/src/odds_pipeline/freshness.py
@@ -0,0 +1,77 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from datetime import datetime, timedelta
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class FreshnessResult:
+ ok: bool
+ details: dict[str, str]
+
+
+def _max_collected_at(conn, table: str, *, cutoff: datetime | None = None) -> datetime | None:
+ where = ""
+ params: dict[str, object] = {}
+ if cutoff is not None:
+ where = "WHERE collected_at >= %(cutoff)s"
+ params["cutoff"] = cutoff
+ with conn.cursor() as cur:
+ cur.execute(f"SELECT MAX(collected_at) FROM {table} {where}", params)
+ row = cur.fetchone()
+ return row[0] if row else None
+
+
+def check_freshness(*, window_days: int | None = None) -> FreshnessResult:
+ settings = load_settings()
+ window = timedelta(days=window_days if window_days is not None else settings.window_days)
+ cutoff = now_utc() - window
+
+ details: dict[str, str] = {}
+ ok = True
+
+ with connect(settings.database_url) as conn:
+ odds_max = _max_collected_at(conn, "raw_odds_snapshots", cutoff=cutoff)
+ scores_max = _max_collected_at(conn, "raw_scores_snapshots", cutoff=cutoff)
+ games_max = _max_collected_at(conn, "raw_games_snapshots", cutoff=cutoff)
+
+ now = now_utc()
+
+ if odds_max is None:
+ ok = False
+ details["odds"] = "missing"
+ else:
+ age = now - odds_max
+ if age > settings.odds_stale_for:
+ ok = False
+ details["odds"] = f"stale age={age}"
+ else:
+ details["odds"] = f"ok age={age}"
+
+ # Scores/schedules freshness can be satisfied by either the Odds API scores feed
+ # or an external games feed (e.g. ESPN scoreboard).
+ best_scores_max = scores_max
+ if games_max is not None and (best_scores_max is None or games_max > best_scores_max):
+ best_scores_max = games_max
+
+ if best_scores_max is None:
+ ok = False
+ details["scores"] = "missing"
+ else:
+ age = now - best_scores_max
+ if age > settings.scores_stale_for:
+ ok = False
+ details["scores"] = f"stale age={age}"
+ else:
+ details["scores"] = f"ok age={age}"
+
+ # Debug detail: show source maxima without requiring them.
+ details["scores_odds_api_max"] = scores_max.isoformat() if scores_max else "missing"
+ details["scores_external_max"] = games_max.isoformat() if games_max else "missing"
+
+ return FreshnessResult(ok=ok, details=details)
+
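The freshness rule above reduces to a three-way classification: a feed is "ok" when its newest in-window snapshot is younger than the staleness budget, "stale" otherwise, and "missing" when nothing falls inside the window. An illustrative helper (not part of the module):

```python
from datetime import datetime, timedelta, timezone

def classify(max_collected_at, now, stale_for):
    # Mirrors check_freshness: missing beats stale beats ok.
    if max_collected_at is None:
        return "missing"
    return "stale" if (now - max_collected_at) > stale_for else "ok"

now = datetime(2026, 2, 5, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=30)
old = now - timedelta(hours=6)
budget = timedelta(minutes=180)  # matches the ODDS_STALE_MINUTES default

print(classify(fresh, now, budget))
print(classify(old, now, budget))
print(classify(None, now, budget))
```

Keeping the comparison in `timedelta` space (rather than formatting ages to strings first) avoids timezone and rounding pitfalls.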
diff --git a/odds/src/odds_pipeline/normalize.py b/odds/src/odds_pipeline/normalize.py
new file mode 100644
index 000000000..2b4bc17a5
--- /dev/null
+++ b/odds/src/odds_pipeline/normalize.py
@@ -0,0 +1,222 @@
+from __future__ import annotations
+
+from datetime import timedelta
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.util import american_to_implied_prob, now_utc
+
+
+def normalize_window(*, window_days: int | None = None) -> dict[str, int]:
+ """
+ Normalize raw snapshots into canonical tables for the last N days.
+
+ This implementation is SQL-first for determinism and idempotency:
+ - canonical tables use UNIQUE constraints + ON CONFLICT DO NOTHING
+ - reruns are safe and will not create duplicates
+ """
+ settings = load_settings()
+ window = timedelta(days=window_days if window_days is not None else settings.window_days)
+ cutoff = now_utc() - window
+
+ counts: dict[str, int] = {"spreads": 0, "totals": 0, "moneylines": 0}
+
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ # Spreads (non-pick'em): favorite row (point < 0) paired with underdog row (point > 0).
+ cur.execute(
+ """
+ WITH fav AS (
+ SELECT
+ sport, event_id, commence_time, bookmaker_key, collected_at,
+ outcome_name AS favorite_team,
+ ABS(point)::numeric AS spread_magnitude,
+ price AS favorite_price
+ FROM raw_odds_snapshots
+ WHERE market_key = 'spreads'
+ AND point < 0
+ AND collected_at >= %(cutoff)s
+ ),
+ dog AS (
+ SELECT
+ sport, event_id, commence_time, bookmaker_key, collected_at,
+ outcome_name AS underdog_team,
+ ABS(point)::numeric AS spread_magnitude,
+ price AS underdog_price
+ FROM raw_odds_snapshots
+ WHERE market_key = 'spreads'
+ AND point > 0
+ AND collected_at >= %(cutoff)s
+ )
+ INSERT INTO canonical_spreads (
+ sport, event_id, commence_time, bookmaker_key,
+ favorite_team, underdog_team, spread_magnitude,
+ favorite_price, underdog_price, collected_at
+ )
+ SELECT
+ fav.sport, fav.event_id, fav.commence_time, fav.bookmaker_key,
+ fav.favorite_team, dog.underdog_team, fav.spread_magnitude,
+ fav.favorite_price, dog.underdog_price, fav.collected_at
+ FROM fav
+ JOIN dog
+ ON dog.sport = fav.sport
+ AND dog.event_id = fav.event_id
+ AND dog.bookmaker_key = fav.bookmaker_key
+ AND dog.collected_at = fav.collected_at
+ AND dog.spread_magnitude = fav.spread_magnitude
+ AND dog.commence_time IS NOT DISTINCT FROM fav.commence_time
+ ON CONFLICT DO NOTHING
+ """,
+ {"cutoff": cutoff},
+ )
+ counts["spreads"] = cur.rowcount
+
+ # Spreads (pick'em, point = 0): treat home team as "favorite" for canonicalization
+ # so PK games are preserved with spread_magnitude = 0.
+ cur.execute(
+ """
+ WITH pk AS (
+ SELECT
+ sport,
+ event_id,
+ commence_time,
+ bookmaker_key,
+ collected_at,
+ MAX(home_team) AS home_team,
+ MAX(away_team) AS away_team,
+ MAX(CASE WHEN outcome_name = home_team THEN price END) AS home_price,
+ MAX(CASE WHEN outcome_name = away_team THEN price END) AS away_price
+ FROM raw_odds_snapshots
+ WHERE market_key = 'spreads'
+ AND point = 0
+ AND collected_at >= %(cutoff)s
+ GROUP BY sport, event_id, commence_time, bookmaker_key, collected_at
+ )
+ INSERT INTO canonical_spreads (
+ sport, event_id, commence_time, bookmaker_key,
+ favorite_team, underdog_team, spread_magnitude,
+ favorite_price, underdog_price, collected_at
+ )
+ SELECT
+ sport,
+ event_id,
+ commence_time,
+ bookmaker_key,
+ home_team AS favorite_team,
+ away_team AS underdog_team,
+ 0::numeric AS spread_magnitude,
+ home_price AS favorite_price,
+ away_price AS underdog_price,
+ collected_at
+ FROM pk
+ ON CONFLICT DO NOTHING
+ """,
+ {"cutoff": cutoff},
+ )
+ counts["spreads"] += cur.rowcount
+
+ # Totals: one canonical total per event/book/collected_at using over+under prices on the same number.
+ cur.execute(
+ """
+ WITH over AS (
+ SELECT
+ sport, event_id, commence_time, bookmaker_key, collected_at,
+ point::numeric AS total,
+ price AS over_price
+ FROM raw_odds_snapshots
+ WHERE market_key = 'totals'
+ AND outcome_name ILIKE 'over%%' -- %% escapes a literal % when psycopg params are bound
+ AND collected_at >= %(cutoff)s
+ ),
+ under AS (
+ SELECT
+ sport, event_id, commence_time, bookmaker_key, collected_at,
+ point::numeric AS total,
+ price AS under_price
+ FROM raw_odds_snapshots
+ WHERE market_key = 'totals'
+ AND outcome_name ILIKE 'under%%'
+ AND collected_at >= %(cutoff)s
+ )
+ INSERT INTO canonical_totals (
+ sport, event_id, commence_time, bookmaker_key,
+ total, over_price, under_price, collected_at
+ )
+ SELECT
+ over.sport, over.event_id, over.commence_time, over.bookmaker_key,
+ over.total, over.over_price, under.under_price, over.collected_at
+ FROM over
+ JOIN under
+ ON under.sport = over.sport
+ AND under.event_id = over.event_id
+ AND under.bookmaker_key = over.bookmaker_key
+ AND under.collected_at = over.collected_at
+ AND under.total = over.total
+ AND under.commence_time IS NOT DISTINCT FROM over.commence_time
+ ON CONFLICT DO NOTHING
+ """,
+ {"cutoff": cutoff},
+ )
+ counts["totals"] = cur.rowcount
+
+ # Moneylines: keep prices and compute implied probs in Python for correctness.
+ cur.execute(
+ """
+ SELECT sport, event_id, commence_time, bookmaker_key, collected_at,
+ home_team, away_team,
+ MAX(CASE WHEN outcome_name = home_team THEN price END) AS home_price,
+ MAX(CASE WHEN outcome_name = away_team THEN price END) AS away_price
+ FROM raw_odds_snapshots
+ WHERE market_key = 'h2h'
+ AND collected_at >= %(cutoff)s
+ GROUP BY sport, event_id, commence_time, bookmaker_key, collected_at, home_team, away_team
+ """,
+ {"cutoff": cutoff},
+ )
+ rows = cur.fetchall()
+ for (
+ sport,
+ event_id,
+ commence_time,
+ bookmaker_key,
+ collected_at,
+ home_team,
+ away_team,
+ home_price,
+ away_price,
+ ) in rows:
+ home_prob = american_to_implied_prob(int(home_price)) if home_price is not None else None
+ away_prob = american_to_implied_prob(int(away_price)) if away_price is not None else None
+ cur.execute(
+ """
+ INSERT INTO canonical_moneylines (
+ sport, event_id, commence_time, bookmaker_key,
+ home_team, away_team, home_price, away_price,
+ home_implied_prob, away_implied_prob, collected_at
+ ) VALUES (
+ %(sport)s, %(event_id)s, %(commence_time)s, %(bookmaker_key)s,
+ %(home_team)s, %(away_team)s, %(home_price)s, %(away_price)s,
+ %(home_prob)s, %(away_prob)s, %(collected_at)s
+ )
+ ON CONFLICT DO NOTHING
+ """,
+ {
+ "sport": sport,
+ "event_id": event_id,
+ "commence_time": commence_time,
+ "bookmaker_key": bookmaker_key,
+ "home_team": home_team,
+ "away_team": away_team,
+ "home_price": int(home_price) if home_price is not None else None,
+ "away_price": int(away_price) if away_price is not None else None,
+ "home_prob": home_prob,
+ "away_prob": away_prob,
+ "collected_at": collected_at,
+ },
+ )
+ counts["moneylines"] += cur.rowcount
+
+ conn.commit()
+
+ return counts
+
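The spreads join above can be restated in Python for readers skimming the SQL: the favorite row (negative point) pairs with the underdog row (positive point) from the same snapshot, producing one canonical record with a non-negative magnitude. This is illustrative only, not code the pipeline runs:

```python
def canonicalize_spread(rows):
    # One bookmaker snapshot arrives as two rows with opposite-signed points;
    # collapse them into a single favorite-perspective record.
    fav = next(r for r in rows if r["point"] < 0)
    dog = next(r for r in rows if r["point"] > 0)
    assert abs(fav["point"]) == abs(dog["point"]), "mismatched spread pair"
    return {
        "favorite_team": fav["outcome_name"],
        "underdog_team": dog["outcome_name"],
        "spread_magnitude": abs(fav["point"]),
        "favorite_price": fav["price"],
        "underdog_price": dog["price"],
    }

snapshot = [
    {"outcome_name": "Duke", "point": -6.5, "price": -110},
    {"outcome_name": "UNC", "point": 6.5, "price": -110},
]
canonical = canonicalize_spread(snapshot)
print(canonical)
```

Note the invariant the CLAUDE.md rules demand: the ±6.5 rows are never averaged (which would yield 0); they are matched on equal magnitude and merged.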
diff --git a/odds/src/odds_pipeline/predict.py b/odds/src/odds_pipeline/predict.py
new file mode 100644
index 000000000..bf4186291
--- /dev/null
+++ b/odds/src/odds_pipeline/predict.py
@@ -0,0 +1,89 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from datetime import timedelta
+from pathlib import Path
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.freshness import check_freshness
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class PredictionArtifact:
+ model_version: str
+ window_days: int
+ generated_at: str
+ notes: str
+ sample: list[dict[str, object]]
+
+
+def _artifact_dir() -> Path:
+ return Path(__file__).resolve().parents[2] / "artifacts"
+
+
+def predict(*, model_version: str, window_days: int | None = None, limit: int = 50) -> PredictionArtifact:
+ settings = load_settings()
+ wd = int(window_days if window_days is not None else settings.window_days)
+
+ freshness = check_freshness(window_days=wd)
+ if not freshness.ok:
+ raise RuntimeError(f"Freshness check failed: {freshness.details}")
+
+ cutoff = now_utc() - timedelta(days=wd)
+
+ # Minimal placeholder predictions: emit latest moneyline implied probs as a baseline.
+ sample: list[dict[str, object]] = []
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ cur.execute(
+ """
+ SELECT DISTINCT ON (sport, event_id, bookmaker_key)
+ sport, event_id, bookmaker_key, collected_at,
+ home_team, away_team, home_implied_prob, away_implied_prob
+ FROM canonical_moneylines
+ WHERE collected_at >= %(cutoff)s
+ ORDER BY sport, event_id, bookmaker_key, collected_at DESC
+ LIMIT %(limit)s
+ """,
+ {"cutoff": cutoff, "limit": int(limit)},
+ )
+ for row in cur.fetchall():
+ (
+ sport,
+ event_id,
+ bookmaker_key,
+ collected_at,
+ home_team,
+ away_team,
+ home_p,
+ away_p,
+ ) = row
+ sample.append(
+ {
+ "sport": sport,
+ "event_id": event_id,
+ "bookmaker_key": bookmaker_key,
+ "collected_at": collected_at.isoformat() if collected_at else None,
+ "home_team": home_team,
+ "away_team": away_team,
+ "home_win_prob_baseline": home_p,
+ "away_win_prob_baseline": away_p,
+ }
+ )
+
+ artifact = PredictionArtifact(
+ model_version=model_version,
+ window_days=wd,
+ generated_at=now_utc().isoformat(),
+ notes="Baseline predictions = latest implied probabilities (placeholder).",
+ sample=sample,
+ )
+
+ out_dir = _artifact_dir()
+ out_dir.mkdir(parents=True, exist_ok=True)
+ (out_dir / "predictions.json").write_text(json.dumps(artifact.__dict__, indent=2), encoding="utf-8")
+ return artifact
+
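The `SELECT DISTINCT ON (...) ... ORDER BY ..., collected_at DESC` query above is Postgres-specific; the equivalent "latest row per key" logic looks like this in plain Python (illustrative stand-in, not used by the pipeline):

```python
def latest_per_key(rows):
    # Keep, for each (event, bookmaker) key, only the row with the newest
    # collected_at -- what DISTINCT ON + ORDER BY ... DESC does in Postgres.
    best = {}
    for row in rows:
        key = (row["event_id"], row["bookmaker_key"])
        if key not in best or row["collected_at"] > best[key]["collected_at"]:
            best[key] = row
    return list(best.values())

rows = [
    {"event_id": "e1", "bookmaker_key": "dk", "collected_at": "2026-02-05T10:00", "p": 0.51},
    {"event_id": "e1", "bookmaker_key": "dk", "collected_at": "2026-02-05T11:00", "p": 0.55},
]
latest = latest_per_key(rows)
print(latest)
```

ISO-8601 timestamps compare correctly as strings here because they share a fixed-width format and timezone.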
diff --git a/odds/src/odds_pipeline/schema.py b/odds/src/odds_pipeline/schema.py
new file mode 100644
index 000000000..c67515aff
--- /dev/null
+++ b/odds/src/odds_pipeline/schema.py
@@ -0,0 +1,18 @@
+from __future__ import annotations
+
+from pathlib import Path
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect, execute_sql_file
+
+
+def init_schema() -> None:
+ settings = load_settings()
+ schema_path = Path(__file__).resolve().parents[2] / "sql" / "schema.sql"
+ with connect(settings.database_url) as conn:
+ execute_sql_file(conn, str(schema_path))
+
+
+if __name__ == "__main__":
+ init_schema()
+
diff --git a/odds/src/odds_pipeline/train.py b/odds/src/odds_pipeline/train.py
new file mode 100644
index 000000000..bf8ce30be
--- /dev/null
+++ b/odds/src/odds_pipeline/train.py
@@ -0,0 +1,68 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from datetime import timedelta
+from pathlib import Path
+
+from odds_pipeline.config import load_settings
+from odds_pipeline.db import connect
+from odds_pipeline.freshness import check_freshness
+from odds_pipeline.util import now_utc
+
+
+@dataclass(frozen=True)
+class TrainResult:
+ model_version: str
+ window_days: int
+ trained_at: str
+ metrics: dict[str, float]
+
+
+def _artifact_dir() -> Path:
+ return Path(__file__).resolve().parents[2] / "artifacts"
+
+
+def train(*, window_days: int | None = None) -> TrainResult:
+ settings = load_settings()
+ wd = int(window_days if window_days is not None else settings.window_days)
+
+ freshness = check_freshness(window_days=wd)
+ if not freshness.ok:
+ raise RuntimeError(f"Freshness check failed: {freshness.details}")
+
+ cutoff = now_utc() - timedelta(days=wd)
+ # Minimal baseline "model": record basic counts so the pipeline is end-to-end deterministic.
+ with connect(settings.database_url) as conn:
+ with conn.cursor() as cur:
+ cur.execute(
+ """
+ SELECT
+ (SELECT COUNT(*) FROM canonical_spreads WHERE collected_at >= %(cutoff)s) AS spreads,
+ (SELECT COUNT(*) FROM canonical_totals WHERE collected_at >= %(cutoff)s) AS totals,
+ (SELECT COUNT(*) FROM canonical_moneylines WHERE collected_at >= %(cutoff)s) AS moneylines
+ """,
+ {"cutoff": cutoff},
+ )
+ spreads, totals, moneylines = cur.fetchone()
+
+ trained_at_dt = now_utc()
+ trained_at = trained_at_dt.isoformat()
+ # strftime avoids isoformat() edge cases (the +00:00 suffix survives string
+ # slicing whenever microseconds happen to be zero).
+ model_version = trained_at_dt.strftime("%Y%m%dT%H%M%SZ")
+ metrics = {
+ "rows_spreads": float(spreads),
+ "rows_totals": float(totals),
+ "rows_moneylines": float(moneylines),
+ }
+
+ result = TrainResult(
+ model_version=model_version,
+ window_days=wd,
+ trained_at=trained_at,
+ metrics=metrics,
+ )
+
+ out_dir = _artifact_dir()
+ out_dir.mkdir(parents=True, exist_ok=True)
+ (out_dir / "model.json").write_text(json.dumps(result.__dict__, indent=2), encoding="utf-8")
+ return result
+
diff --git a/odds/src/odds_pipeline/util.py b/odds/src/odds_pipeline/util.py
new file mode 100644
index 000000000..1cbf3bbd9
--- /dev/null
+++ b/odds/src/odds_pipeline/util.py
@@ -0,0 +1,16 @@
+from __future__ import annotations
+
+from datetime import datetime, timezone
+
+
+def now_utc() -> datetime:
+ return datetime.now(timezone.utc)
+
+
+def american_to_implied_prob(odds: int) -> float:
+ if odds == 0:
+ raise ValueError("American odds cannot be 0")
+ if odds < 0:
+ return abs(odds) / (abs(odds) + 100.0)
+ return 100.0 / (odds + 100.0)
+
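A worked check of the conversion above, including the overround ("vig") that the canonical moneyline rows expose: a symmetric -110/-110 market implies about 52.4% on each side, so the two implied probabilities sum above 1.0.

```python
def american_to_implied_prob(odds: int) -> float:
    # Same conversion as util.py: negative odds are favorites, positive underdogs.
    if odds == 0:
        raise ValueError("American odds cannot be 0")
    if odds < 0:
        return abs(odds) / (abs(odds) + 100.0)
    return 100.0 / (odds + 100.0)

p = american_to_implied_prob(-110)   # 110 / 210
q = american_to_implied_prob(150)    # 100 / 250
overround = 2 * american_to_implied_prob(-110)  # both sides priced at -110

print(round(p, 4), round(q, 4), round(overround, 4))
```

The overround exceeding 1.0 is why aggregation for ML should happen on (optionally de-vigged) implied probabilities rather than raw prices.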
diff --git a/odds/src/odds_pipeline/validate.py b/odds/src/odds_pipeline/validate.py
new file mode 100644
index 000000000..ee5d136f8
--- /dev/null
+++ b/odds/src/odds_pipeline/validate.py
@@ -0,0 +1,24 @@
+from __future__ import annotations
+
+from odds_pipeline.util import american_to_implied_prob
+
+
+def validate_normalization_math() -> None:
+ # Moneyline conversion sanity.
+ p1 = american_to_implied_prob(-110)
+ p2 = american_to_implied_prob(110)
+ if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
+ raise AssertionError("Implied probability out of range")
+
+ # Spread normalization rules are enforced by schema + normalize SQL:
+ # - spread_magnitude is ABS(point)
+ # - favorite row requires point < 0, underdog row requires point > 0
+
+
+def main() -> None:
+ validate_normalization_math()
+
+
+if __name__ == "__main__":
+ main()
+
diff --git a/package.json b/package.json
index 34dfe95cf..3c0a946e2 100644
--- a/package.json
+++ b/package.json
@@ -14,6 +14,7 @@
"check-format": "eslint --cache . && prettier --check --cache .;",
"docs": "npm run build && npm run docs:generate && npm run format",
"docs:generate": "node --experimental-strip-types scripts/generate-docs.ts",
+ "docs:lint": "node --experimental-strip-types scripts/check-docs.ts",
"start": "npm run build && node build/src/index.js",
"start-debug": "DEBUG=mcp:* DEBUG_COLORS=false npm run build && node build/src/index.js",
"test": "npm run build && node scripts/test.mjs",
diff --git a/scripts/analysis/.gitkeep b/scripts/analysis/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/scripts/analysis/analyze_bookmaker_accuracy.py b/scripts/analysis/analyze_bookmaker_accuracy.py
new file mode 100644
index 000000000..2774bce52
--- /dev/null
+++ b/scripts/analysis/analyze_bookmaker_accuracy.py
@@ -0,0 +1,290 @@
+"""Analyze bookmaker accuracy and generate reports.
+
+This script calculates accuracy metrics for all bookmakers and generates
+comparison reports to identify which books are most/least accurate.
+
+Usage:
+ # Analyze spread accuracy for all bookmakers
+ uv run python scripts/analysis/analyze_bookmaker_accuracy.py analyze-spreads \
+ --start-date 2025-11-01 --end-date 2026-02-01
+
+ # Analyze totals accuracy
+ uv run python scripts/analysis/analyze_bookmaker_accuracy.py analyze-totals
+
+ # Find systematic biases for a specific bookmaker
+ uv run python scripts/analysis/analyze_bookmaker_accuracy.py find-biases \
+ --book-key fanduel --market spreads
+
+ # Identify best bookmaker by spread range
+ uv run python scripts/analysis/analyze_bookmaker_accuracy.py best-by-range \
+ --market spreads
+"""
+
+import logging
+from datetime import date
+from pathlib import Path
+
+import typer
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.services.bookmaker_accuracy import BookmakerAccuracyAnalyzer
+
+logging.basicConfig(level=logging.INFO, format="%(message)s")
+logger = logging.getLogger(__name__)
+
+app = typer.Typer(help="Bookmaker accuracy analysis tools")
+
+
+DEFAULT_DB_PATH = "data/odds_api/odds_api.sqlite3"
+DEFAULT_SAVE_DIR = "data/analysis"
+
+
+@app.command()
+def analyze_spreads(
+ start_date: str = "2025-11-01",
+ end_date: str = "2026-02-01",
+ save_dir: str = DEFAULT_SAVE_DIR,
+ min_games: int = 50,
+ db_path: str = DEFAULT_DB_PATH,
+) -> None:
+ """Analyze spread accuracy for all bookmakers.
+
+ Calculates MAE, RMSE, and directional accuracy metrics.
+ """
+ logger.info("[OK] Analyzing spread accuracy...")
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Minimum games: {min_games}\n")
+
+ db = OddsAPIDatabase(db_path)
+ analyzer = BookmakerAccuracyAnalyzer(db)
+
+ # Get rankings
+ rankings = analyzer.rank_bookmakers(
+ market_type="spreads",
+ metric="mae",
+ start_date=date.fromisoformat(start_date),
+ end_date=date.fromisoformat(end_date),
+ min_games=min_games,
+ )
+
+ if len(rankings) == 0:
+ logger.warning("[WARNING] No bookmakers found with sufficient data")
+ return
+
+ # Display results
+ logger.info("=" * 80)
+ logger.info("SPREAD ACCURACY RANKINGS (by Mean Absolute Error)")
+ logger.info("=" * 80)
+ logger.info(
+ f"{'Rank':<6}{'Bookmaker':<20}{'MAE':<10}{'RMSE':<10}{'Dir. Acc.':<12}{'Games':<10}"
+ )
+ logger.info("-" * 80)
+
+ for _, row in rankings.iterrows():
+ logger.info(
+ f"{int(row['rank']):<6}{row['book_key']:<20}"
+ f"{row['mae']:<10.2f}{row['rmse']:<10.2f}"
+ f"{row['directional_accuracy']:<12.1%}{int(row['sample_size']):<10}"
+ )
+
+ logger.info("=" * 80)
+ logger.info(f"\nBest overall (MAE): {rankings.iloc[0]['book_key']}")
+ logger.info(f"Worst overall (MAE): {rankings.iloc[-1]['book_key']}")
+
+ # Save results
+ save_path = Path(save_dir) / "bookmaker_accuracy_spreads.csv"
+ save_path.parent.mkdir(parents=True, exist_ok=True)
+ rankings.to_csv(save_path, index=False)
+
+ logger.info(f"\n[OK] Results saved to: {save_path}")
+
+
+@app.command()
+def analyze_totals(
+ start_date: str = "2025-11-01",
+ end_date: str = "2026-02-01",
+ save_dir: str = DEFAULT_SAVE_DIR,
+ min_games: int = 50,
+ db_path: str = DEFAULT_DB_PATH,
+) -> None:
+ """Analyze totals accuracy for all bookmakers.
+
+ Calculates MAE, RMSE, and over/under percentages.
+ """
+ logger.info("[OK] Analyzing totals accuracy...")
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Minimum games: {min_games}\n")
+
+ db = OddsAPIDatabase(db_path)
+ analyzer = BookmakerAccuracyAnalyzer(db)
+
+ rankings = analyzer.rank_bookmakers(
+ market_type="totals",
+ metric="mae",
+ start_date=date.fromisoformat(start_date),
+ end_date=date.fromisoformat(end_date),
+ min_games=min_games,
+ )
+
+ if len(rankings) == 0:
+ logger.warning("[WARNING] No bookmakers found with sufficient data")
+ return
+
+ # Display results
+ logger.info("=" * 80)
+ logger.info("TOTALS ACCURACY RANKINGS (by Mean Absolute Error)")
+ logger.info("=" * 80)
+ logger.info(f"{'Rank':<6}{'Bookmaker':<20}{'MAE':<10}{'RMSE':<10}{'Over %':<12}{'Games':<10}")
+ logger.info("-" * 80)
+
+ for _, row in rankings.iterrows():
+ logger.info(
+ f"{int(row['rank']):<6}{row['book_key']:<20}"
+ f"{row['mae']:<10.2f}{row['rmse']:<10.2f}"
+ f"{row['over_pct']:<12.1%}{int(row['sample_size']):<10}"
+ )
+
+ logger.info("=" * 80)
+ logger.info(f"\nBest overall (MAE): {rankings.iloc[0]['book_key']}")
+ logger.info(f"Worst overall (MAE): {rankings.iloc[-1]['book_key']}")
+
+ # Save results
+ save_path = Path(save_dir) / "bookmaker_accuracy_totals.csv"
+ save_path.parent.mkdir(parents=True, exist_ok=True)
+ rankings.to_csv(save_path, index=False)
+
+ logger.info(f"\n[OK] Results saved to: {save_path}")
+
+
+@app.command()
+def find_biases(
+ book_key: str = "fanduel",
+ market: str = "spreads",
+ start_date: str | None = None,
+ end_date: str | None = None,
+ db_path: str = DEFAULT_DB_PATH,
+) -> None:
+ """Detect systematic biases for a specific bookmaker.
+
+ Args:
+ book_key: Bookmaker to analyze (e.g., fanduel, pinnacle)
+ market: Market type (spreads or totals)
+ start_date: Optional start date (YYYY-MM-DD)
+ end_date: Optional end date (YYYY-MM-DD)
+ """
+ logger.info(f"[OK] Analyzing systematic biases: {book_key} ({market})\n")
+
+ db = OddsAPIDatabase(db_path)
+ analyzer = BookmakerAccuracyAnalyzer(db)
+
+ start = date.fromisoformat(start_date) if start_date else None
+ end = date.fromisoformat(end_date) if end_date else None
+
+ biases = analyzer.detect_systematic_biases(
+ book_key=book_key, market_type=market, start_date=start, end_date=end
+ )
+
+ logger.info("=" * 60)
+ logger.info(f"SYSTEMATIC BIAS ANALYSIS: {book_key}")
+ logger.info("=" * 60)
+ logger.info(f"Market: {market}")
+ logger.info(f"Sample size: {biases['sample_size']} games\n")
+
+ if market == "spreads":
+ logger.info(f"Favorite cover rate: {biases.get('favorite_cover_pct', 0):.1%}")
+ logger.info("Expected (efficient market): 50.0%\n")
+
+ if biases["overestimates_favorites"]:
+ logger.info("[WARNING] BIAS DETECTED: Overestimates favorites")
+ logger.info("-> Recommendation: Bet underdogs at this book")
+ elif biases["overestimates_underdogs"]:
+ logger.info("[WARNING] BIAS DETECTED: Overestimates underdogs")
+ logger.info("-> Recommendation: Bet favorites at this book")
+ else:
+ logger.info("[OK] No significant bias detected (within 48-52%)")
+
+ elif market == "totals":
+ logger.info(f"Over rate: {biases.get('over_pct', 0):.1%}")
+ logger.info("Expected (efficient market): 50.0%\n")
+
+ if biases["overestimates_overs"]:
+ logger.info("[WARNING] BIAS DETECTED: Overestimates overs")
+ logger.info("-> Recommendation: Bet unders at this book")
+ elif biases["overestimates_unders"]:
+ logger.info("[WARNING] BIAS DETECTED: Overestimates unders")
+ logger.info("-> Recommendation: Bet overs at this book")
+ else:
+ logger.info("[OK] No significant bias detected (within 48-52%)")
+
+ logger.info("=" * 60)
+
+
+@app.command()
+def best_by_range(
+ market: str = "spreads",
+ start_date: str | None = None,
+ end_date: str | None = None,
+ db_path: str = DEFAULT_DB_PATH,
+) -> None:
+ """Identify which bookmaker is most accurate for each range.
+
+ For spreads: 0-3, 3.5-7, 7.5-20
+ For totals: 0-135, 135-150, 150-200
+ """
+ logger.info(f"[OK] Finding best bookmakers by {market} range...\n")
+
+ db = OddsAPIDatabase(db_path)
+ analyzer = BookmakerAccuracyAnalyzer(db)
+
+ start = date.fromisoformat(start_date) if start_date else None
+ end = date.fromisoformat(end_date) if end_date else None
+
+ best_by_range_result = analyzer.identify_best_by_range(
+ market_type=market, start_date=start, end_date=end
+ )
+
+ if not best_by_range_result:
+ logger.warning("[WARNING] Insufficient data for range analysis")
+ return
+
+ logger.info("=" * 60)
+ logger.info(f"BEST BOOKMAKER BY {market.upper()} RANGE")
+ logger.info("=" * 60)
+
+ for range_key, book_key in best_by_range_result.items():
+ logger.info(f"{range_key:20s} -> {book_key}")
+
+ logger.info("=" * 60)
+
+
+@app.command()
+def database_stats(db_path: str = DEFAULT_DB_PATH) -> None:
+ """Show database coverage statistics."""
+ logger.info("[OK] Database coverage statistics\n")
+
+ db = OddsAPIDatabase(db_path)
+ stats = db.get_database_stats()
+
+ logger.info("=" * 60)
+ logger.info("DATABASE STATISTICS")
+ logger.info("=" * 60)
+ logger.info(f"Total events: {stats['total_events']}")
+ logger.info(f"Events with scores: {stats['events_with_scores']}")
+ logger.info(f"Coverage: {stats['events_with_scores'] / max(stats['total_events'], 1):.1%}\n")
+
+ logger.info(f"Date range: {stats['date_range'][0]} to {stats['date_range'][1]}\n")
+
+ logger.info("Bookmaker coverage (spreads):")
+ logger.info(f"{'Bookmaker':<20}{'Games':<10}{'Coverage %':<12}")
+ logger.info("-" * 60)
+
+    for book in stats["bookmaker_coverage"][:10]:  # Show top 10
+        coverage = f"{book['coverage_pct']:.1f}%"
+        logger.info(f"{book['book_key']:<20}{book['games_covered']:<10}{coverage:<12}")
+
+ logger.info("=" * 60)
+
+
+if __name__ == "__main__":
+ app()
diff --git a/scripts/analysis/analyze_clv_multi_day.py b/scripts/analysis/analyze_clv_multi_day.py
new file mode 100644
index 000000000..109aa32ad
--- /dev/null
+++ b/scripts/analysis/analyze_clv_multi_day.py
@@ -0,0 +1,479 @@
+#!/usr/bin/env python3
+"""Multi-day CLV analysis using Overtime lines and ESPN scores.
+
+Analyzes closing line value across multiple dates to identify:
+- Systematic edges (favorites vs underdogs, overs vs unders)
+- Sharp money accuracy over time
+- Profitable betting patterns
+
+Usage:
+    uv run python scripts/analysis/analyze_clv_multi_day.py \\
+        --start 2026-01-31 --end 2026-02-05
+    uv run python scripts/analysis/analyze_clv_multi_day.py \\
+        --start 2026-01-31 --end 2026-02-05 --output data/clv_analysis.csv
+    uv run python scripts/analysis/analyze_clv_multi_day.py \\
+        --start 2026-01-31 --end 2026-02-05 --collect-scores
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import subprocess
+from datetime import datetime, timedelta
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from sqlalchemy import select
+
+from sports_betting_edge.adapters.database import (
+ OvertimeLineSnapshotDB,
+ OvertimeOpeningLineDB,
+ create_database_engine,
+ make_session_factory,
+)
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+logger = logging.getLogger(__name__)
+
+# Team name mappings (ESPN -> Overtime)
+TEAM_NAME_MAP = {
+ "Queens University": "Queens NC",
+ "UMBC": "MD Baltimore Co",
+ "UAlbany": "Albany NY",
+ "Mount St. Mary's": "Mt. St. Marys",
+ "Saint Peter's": "St. Peters",
+ "Saint Francis": "St. Francis PA",
+ "Long Island University": "Long Island",
+ "UNC Wilmington": "NC Wilmington",
+ "Charleston": "Coll Of Charleston",
+ "Florida Gulf Coast": "Fla Gulf Coast",
+ "Omaha": "Nebraska Omaha",
+ "Southeast Missouri State": "SE Missouri State",
+ "UT Martin": "Tennessee Martin",
+ "UC Santa Barbara": "Cal Santa Barbara",
+ "Northern Colorado": "No. Colorado",
+ "UC Riverside": "Cal Riverside",
+ "Cal State Fullerton": "CS Fullerton",
+ "Cal Poly": "Cal Poly SLO",
+ "Cal State Northridge": "CS Northridge",
+ "Cal State Bakersfield": "CS Bakersfield",
+ "UC Irvine": "Cal Irvine",
+}
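The mapping above is applied later with `Series.replace`, which substitutes exact full names (not substrings). A minimal standalone sketch of the same lookup, using an assumed two-entry subset of the map, that passes unmapped names through unchanged:

```python
# Assumed subset of the ESPN -> Overtime mapping, for illustration only
TEAM_NAME_MAP = {
    "UMBC": "MD Baltimore Co",
    "Cal Poly": "Cal Poly SLO",
}


def normalize_team(name: str) -> str:
    """Map an ESPN team name to its Overtime equivalent; pass through unmapped names."""
    return TEAM_NAME_MAP.get(name, name)
```

Unmapped names surviving this lookup unchanged is what makes later inner joins silently drop games whose names differ between sources.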
+
+
+def collect_scores_for_range(start_date: str, end_date: str) -> None:
+ """Run ESPN score collection for date range.
+
+ Args:
+ start_date: Start date (YYYY-MM-DD)
+ end_date: End date (YYYY-MM-DD)
+ """
+ logger.info("Collecting ESPN scores from %s to %s", start_date, end_date)
+ cmd = [
+ "uv",
+ "run",
+ "python",
+ "scripts/backfill_espn_scores.py",
+ "--start",
+ start_date,
+ "--end",
+ end_date,
+ ]
+ subprocess.run(cmd, check=True)
+
+
+def get_opening_lines(session, start_date: str, end_date: str) -> pd.DataFrame:
+ """Get all opening lines for date range.
+
+ Args:
+ session: SQLAlchemy session
+ start_date: Start date (YYYY-MM-DD)
+ end_date: End date (YYYY-MM-DD)
+
+ Returns:
+ DataFrame with opening lines
+ """
+    # Parse date bounds; add one day so the end bound is exclusive while
+    # end_date itself is still included
+    start_dt = datetime.strptime(start_date, "%Y-%m-%d")
+    end_dt = datetime.strptime(end_date, "%Y-%m-%d") + timedelta(days=1)
+
+ # Get all opening lines in date range (inclusive of end_date)
+ stmt = select(OvertimeOpeningLineDB).where(
+ OvertimeOpeningLineDB.opened_at >= start_dt,
+ OvertimeOpeningLineDB.opened_at < end_dt,
+ )
+ openings = session.execute(stmt).scalars().all()
+
+ records = []
+ for opening in openings:
+ records.append(
+ {
+ "game_id": opening.game_id,
+ "category": opening.category,
+ "opened_at": opening.opened_at,
+ "game_date": opening.game_date_str,
+ "away_team": opening.away_team,
+ "home_team": opening.home_team,
+ "open_spread": opening.spread_magnitude,
+ "open_favorite": opening.favorite_team,
+ "open_fav_price": opening.spread_favorite_price,
+ "open_dog_price": opening.spread_underdog_price,
+ "open_total": opening.total_points,
+ "open_over_price": opening.total_over_price,
+ "open_under_price": opening.total_under_price,
+ }
+ )
+
+ return pd.DataFrame(records)
+
+
+def get_closing_lines(session, game_ids: list[str]) -> pd.DataFrame:
+ """Get latest pre-game snapshots (closing lines) for games.
+
+ Args:
+ session: SQLAlchemy session
+ game_ids: List of game IDs
+
+ Returns:
+ DataFrame with closing lines
+ """
+ records = []
+
+ for game_id in game_ids:
+ # Get all snapshots for this game
+ stmt = (
+ select(OvertimeLineSnapshotDB)
+ .where(OvertimeLineSnapshotDB.game_id == game_id)
+ .order_by(OvertimeLineSnapshotDB.captured_at.desc())
+ )
+ snapshots = session.execute(stmt).scalars().all()
+
+ # Find latest pre-game snapshot (filter out live betting lines)
+ for snapshot in snapshots:
+ # Check if this looks like pre-game (not live betting)
+ spread_ok = snapshot.spread_magnitude is None or snapshot.spread_magnitude >= 0.5
+ total_ok = snapshot.total_points is None or snapshot.total_points >= 100.0
+
+ if spread_ok and total_ok:
+ records.append(
+ {
+ "game_id": snapshot.game_id,
+ "close_spread": snapshot.spread_magnitude,
+ "close_favorite": snapshot.favorite_team,
+ "close_fav_price": snapshot.spread_favorite_price,
+ "close_dog_price": snapshot.spread_underdog_price,
+ "close_total": snapshot.total_points,
+ "close_over_price": snapshot.total_over_price,
+ "close_under_price": snapshot.total_under_price,
+ "closed_at": snapshot.captured_at,
+ }
+ )
+ break
+
+ return pd.DataFrame(records)
+
+
+def load_espn_scores(db_path: str, start_date: str, end_date: str) -> pd.DataFrame:
+ """Load ESPN scores from database for date range.
+
+ Args:
+ db_path: Path to database
+ start_date: Start date (YYYY-MM-DD)
+ end_date: End date (YYYY-MM-DD)
+
+ Returns:
+ DataFrame with scores
+ """
+ db = OddsAPIDatabase(db_path)
+
+    # Parameterize the query rather than interpolating dates into the SQL string
+    query = """
+        SELECT
+            game_date,
+            away_team,
+            away_score,
+            home_team,
+            home_score,
+            completed
+        FROM espn_scores
+        WHERE game_date >= ?
+          AND game_date <= ?
+          AND completed = 1
+    """
+
+    scores_df = pd.read_sql_query(query, db.conn, params=(start_date, end_date))
+
+ # Apply team name mapping
+ scores_df["away_team_normalized"] = scores_df["away_team"].replace(TEAM_NAME_MAP)
+ scores_df["home_team_normalized"] = scores_df["home_team"].replace(TEAM_NAME_MAP)
+
+ logger.info("Loaded %d ESPN scores", len(scores_df))
+ return scores_df
+
+
+def calculate_clv_metrics(
+ openings: pd.DataFrame, closings: pd.DataFrame, scores: pd.DataFrame
+) -> pd.DataFrame:
+ """Calculate CLV metrics for all games.
+
+ Args:
+ openings: Opening lines DataFrame
+ closings: Closing lines DataFrame
+ scores: Scores DataFrame
+
+ Returns:
+ DataFrame with CLV analysis
+ """
+ # Merge openings with closings
+ lines = openings.merge(closings, on="game_id", how="left")
+
+ # Calculate line movements
+ lines["spread_movement"] = lines["close_spread"] - lines["open_spread"]
+ lines["total_movement"] = lines["close_total"] - lines["open_total"]
+
+ # Merge with scores
+ matched = lines.merge(
+ scores,
+ left_on=["away_team", "home_team"],
+ right_on=["away_team_normalized", "home_team_normalized"],
+ how="inner",
+ suffixes=("_line", "_score"),
+ )
+
+ # Calculate results
+ matched["actual_margin"] = matched["home_score"] - matched["away_score"]
+ matched["actual_total"] = matched["home_score"] + matched["away_score"]
+
+ # Spread analysis
+ matched["fav_is_home"] = matched["close_favorite"] == matched["home_team_line"]
+ matched["fav_covers"] = np.where(
+ matched["fav_is_home"],
+ matched["actual_margin"] > matched["close_spread"],
+ -matched["actual_margin"] > matched["close_spread"],
+ )
+
+ # Total analysis
+ matched["over_covers"] = matched["actual_total"] > matched["close_total"]
+
+ # Sharp money analysis
+ matched["spread_moved_to_fav"] = matched["spread_movement"] > 0
+ matched["total_moved_up"] = matched["total_movement"] > 0
+
+ matched["spread_move_correct"] = matched["spread_moved_to_fav"] == matched["fav_covers"]
+ matched["total_move_correct"] = matched["total_moved_up"] == matched["over_covers"]
+
+ logger.info("Calculated CLV metrics for %d games", len(matched))
+ return matched
+
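On a toy frame, the cover and sharp-money columns computed above behave like this (values are illustrative only; `close_spread` is the canonical positive magnitude):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "fav_is_home": [True, False],
        "actual_margin": [7, 7],       # home_score - away_score
        "close_spread": [5.5, 9.5],    # canonical positive magnitude
        "spread_movement": [1.0, -0.5],
    }
)

# Favorite covers when its own margin beats the magnitude
toy["fav_covers"] = np.where(
    toy["fav_is_home"],
    toy["actual_margin"] > toy["close_spread"],
    -toy["actual_margin"] > toy["close_spread"],
)

# Sharp money is "correct" when the move direction matches the cover result
toy["spread_moved_to_fav"] = toy["spread_movement"] > 0
toy["spread_move_correct"] = toy["spread_moved_to_fav"] == toy["fav_covers"]
```

In the first row the home favorite wins by 7 against a 5.5 line and the line moved toward it; in the second the away favorite fails to cover 9.5 and the line had moved away, so both movements count as correct.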
+
+def print_summary(results: pd.DataFrame) -> None:
+ """Print comprehensive CLV summary.
+
+ Args:
+ results: Results DataFrame
+ """
+ if len(results) == 0:
+ print("\n[WARNING] No games to analyze")
+ return
+
+ print("\n" + "=" * 100)
+ print("MULTI-DAY CLOSING LINE VALUE (CLV) ANALYSIS")
+ print("=" * 100)
+ print()
+
+ # Date range
+ print(
+ f"Analysis period: {results['game_date_line'].min()} to {results['game_date_line'].max()}"
+ )
+ print(f"Total games analyzed: {len(results)}")
+ print(f"Games per day: {len(results) / results['game_date_line'].nunique():.1f}")
+ print()
+
+ # Overall spread performance
+ print("SPREAD PERFORMANCE (Against Closing Line):")
+ print("-" * 100)
+ fav_covers = results["fav_covers"].sum()
+ dog_covers = len(results) - fav_covers
+ fav_pct = 100 * fav_covers / len(results)
+ dog_pct = 100 * dog_covers / len(results)
+
+ print(f" Favorites: {fav_covers:3d} - {dog_covers:3d} ({fav_pct:5.1f}%)")
+ print(f" Underdogs: {dog_covers:3d} - {fav_covers:3d} ({dog_pct:5.1f}%)")
+ print(" Break-even: 52.4% at -110 odds")
+
+ if dog_pct > 52.4:
+ edge = dog_pct - 52.4
+ print(f" Result: UNDERDOGS have {edge:+.1f}% edge - PROFITABLE!")
+ elif fav_pct > 52.4:
+ edge = fav_pct - 52.4
+ print(f" Result: FAVORITES have {edge:+.1f}% edge - PROFITABLE!")
+ else:
+ print(" Result: No significant edge")
+ print()
+
+ # Overall total performance
+ print("TOTAL PERFORMANCE (Against Closing Line):")
+ print("-" * 100)
+ overs = results["over_covers"].sum()
+ unders = len(results) - overs
+ over_pct = 100 * overs / len(results)
+ under_pct = 100 * unders / len(results)
+
+ print(f" Overs: {overs:3d} - {unders:3d} ({over_pct:5.1f}%)")
+ print(f" Unders: {unders:3d} - {overs:3d} ({under_pct:5.1f}%)")
+ print(" Break-even: 52.4% at -110 odds")
+
+ if under_pct > 52.4:
+ edge = under_pct - 52.4
+ print(f" Result: UNDERS have {edge:+.1f}% edge - PROFITABLE!")
+ elif over_pct > 52.4:
+ edge = over_pct - 52.4
+ print(f" Result: OVERS have {edge:+.1f}% edge - PROFITABLE!")
+ else:
+ print(" Result: No significant edge")
+ print()
+
+ # Sharp money accuracy
+ print("SHARP MONEY ACCURACY:")
+ print("-" * 100)
+
+ # Spread movements
+ moved_spreads = results[results["spread_movement"].abs() >= 0.5]
+ if len(moved_spreads) > 0:
+ correct = moved_spreads["spread_move_correct"].sum()
+ accuracy = 100 * correct / len(moved_spreads)
+ print("Spread movements (>= 0.5 pts):")
+ pct = 100 * len(moved_spreads) / len(results)
+ print(f" Games with movement: {len(moved_spreads):3d} / {len(results):3d} ({pct:.1f}%)")
+ print(f" Sharp money accuracy: {correct:3d} / {len(moved_spreads):3d} ({accuracy:.1f}%)")
+ print(" Random chance: 50.0%")
+ if accuracy > 50:
+ print(f" Result: Sharp money PROFITABLE (edge: {accuracy - 50:+.1f}%)")
+ else:
+ print(f" Result: Sharp money UNPROFITABLE (fade for {50 - accuracy:+.1f}% edge)")
+ print()
+
+ # Total movements
+ moved_totals = results[results["total_movement"].abs() >= 0.5]
+ if len(moved_totals) > 0:
+ correct = moved_totals["total_move_correct"].sum()
+ accuracy = 100 * correct / len(moved_totals)
+ print("Total movements (>= 0.5 pts):")
+ pct = 100 * len(moved_totals) / len(results)
+ print(f" Games with movement: {len(moved_totals):3d} / {len(results):3d} ({pct:.1f}%)")
+ print(f" Sharp money accuracy: {correct:3d} / {len(moved_totals):3d} ({accuracy:.1f}%)")
+ print(" Random chance: 50.0%")
+ if accuracy > 50:
+ print(f" Result: Sharp money PROFITABLE (edge: {accuracy - 50:+.1f}%)")
+ else:
+ print(f" Result: Sharp money UNPROFITABLE (fade for {50 - accuracy:+.1f}% edge)")
+ print()
+
+ # Day-by-day breakdown
+ print("DAILY BREAKDOWN:")
+ print("-" * 100)
+ daily = results.groupby("game_date_line").agg(
+ {
+ "game_id": "count",
+ "fav_covers": "sum",
+ "over_covers": "sum",
+ }
+ )
+ daily["dog_covers"] = daily["game_id"] - daily["fav_covers"]
+ daily["under_covers"] = daily["game_id"] - daily["over_covers"]
+ daily["fav_pct"] = 100 * daily["fav_covers"] / daily["game_id"]
+ daily["over_pct"] = 100 * daily["over_covers"] / daily["game_id"]
+
+ print(f"{'Date':12s} {'Games':>6s} {'Fav%':>6s} {'Over%':>6s}")
+ for date, row in daily.iterrows():
+ print(f"{date:12s} {row['game_id']:6.0f} {row['fav_pct']:6.1f} {row['over_pct']:6.1f}")
+
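The 52.4% break-even figure printed above is the implied probability of standard -110 juice, using the same American-odds-to-probability conversion applied to moneylines. A minimal sketch:

```python
def implied_probability(american_odds: int) -> float:
    """Convert American odds to the implied break-even win probability."""
    if american_odds < 0:
        # Risk |odds| to win 100
        return -american_odds / (-american_odds + 100)
    # Risk 100 to win |odds|
    return 100 / (american_odds + 100)


# At -110 you risk 110 to win 100, so you must win 110/210 ~ 52.4% of bets
```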
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="Multi-day CLV analysis using Overtime lines and ESPN scores"
+ )
+ parser.add_argument("--start", required=True, help="Start date (YYYY-MM-DD)")
+ parser.add_argument("--end", required=True, help="End date (YYYY-MM-DD)")
+ parser.add_argument(
+ "--overtime-db",
+ type=Path,
+ default=Path("data/source/overtime/overtime_lines.db"),
+ help="Path to Overtime database",
+ )
+ parser.add_argument(
+ "--odds-db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to Odds API database (for ESPN scores)",
+ )
+ parser.add_argument(
+ "--output",
+ "-o",
+ type=Path,
+ help="Output CSV file for detailed results",
+ )
+ parser.add_argument(
+ "--collect-scores",
+ action="store_true",
+ help="Run ESPN score collection before analysis",
+ )
+ parser.add_argument("--verbose", "-v", action="store_true", help="Enable debug logging")
+
+ args = parser.parse_args()
+
+ # Configure logging
+ logging.basicConfig(
+ level=logging.DEBUG if args.verbose else logging.INFO,
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+ )
+
+ # Collect scores if requested
+ if args.collect_scores:
+ collect_scores_for_range(args.start, args.end)
+
+ # Get opening and closing lines from Overtime database
+ logger.info("Loading opening and closing lines from Overtime database")
+ overtime_url = f"sqlite:///{args.overtime_db}"
+ engine = create_database_engine(overtime_url)
+ SessionFactory = make_session_factory(engine)
+
+ with SessionFactory() as session:
+ openings = get_opening_lines(session, args.start, args.end)
+ logger.info("Loaded %d opening lines", len(openings))
+
+ if len(openings) == 0:
+ logger.error("No opening lines found for date range")
+ return
+
+ closings = get_closing_lines(session, openings["game_id"].tolist())
+ logger.info("Loaded %d closing lines", len(closings))
+
+ # Get ESPN scores
+ logger.info("Loading ESPN scores from database")
+ scores = load_espn_scores(str(args.odds_db), args.start, args.end)
+
+ if len(scores) == 0:
+ logger.warning("No ESPN scores found - run with --collect-scores first")
+ return
+
+ # Calculate CLV metrics
+ results = calculate_clv_metrics(openings, closings, scores)
+
+ if len(results) == 0:
+ logger.error("No matching games found between lines and scores")
+ return
+
+ # Save detailed results if requested
+ if args.output:
+ results.to_csv(args.output, index=False)
+ logger.info("Wrote detailed results to %s", args.output)
+
+ # Print summary
+ print_summary(results)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/analysis/analyze_clv_results.py b/scripts/analysis/analyze_clv_results.py
new file mode 100644
index 000000000..022addbad
--- /dev/null
+++ b/scripts/analysis/analyze_clv_results.py
@@ -0,0 +1,291 @@
+#!/usr/bin/env python3
+"""Analyze Closing Line Value (CLV) results against actual game outcomes.
+
+Compares opening/closing lines to final scores to determine:
+- If line movements were "sharp" (moved toward eventual winner)
+- CLV for betting opening vs. closing lines
+- Which side (favorites/underdogs, overs/unders) performed better
+
+Usage:
+ uv run python scripts/analysis/analyze_clv_results.py \\
+ data/feb5_line_movements_corrected.csv
+ uv run python scripts/analysis/analyze_clv_results.py \\
+ data/feb5_line_movements_corrected.csv --output data/clv_results.csv
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+logger = logging.getLogger(__name__)
+
+
+def load_scores(db_path: str) -> pd.DataFrame:
+ """Load scores from database.
+
+ Args:
+ db_path: Path to SQLite database
+
+ Returns:
+ DataFrame with scores
+ """
+ db = OddsAPIDatabase(db_path)
+
+ query = """
+ SELECT
+ e.home_team,
+ e.away_team,
+ e.commence_time,
+ s.home_score,
+ s.away_score,
+ s.completed
+ FROM scores s
+ JOIN events e ON s.event_id = e.event_id
+ WHERE s.completed = 1
+ AND s.home_score IS NOT NULL
+ AND s.away_score IS NOT NULL
+ """
+
+ scores_df = pd.read_sql_query(query, db.conn)
+ logger.info("Loaded %d completed games with scores", len(scores_df))
+ return scores_df
+
+
+def calculate_clv_metrics(movements: pd.DataFrame, scores: pd.DataFrame) -> pd.DataFrame:
+ """Calculate CLV metrics by matching line movements to scores.
+
+ Args:
+ movements: Line movements DataFrame
+ scores: Scores DataFrame
+
+ Returns:
+ DataFrame with CLV analysis
+ """
+ # Merge on team names
+ merged = movements.merge(
+ scores,
+ on=["home_team", "away_team"],
+ how="left",
+ suffixes=("", "_score"),
+ )
+
+ results = []
+
+ for _, row in merged.iterrows():
+ if pd.isna(row["home_score"]) or pd.isna(row["away_score"]):
+ # No score available
+ continue
+
+ # Calculate actual margin (home team perspective)
+ actual_margin = row["home_score"] - row["away_score"]
+ actual_total = row["home_score"] + row["away_score"]
+
+        # Determine favorite/underdog; close_spread is a canonical positive
+        # magnitude, so the favorite covers when its winning margin exceeds it
+        fav_is_home = row["close_favorite"] == row["home_team"]
+        close_spread = row["close_spread"]
+
+        # Calculate spread result vs. closing line (favorite perspective)
+        spread_margin = actual_margin if fav_is_home else -actual_margin
+        fav_covers = spread_margin > close_spread
+
+ # Calculate total result vs. closing line
+ over_covers = actual_total > row["close_total"]
+
+        # Analyze line movement accuracy (did sharp money call it right?)
+        spread_moved_to_fav = row["spread_movement"] > 0
+        spread_move_correct = spread_moved_to_fav == fav_covers
+
+        total_moved_up = row["total_movement"] > 0
+        total_move_correct = total_moved_up == over_covers
+
+ results.append(
+ {
+ "game_id": row["game_id"],
+ "away_team": row["away_team"],
+ "home_team": row["home_team"],
+ "away_score": row["away_score"],
+ "home_score": row["home_score"],
+ "actual_margin": actual_margin,
+ "actual_total": actual_total,
+ # Lines
+ "open_spread": row["open_spread"],
+ "close_spread": row["close_spread"],
+ "spread_movement": row["spread_movement"],
+ "open_total": row["open_total"],
+ "close_total": row["close_total"],
+ "total_movement": row["total_movement"],
+ # Results
+ "favorite": row["close_favorite"],
+ "fav_covers_close": fav_covers,
+ "over_covers_close": over_covers,
+ # Sharp money analysis
+ "spread_move_correct": spread_move_correct,
+ "total_move_correct": total_move_correct,
+ "spread_moved_to_fav": spread_moved_to_fav,
+ "total_moved_up": total_moved_up,
+ }
+ )
+
+ return pd.DataFrame(results)
+
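Under the project's canonical convention of a positive `spread_magnitude` plus a `favorite_team`, the cover check above reduces to a single comparison. A minimal sketch:

```python
def favorite_covers(
    home_score: int, away_score: int, favorite_is_home: bool, spread_magnitude: float
) -> bool:
    """Favorite covers when its winning margin exceeds the (positive) spread magnitude."""
    fav_margin = (home_score - away_score) if favorite_is_home else (away_score - home_score)
    return fav_margin > spread_magnitude
```

Keeping one positive magnitude per game avoids sign errors that creep in when the same spread is stored twice with opposite signs.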
+
+def print_clv_summary(results: pd.DataFrame) -> None:
+ """Print CLV analysis summary.
+
+ Args:
+ results: Results DataFrame
+ """
+ if len(results) == 0:
+ print("\n[WARNING] No games with scores available for CLV analysis")
+ return
+
+ print("\n=== CLOSING LINE VALUE (CLV) ANALYSIS ===")
+ print(f"Games analyzed: {len(results)}")
+ print()
+
+ # Spread analysis
+ print("SPREAD PERFORMANCE:")
+ print("-" * 80)
+ fav_covers = results["fav_covers_close"].sum()
+ dog_covers = len(results) - fav_covers
+ fav_pct = 100 * fav_covers / len(results)
+ dog_pct = 100 * dog_covers / len(results)
+ print(f" Favorites covered: {fav_covers} / {len(results)} ({fav_pct:.1f}%)")
+ print(f" Underdogs covered: {dog_covers} / {len(results)} ({dog_pct:.1f}%)")
+ print()
+
+ # Total analysis
+ print("TOTAL PERFORMANCE:")
+ print("-" * 80)
+ overs = results["over_covers_close"].sum()
+ unders = len(results) - overs
+ print(f" Overs hit: {overs} / {len(results)} ({100 * overs / len(results):.1f}%)")
+ print(f" Unders hit: {unders} / {len(results)} ({100 * unders / len(results):.1f}%)")
+ print()
+
+ # Line movement accuracy
+ moved_spreads = results[results["spread_movement"] != 0]
+ if len(moved_spreads) > 0:
+ correct_spread_moves = moved_spreads["spread_move_correct"].sum()
+ print("SHARP MONEY ACCURACY (Spread):")
+ print("-" * 80)
+ move_pct = 100 * len(moved_spreads) / len(results)
+ correct_pct = 100 * correct_spread_moves / len(moved_spreads)
+ games_moved = f"{len(moved_spreads)} / {len(results)}"
+ print(f" Games with line movement: {games_moved} ({move_pct:.1f}%)")
+ sharp_correct = f"{correct_spread_moves} / {len(moved_spreads)}"
+ print(f" Sharp money correct: {sharp_correct} ({correct_pct:.1f}%)")
+ print()
+
+ moved_totals = results[results["total_movement"] != 0]
+ if len(moved_totals) > 0:
+ correct_total_moves = moved_totals["total_move_correct"].sum()
+ print("SHARP MONEY ACCURACY (Total):")
+ print("-" * 80)
+ move_pct = 100 * len(moved_totals) / len(results)
+ correct_pct = 100 * correct_total_moves / len(moved_totals)
+ games_moved = f"{len(moved_totals)} / {len(results)}"
+ print(f" Games with line movement: {games_moved} ({move_pct:.1f}%)")
+ sharp_correct = f"{correct_total_moves} / {len(moved_totals)}"
+ print(f" Sharp money correct: {sharp_correct} ({correct_pct:.1f}%)")
+ print()
+
+ # Biggest wins/losses
+    print("BIGGEST SPREAD COVERS:")
+    print("-" * 80)
+    # Measure the cover margin from the favorite's perspective, not the home team's
+    fav_margin = results["actual_margin"].where(
+        results["favorite"] == results["home_team"], -results["actual_margin"]
+    )
+    results["spread_cover_margin"] = (fav_margin - results["close_spread"]).abs()
+    top_covers = results.nlargest(5, "spread_cover_margin")
+ for _, row in top_covers.iterrows():
+ print(
+ f" {row['away_team']:30s} {row['away_score']:3.0f} @ "
+ f"{row['home_team']:30s} {row['home_score']:3.0f}"
+ )
+ print(f" Closing line: {row['favorite']:30s} -{row['close_spread']:.1f}")
+ print(f" Actual margin: {row['actual_margin']:+.0f}")
+ print(f" Cover margin: {row['spread_cover_margin']:.1f} points")
+ print()
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(description="Analyze CLV results against actual game outcomes")
+ parser.add_argument(
+ "movements_file",
+ type=Path,
+ help="Line movements CSV file from compare_opening_closing_lines.py",
+ )
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database with scores (default: data/odds_api/odds_api.sqlite3)",
+ )
+ parser.add_argument(
+ "--output", "-o", type=Path, help="Output CSV file path (default: print summary only)"
+ )
+ parser.add_argument("--verbose", "-v", action="store_true", help="Enable debug logging")
+
+ args = parser.parse_args()
+
+ # Configure logging
+ logging.basicConfig(
+ level=logging.DEBUG if args.verbose else logging.INFO,
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+ )
+
+ # Load line movements
+ movements = pd.read_csv(args.movements_file)
+ logger.info("Loaded %d line movements from %s", len(movements), args.movements_file)
+
+ # Load scores
+ scores = load_scores(str(args.db))
+
+ if len(scores) == 0:
+ logger.warning("No scores available in database")
+ print("\n[WARNING] No completed games found in database")
+ print("Run score collection first:")
+ print(" uv run python scripts/backfill_espn_scores.py --start 2026-02-05 --end 2026-02-05")
+ return
+
+ # Calculate CLV metrics
+ results = calculate_clv_metrics(movements, scores)
+
+ if len(results) == 0:
+ logger.warning("No matching games with scores found")
+ print("\n[WARNING] Could not match line movements to scores")
+ print(f"Line movements: {len(movements)} games")
+ print(f"Scores available: {len(scores)} games")
+ print("Team names may not match - check team mapping")
+ return
+
+ # Output results
+ if args.output:
+ results.to_csv(args.output, index=False)
+ logger.info("Wrote %d CLV results to %s", len(results), args.output)
+
+ # Print summary
+ print_clv_summary(results)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/analysis/analyze_feature_importance.py b/scripts/analysis/analyze_feature_importance.py
new file mode 100644
index 000000000..c46823747
--- /dev/null
+++ b/scripts/analysis/analyze_feature_importance.py
@@ -0,0 +1,534 @@
+"""Analyze feature importance from trained XGBoost score regression models.
+
+This script loads the trained home/away score models and reports the
+most important KenPom statistics that predict game winners and scores.
+Uses both built-in XGBoost importance and SHAP values for interpretation.
+"""
+
+from __future__ import annotations
+
+import logging
+import pickle
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import shap
+import xgboost as xgb
+
+from sports_betting_edge.config.logging import configure_logging
+
+logger = logging.getLogger(__name__)
+
+# Feature names from train_score_models.py
+SCORE_FEATURES = [
+ # Home team KenPom stats
+ "home_adj_em",
+ "home_pythag",
+ "home_adj_o",
+ "home_adj_d",
+ "home_adj_t",
+ "home_luck",
+ "home_sos",
+ "home_efg_pct",
+ "home_to_pct",
+ "home_or_pct",
+ "home_ft_rate",
+ # Away team KenPom stats
+ "away_adj_em",
+ "away_pythag",
+ "away_adj_o",
+ "away_adj_d",
+ "away_adj_t",
+ "away_luck",
+ "away_sos",
+ "away_efg_pct",
+ "away_to_pct",
+ "away_or_pct",
+ "away_ft_rate",
+ # Combined features
+ "total_offense",
+ "avg_tempo",
+ "avg_luck",
+ "home_expected_pts",
+ "away_expected_pts",
+ "expected_total",
+ # Line features (if available)
+ "opening_total",
+ "closing_total",
+ "total_movement",
+]
+
+
+def load_model(model_path: Path) -> xgb.XGBRegressor:
+ """Load a pickled XGBoost regression model.
+
+ Args:
+ model_path: Path to .pkl model file
+
+ Returns:
+ Loaded XGBoost regressor
+ """
+ logger.info(f"Loading model from {model_path}...")
+ with open(model_path, "rb") as f:
+ model = pickle.load(f)
+ return model
+
+
+def analyze_builtin_importance(
+ model: xgb.XGBRegressor, model_name: str, features: list[str]
+) -> pd.DataFrame:
+ """Analyze feature importance using XGBoost's built-in metrics.
+
+ Args:
+ model: Trained XGBoost model
+ model_name: Name for logging (e.g., "Home Score")
+ features: List of feature names in order
+
+ Returns:
+ DataFrame with feature names and importance scores
+ """
+ logger.info(f"Analyzing built-in importance for {model_name} model...")
+
+ # Get feature importance (gain-based by default)
+ importance_dict = model.get_booster().get_score(importance_type="gain")
+
+ # Map feature indices to names (f0 -> actual_feature_name)
+ feature_map = {f"f{i}": name for i, name in enumerate(features)}
+
+ # Convert to DataFrame with actual feature names
+ importance_df = pd.DataFrame(
+ [
+ {
+ "feature": feature_map.get(k, k),
+ "feature_index": k,
+ "importance_gain": v,
+ }
+ for k, v in importance_dict.items()
+ ]
+ )
+
+ # Sort by importance
+ importance_df = importance_df.sort_values("importance_gain", ascending=False).reset_index(
+ drop=True
+ )
+
+ return importance_df
+
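When a model is trained on a bare array, `get_score` keys are positional (`f0`, `f1`, ...), and features never used in any split are simply absent from the dict. A toy illustration of the index-to-name mapping and the percentage shares reported later (all values hypothetical):

```python
features = ["home_adj_em", "away_adj_em", "avg_tempo"]  # hypothetical feature order
gain = {"f0": 120.0, "f2": 30.0}  # toy get_score(importance_type="gain"); f1 never split on

# Map positional keys back to feature names, then normalize gains to shares
feature_map = {f"f{i}": name for i, name in enumerate(features)}
total = sum(gain.values())
shares = {feature_map[k]: v / total for k, v in gain.items()}
```

Note that `away_adj_em` simply does not appear in the result, which is why the code above falls back to `feature_map.get(k, k)` rather than assuming every feature is present.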
+
+def analyze_shap_importance(
+ model: xgb.XGBRegressor, X_sample: pd.DataFrame, model_name: str
+) -> tuple[pd.DataFrame, shap.Explanation]:
+ """Analyze feature importance using SHAP values.
+
+ Args:
+ model: Trained XGBoost model
+ X_sample: Sample data for SHAP analysis (use validation set)
+ model_name: Name for logging
+
+ Returns:
+ Tuple of (importance DataFrame, SHAP explanation object)
+ """
+ logger.info(f"Computing SHAP values for {model_name} model (may take a few minutes)...")
+
+ # Create SHAP explainer
+ explainer = shap.TreeExplainer(model)
+
+ # Compute SHAP values (use a sample to speed up)
+ sample_size = min(500, len(X_sample))
+ X_shap = X_sample.sample(n=sample_size, random_state=42)
+ shap_values = explainer.shap_values(X_shap)
+
+ # Calculate mean absolute SHAP value for each feature
+ mean_abs_shap = np.abs(shap_values).mean(axis=0)
+
+ # Create DataFrame
+ shap_df = pd.DataFrame(
+ {"feature": X_sample.columns, "mean_abs_shap": mean_abs_shap}
+ ).sort_values("mean_abs_shap", ascending=False)
+
+ # Create SHAP explanation object for plotting
+ explanation = shap.Explanation(
+ values=shap_values,
+ base_values=explainer.expected_value,
+ data=X_shap.values,
+ feature_names=X_sample.columns.tolist(),
+ )
+
+ return shap_df, explanation
+
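The SHAP ranking above reduces to a mean of absolute attributions per column: signed contributions cancel across games, magnitudes do not. With a toy attribution matrix:

```python
import numpy as np

# Rows are sampled games, columns are features (toy values)
shap_values = np.array([[1.0, -2.0], [-3.0, 0.5]])

# Mean |SHAP| per feature: large opposing pushes still register as important
mean_abs_shap = np.abs(shap_values).mean(axis=0)
```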
+
+def categorize_features(importance_df: pd.DataFrame) -> dict[str, list[str]]:
+ """Categorize features by KenPom stat type.
+
+ Args:
+ importance_df: DataFrame with feature names and importances
+
+ Returns:
+ Dictionary mapping category names to feature lists
+ """
+ categories: dict[str, list[str]] = {
+ "Efficiency Metrics (AdjEM, AdjO, AdjD)": [],
+ "Four Factors (eFG%, TO%, OR%, FT Rate)": [],
+ "Tempo & Pace": [],
+ "Schedule & Luck": [],
+ "Expected Points & Totals": [],
+ "Betting Lines": [],
+ "Other": [],
+ }
+
+ for feature in importance_df["feature"]:
+ if "adj_em" in feature or "adj_o" in feature or "adj_d" in feature or "pythag" in feature:
+ categories["Efficiency Metrics (AdjEM, AdjO, AdjD)"].append(feature)
+ elif any(x in feature for x in ["efg_pct", "to_pct", "or_pct", "ft_rate", "defg", "dto"]):
+ categories["Four Factors (eFG%, TO%, OR%, FT Rate)"].append(feature)
+ elif "tempo" in feature or "adj_t" in feature:
+ categories["Tempo & Pace"].append(feature)
+ elif "sos" in feature or "luck" in feature:
+ categories["Schedule & Luck"].append(feature)
+ elif "expected" in feature or "total_offense" in feature:
+ categories["Expected Points & Totals"].append(feature)
+ elif any(x in feature for x in ["opening", "closing", "movement"]):
+ categories["Betting Lines"].append(feature)
+ else:
+ categories["Other"].append(feature)
+
+ return categories
+
+
+def print_top_features(importance_df: pd.DataFrame, n: int = 20) -> None:
+ """Print the top N most important features.
+
+ Args:
+ importance_df: DataFrame with feature names and importances
+ n: Number of top features to print
+ """
+ logger.info(f"\n{'=' * 80}")
+ logger.info(f"TOP {n} MOST IMPORTANT FEATURES")
+ logger.info("=" * 80)
+
+ total_importance = importance_df["importance_gain"].sum()
+
+ for idx, row in importance_df.head(n).iterrows():
+ feature = row["feature"]
+ importance = row["importance_gain"]
+ pct = (importance / total_importance) * 100
+
+ logger.info(f"{idx + 1:2d}. {feature:40s} {importance:12.2f} ({pct:5.2f}%)")
+
+
+def print_category_summary(importance_df: pd.DataFrame, categories: dict[str, list[str]]) -> None:
+ """Print importance summary by feature category.
+
+ Args:
+ importance_df: DataFrame with feature names and importances
+ categories: Dictionary mapping category names to feature lists
+ """
+ logger.info(f"\n{'=' * 80}")
+ logger.info("FEATURE IMPORTANCE BY CATEGORY")
+ logger.info("=" * 80)
+
+ total_importance = importance_df["importance_gain"].sum()
+
+ # Calculate total importance per category
+ category_scores = {}
+ for category, features in categories.items():
+ if not features:
+ continue
+
+ category_importance = importance_df[importance_df["feature"].isin(features)][
+ "importance_gain"
+ ].sum()
+
+ category_scores[category] = category_importance
+
+ # Sort categories by importance
+ sorted_categories = sorted(category_scores.items(), key=lambda x: x[1], reverse=True)
+
+ for category, importance in sorted_categories:
+ pct = (importance / total_importance) * 100
+ feature_count = len(categories[category])
+ logger.info(f"\n{category} ({feature_count} features): {importance:12.2f} ({pct:5.2f}%)")
+
+ # Show top 3 features in this category
+ category_features = importance_df[importance_df["feature"].isin(categories[category])].head(
+ 3
+ )
+
+ for _, row in category_features.iterrows():
+ feature = row["feature"]
+ feat_importance = row["importance_gain"]
+ feat_pct = (feat_importance / total_importance) * 100
+ logger.info(f" - {feature:38s} {feat_importance:10.2f} ({feat_pct:5.2f}%)")
+
+
+def load_sample_data() -> pd.DataFrame:
+ """Load sample data for SHAP analysis from training script.
+
+ Returns:
+ DataFrame with features (no targets)
+ """
+ from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+ logger.info("Loading sample data for SHAP analysis...")
+
+ staging_path = Path("data/staging")
+ engineer = FeatureEngineer(staging_path=str(staging_path))
+
+ # Load recent data (last 60 days)
+ from datetime import datetime, timedelta
+
+ end_date = datetime.now().date()
+ start_date = end_date - timedelta(days=60)
+
+ merged = engineer.load_staging_data(
+ start_date, end_date, season=2026, require_line_features=False, use_home_away=True
+ )
+
+ # Build features (same as train_score_models.py)
+ X = pd.DataFrame()
+
+ # Home team features
+ X["home_adj_em"] = merged["adj_em_home"]
+ X["home_pythag"] = merged["pythag_home"]
+ X["home_adj_o"] = merged["adj_o_home"]
+ X["home_adj_d"] = merged["adj_d_home"]
+ X["home_adj_t"] = merged["adj_t_home"]
+ X["home_luck"] = merged["luck_home"]
+ X["home_sos"] = merged["sos_home"]
+ X["home_efg_pct"] = merged["efg_pct_home"]
+ X["home_to_pct"] = merged["to_pct_home"]
+ X["home_or_pct"] = merged["or_pct_home"]
+ X["home_ft_rate"] = merged["ft_rate_home"]
+
+ # Away team features
+ X["away_adj_em"] = merged["adj_em_away"]
+ X["away_pythag"] = merged["pythag_away"]
+ X["away_adj_o"] = merged["adj_o_away"]
+ X["away_adj_d"] = merged["adj_d_away"]
+ X["away_adj_t"] = merged["adj_t_away"]
+ X["away_luck"] = merged["luck_away"]
+ X["away_sos"] = merged["sos_away"]
+ X["away_efg_pct"] = merged["efg_pct_away"]
+ X["away_to_pct"] = merged["to_pct_away"]
+ X["away_or_pct"] = merged["or_pct_away"]
+ X["away_ft_rate"] = merged["ft_rate_away"]
+
+ # Combined features
+ X["total_offense"] = X["home_adj_o"] + X["away_adj_o"]
+ X["avg_tempo"] = (X["home_adj_t"] + X["away_adj_t"]) / 2
+ X["avg_luck"] = (X["home_luck"] + X["away_luck"]) / 2
+ X["home_expected_pts"] = (X["home_adj_o"] * X["away_adj_d"] / 100) * (X["home_adj_t"] / 100)
+ X["away_expected_pts"] = (X["away_adj_o"] * X["home_adj_d"] / 100) * (X["away_adj_t"] / 100)
+ X["expected_total"] = X["home_expected_pts"] + X["away_expected_pts"]
+
+ # Add line features if available
+ if {"opening_total", "closing_total"}.issubset(merged.columns):
+ X["opening_total"] = merged["opening_total"]
+ X["closing_total"] = merged["closing_total"]
+ X["total_movement"] = merged["closing_total"] - merged["opening_total"]
+
+ # Drop rows with missing values
+ X = X.dropna()
+
+ logger.info(f"Loaded {len(X)} games for SHAP analysis")
+
+ return X
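The expected-points features built above combine three adjusted ratings; a quick worked check (with made-up illustrative ratings, not real team values) shows the arithmetic:

```python
# Hypothetical KenPom-style ratings (illustrative values only):
home_adj_o = 112.0  # home offense: points scored per 100 possessions
away_adj_d = 98.0   # away defense: points allowed per 100 possessions
home_adj_t = 68.0   # home tempo: possessions per 40 minutes

# Matchup efficiency: offense scaled by how permissive the defense is,
# relative to the 100-rating baseline.
matchup_efficiency = home_adj_o * away_adj_d / 100  # 109.76 pts / 100 poss

# Scale by expected possessions to get expected points.
home_expected_pts = matchup_efficiency * (home_adj_t / 100)
print(round(home_expected_pts, 2))  # 74.64
```

The same form with home and away swapped gives `away_expected_pts`; their sum is the `expected_total` feature.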
+
+
+def create_comparison_plot(
+ home_importance: pd.DataFrame, away_importance: pd.DataFrame, output_path: Path
+) -> None:
+ """Create comparison plot of home vs away feature importance.
+
+ Args:
+ home_importance: Home model importance DataFrame
+ away_importance: Away model importance DataFrame
+ output_path: Path to save the plot
+ """
+ # Merge home and away importance
+ comparison = home_importance.merge(away_importance, on="feature", suffixes=("_home", "_away"))
+
+ # Take top 15 features by total importance
+ comparison["total_importance"] = (
+ comparison["importance_gain_home"] + comparison["importance_gain_away"]
+ )
+ top_features = comparison.nlargest(15, "total_importance")
+
+ # Create plot
+ fig, ax = plt.subplots(figsize=(12, 8))
+
+ x = np.arange(len(top_features))
+ width = 0.35
+
+ ax.barh(
+ x - width / 2,
+ top_features["importance_gain_home"],
+ width,
+ label="Home Score Model",
+ color="#2E86AB",
+ )
+ ax.barh(
+ x + width / 2,
+ top_features["importance_gain_away"],
+ width,
+ label="Away Score Model",
+ color="#A23B72",
+ )
+
+ ax.set_yticks(x)
+ ax.set_yticklabels(top_features["feature"])
+ ax.set_xlabel("Feature Importance (Gain)", fontsize=12)
+ ax.set_title("Top 15 Features: Home vs Away Score Prediction", fontsize=14, fontweight="bold")
+ ax.legend(loc="lower right")
+ ax.grid(axis="x", alpha=0.3)
+
+ plt.tight_layout()
+ plt.savefig(output_path, dpi=300, bbox_inches="tight")
+ logger.info(f"Comparison plot saved to: {output_path}")
+
+
+def main() -> None:
+ """Analyze feature importance for home and away score models."""
+ configure_logging()
+
+ models_dir = Path("models")
+ output_dir = Path("reports")
+ output_dir.mkdir(exist_ok=True)
+
+ # Load models
+ home_model = load_model(models_dir / "home_score_2026.pkl")
+ away_model = load_model(models_dir / "away_score_2026.pkl")
+
+ # Load sample data for SHAP analysis
+ X_sample = load_sample_data()
+
+ # Determine actual features (may vary if line features missing)
+ actual_features = X_sample.columns.tolist()
+
+ logger.info(f"\nAnalyzing {len(actual_features)} features:")
+ for i, feat in enumerate(actual_features, 1):
+ logger.info(f" {i:2d}. {feat}")
+
+ # ============================================================================
+ # HOME SCORE MODEL ANALYSIS
+ # ============================================================================
+ logger.info("\n" + "=" * 80)
+ logger.info("HOME SCORE MODEL - BUILT-IN IMPORTANCE")
+ logger.info("=" * 80)
+
+ home_builtin = analyze_builtin_importance(home_model, "Home Score", actual_features)
+ print_top_features(home_builtin, n=15)
+
+ home_categories = categorize_features(home_builtin)
+ print_category_summary(home_builtin, home_categories)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("HOME SCORE MODEL - SHAP ANALYSIS")
+ logger.info("=" * 80)
+
+ try:
+ home_shap_df, home_shap_exp = analyze_shap_importance(home_model, X_sample, "Home Score")
+ print_top_features(home_shap_df.rename(columns={"mean_abs_shap": "importance_gain"}), n=15)
+ except Exception as e:
+ logger.warning(f"SHAP analysis failed (likely compatibility issue): {e}")
+ logger.warning("Continuing without SHAP analysis...")
+ home_shap_df, home_shap_exp = None, None
+
+ # ============================================================================
+ # AWAY SCORE MODEL ANALYSIS
+ # ============================================================================
+ logger.info("\n\n" + "=" * 80)
+ logger.info("AWAY SCORE MODEL - BUILT-IN IMPORTANCE")
+ logger.info("=" * 80)
+
+ away_builtin = analyze_builtin_importance(away_model, "Away Score", actual_features)
+ print_top_features(away_builtin, n=15)
+
+ away_categories = categorize_features(away_builtin)
+ print_category_summary(away_builtin, away_categories)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("AWAY SCORE MODEL - SHAP ANALYSIS")
+ logger.info("=" * 80)
+
+ try:
+ away_shap_df, away_shap_exp = analyze_shap_importance(away_model, X_sample, "Away Score")
+ print_top_features(away_shap_df.rename(columns={"mean_abs_shap": "importance_gain"}), n=15)
+ except Exception as e:
+ logger.warning(f"SHAP analysis failed (likely compatibility issue): {e}")
+ logger.warning("Continuing without SHAP analysis...")
+ away_shap_df, away_shap_exp = None, None
+
+ # ============================================================================
+ # SAVE RESULTS
+ # ============================================================================
+ logger.info("\n" + "=" * 80)
+ logger.info("SAVING RESULTS")
+ logger.info("=" * 80)
+
+ # Save CSV reports
+ home_builtin.to_csv(output_dir / "home_score_builtin_importance.csv", index=False)
+ away_builtin.to_csv(output_dir / "away_score_builtin_importance.csv", index=False)
+
+ if home_shap_df is not None:
+ home_shap_df.to_csv(output_dir / "home_score_shap_importance.csv", index=False)
+ if away_shap_df is not None:
+ away_shap_df.to_csv(output_dir / "away_score_shap_importance.csv", index=False)
+
+ # Create comparison plot
+ create_comparison_plot(home_builtin, away_builtin, output_dir / "importance_comparison.png")
+
+ # Create SHAP summary plots (if available)
+ if home_shap_exp is not None:
+ plt.figure(figsize=(10, 8))
+ shap.summary_plot(home_shap_exp, show=False, max_display=15)
+ plt.title("Home Score Model - SHAP Feature Importance", fontsize=14, fontweight="bold")
+ plt.tight_layout()
+ plt.savefig(output_dir / "home_shap_summary.png", dpi=300, bbox_inches="tight")
+ logger.info(f"Home SHAP summary saved to: {output_dir / 'home_shap_summary.png'}")
+
+ if away_shap_exp is not None:
+ plt.figure(figsize=(10, 8))
+ shap.summary_plot(away_shap_exp, show=False, max_display=15)
+ plt.title("Away Score Model - SHAP Feature Importance", fontsize=14, fontweight="bold")
+ plt.tight_layout()
+ plt.savefig(output_dir / "away_shap_summary.png", dpi=300, bbox_inches="tight")
+ logger.info(f"Away SHAP summary saved to: {output_dir / 'away_shap_summary.png'}")
+
+ # KEY INSIGHTS
+ logger.info("\n" + "=" * 80)
+ logger.info("KEY INSIGHTS - WHAT PREDICTS WINNERS?")
+ logger.info("=" * 80)
+
+ # Top 5 features for each model
+ top5_home = home_builtin.head(5)["feature"].tolist()
+ top5_away = away_builtin.head(5)["feature"].tolist()
+
+ logger.info("\nTop 5 KenPom Stats for Predicting HOME Team Score:")
+ for i, feat in enumerate(top5_home, 1):
+ logger.info(f" {i}. {feat}")
+
+ logger.info("\nTop 5 KenPom Stats for Predicting AWAY Team Score:")
+ for i, feat in enumerate(top5_away, 1):
+ logger.info(f" {i}. {feat}")
+
+ # Category importance for home model
+ logger.info("\nMost Important Stat Categories (Home Model):")
+ total_importance = home_builtin["importance_gain"].sum()
+ for category, features in home_categories.items():
+ if not features:
+ continue
+ cat_importance = home_builtin[home_builtin["feature"].isin(features)][
+ "importance_gain"
+ ].sum()
+ pct = (cat_importance / total_importance) * 100
+ logger.info(f" - {category}: {pct:.1f}%")
+
+ logger.info("\n[OK] Feature importance analysis complete!")
+ logger.info(f"\nAll results saved to: {output_dir.absolute()}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/analysis/analyze_kenpom_feature_importance.py b/scripts/analysis/analyze_kenpom_feature_importance.py
new file mode 100644
index 000000000..333f9db2d
--- /dev/null
+++ b/scripts/analysis/analyze_kenpom_feature_importance.py
@@ -0,0 +1,325 @@
+"""Analyze KenPom feature importance for predicting game outcomes.
+
+Evaluates which KenPom metrics are most predictive of:
+- Game winners
+- Spread coverage
+- Over/under totals
+- Actual scores
+
+Uses XGBoost feature importance to rank metrics.
+
+Usage:
+ python scripts/analysis/analyze_kenpom_feature_importance.py
+ python scripts/analysis/analyze_kenpom_feature_importance.py --min-games 100
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+import xgboost as xgb
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+def load_kenpom_team_data() -> pd.DataFrame:
+ """Load KenPom team ratings from staging.
+
+ Returns:
+ DataFrame with team ratings
+ """
+ logger.info("Loading KenPom team ratings...")
+
+ team_ratings = read_parquet_df("data/staging/team_ratings.parquet")
+ logger.info(f" Loaded {len(team_ratings)} team records")
+
+ return team_ratings
+
+
+def build_feature_dataset(db: OddsAPIDatabase, team_ratings: pd.DataFrame) -> pd.DataFrame:
+ """Build dataset with KenPom features and game outcomes.
+
+ Args:
+ db: Database with game results
+ team_ratings: KenPom team ratings
+
+ Returns:
+ DataFrame with features and labels
+ """
+ logger.info("Building feature dataset...")
+
+ # Get completed games with scores
+ query = """
+ SELECT
+ e.event_id,
+ e.home_team,
+ e.away_team,
+ s.home_score,
+ s.away_score,
+ DATE(e.commence_time) as game_date
+ FROM events e
+ INNER JOIN scores s ON e.event_id = s.event_id
+ WHERE s.home_score IS NOT NULL
+ AND s.away_score IS NOT NULL
+ AND s.completed = 1
+ ORDER BY e.commence_time
+ """
+
+ games_df = pd.read_sql_query(query, db.conn)
+ logger.info(f" Found {len(games_df)} completed games")
+
+ # Merge with KenPom ratings
+ games_df = games_df.merge(
+ team_ratings,
+ left_on="home_team",
+ right_on="odds_api_name",
+ how="left",
+ suffixes=("", "_home"),
+ )
+
+ games_df = games_df.merge(
+ team_ratings,
+ left_on="away_team",
+ right_on="odds_api_name",
+ how="left",
+ suffixes=("_home", "_away"),
+ )
+
+ # Drop games without KenPom data
+ before_count = len(games_df)
+ games_df = games_df.dropna(subset=["adj_em_home", "adj_em_away"])
+ logger.info(f" Dropped {before_count - len(games_df)} games without KenPom data")
+
+ # Calculate outcomes
+ games_df["actual_home_score"] = games_df["home_score"]
+ games_df["actual_away_score"] = games_df["away_score"]
+ games_df["actual_spread"] = games_df["home_score"] - games_df["away_score"]
+ games_df["actual_total"] = games_df["home_score"] + games_df["away_score"]
+ games_df["home_won"] = (games_df["actual_spread"] > 0).astype(int)
+
+ # Calculate differential features
+ games_df["adj_em_diff"] = games_df["adj_em_home"] - games_df["adj_em_away"]
+ games_df["pythag_diff"] = games_df["pythag_home"] - games_df["pythag_away"]
+ games_df["adj_o_diff"] = games_df["adj_o_home"] - games_df["adj_o_away"]
+ games_df["adj_d_diff"] = games_df["adj_d_home"] - games_df["adj_d_away"]
+ games_df["adj_t_diff"] = games_df["adj_t_home"] - games_df["adj_t_away"]
+ games_df["sos_diff"] = games_df["sos_home"] - games_df["sos_away"]
+ games_df["luck_diff"] = games_df["luck_home"] - games_df["luck_away"]
+
+ # Four Factors differentials
+ games_df["efg_pct_diff"] = games_df["efg_pct_home"] - games_df["efg_pct_away"]
+ games_df["to_pct_diff"] = games_df["to_pct_home"] - games_df["to_pct_away"]
+ games_df["or_pct_diff"] = games_df["or_pct_home"] - games_df["or_pct_away"]
+ games_df["ft_rate_diff"] = games_df["ft_rate_home"] - games_df["ft_rate_away"]
+
+ # Combined metrics
+ games_df["total_offense"] = games_df["adj_o_home"] + games_df["adj_o_away"]
+ games_df["avg_tempo"] = (games_df["adj_t_home"] + games_df["adj_t_away"]) / 2
+ games_df["avg_defense"] = (games_df["adj_d_home"] + games_df["adj_d_away"]) / 2
+
+ logger.info(f" Built dataset with {len(games_df)} games")
+
+ return games_df
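The two-step merge above leans on pandas suffix semantics: the first merge brings the ratings columns in unsuffixed (nothing collides yet), and only the second merge's `suffixes=("_home", "_away")` renames the colliding pairs. A minimal sketch with toy columns (hypothetical names, not the real schema):

```python
import pandas as pd

games = pd.DataFrame({"home_team": ["A"], "away_team": ["B"]})
ratings = pd.DataFrame({"name": ["A", "B"], "adj_em": [10.0, -4.0]})

# First merge: no column collision, so 'adj_em' arrives unsuffixed.
m = games.merge(ratings, left_on="home_team", right_on="name", how="left")

# Second merge: 'adj_em' now collides; the existing (home) copy gets
# '_home' and the incoming (away) copy gets '_away'.
m = m.merge(
    ratings,
    left_on="away_team",
    right_on="name",
    how="left",
    suffixes=("_home", "_away"),
)
print(m[["adj_em_home", "adj_em_away"]])
```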
+
+
+def analyze_feature_importance(df: pd.DataFrame, target: str, feature_type: str) -> pd.DataFrame:
+ """Analyze feature importance for a specific target.
+
+ Args:
+ df: Dataset with features
+ target: Target column name
+ feature_type: Human-readable label for this analysis (used in logging)
+
+ Returns:
+ DataFrame with feature importance rankings
+ """
+ logger.info(f"\nAnalyzing feature importance for {target} ({feature_type})...")
+
+ # Define feature sets
+ all_kenpom_features = [
+ # Efficiency metrics
+ "adj_em_home",
+ "adj_em_away",
+ "adj_em_diff",
+ "pythag_home",
+ "pythag_away",
+ "pythag_diff",
+ # Offense/Defense
+ "adj_o_home",
+ "adj_o_away",
+ "adj_o_diff",
+ "adj_d_home",
+ "adj_d_away",
+ "adj_d_diff",
+ # Tempo
+ "adj_t_home",
+ "adj_t_away",
+ "adj_t_diff",
+ "avg_tempo",
+ # Strength of Schedule
+ "sos_home",
+ "sos_away",
+ "sos_diff",
+ # Luck
+ "luck_home",
+ "luck_away",
+ "luck_diff",
+ # Four Factors
+ "efg_pct_home",
+ "efg_pct_away",
+ "efg_pct_diff",
+ "to_pct_home",
+ "to_pct_away",
+ "to_pct_diff",
+ "or_pct_home",
+ "or_pct_away",
+ "or_pct_diff",
+ "ft_rate_home",
+ "ft_rate_away",
+ "ft_rate_diff",
+ # Combined
+ "total_offense",
+ "avg_defense",
+ ]
+
+ # Select features available in dataset
+ features = [f for f in all_kenpom_features if f in df.columns]
+ logger.info(f" Using {len(features)} features")
+
+ # Prepare data
+ X = df[features].fillna(df[features].median())
+ y = df[target]
+
+ # Split data
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+
+ # Train model
+ if target in ["home_won"]:
+ # Classification
+ model = xgb.XGBClassifier(
+ n_estimators=100,
+ max_depth=6,
+ learning_rate=0.1,
+ random_state=42,
+ )
+ model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
+ score = model.score(X_test, y_test)
+ metric_name = "Accuracy"
+ else:
+ # Regression
+ model = xgb.XGBRegressor(
+ n_estimators=100,
+ max_depth=6,
+ learning_rate=0.1,
+ random_state=42,
+ )
+ model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
+ from sklearn.metrics import mean_absolute_error
+
+ score = mean_absolute_error(y_test, model.predict(X_test))
+ metric_name = "MAE"
+
+ logger.info(f" Test {metric_name}: {score:.4f}")
+
+ # Get feature importance
+ importance_dict = model.get_booster().get_score(importance_type="gain")
+ # get_score may key gains by column name (models fit on a DataFrame) or by
+ # positional "f0", "f1", ... names -- look up by name, then fall back.
+ importance_values = [
+ importance_dict.get(f, importance_dict.get(f"f{i}", 0.0))
+ for i, f in enumerate(features)
+ ]
+
+ importance_df = pd.DataFrame({"feature": features, "importance": importance_values})
+ importance_df = importance_df.sort_values("importance", ascending=False)
+ importance_df["importance_pct"] = (
+ 100 * importance_df["importance"] / importance_df["importance"].sum()
+ )
+
+ # Show top features
+ logger.info("\n Top 10 Features:")
+ for _, row in importance_df.head(10).iterrows():
+ logger.info(f" {row['feature']:30s} {row['importance_pct']:6.2f}%")
+
+ return importance_df
+
+
+def main() -> None:
+ """Entry point."""
+ parser = argparse.ArgumentParser(description="Analyze KenPom feature importance")
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+ parser.add_argument(
+ "--min-games",
+ type=int,
+ default=50,
+ help="Minimum games required for analysis",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=Path,
+ default=Path("analysis"),
+ help="Output directory for results",
+ )
+
+ args = parser.parse_args()
+
+ logger.info("[OK] === KenPom Feature Importance Analysis ===\n")
+
+ # Load data
+ db = OddsAPIDatabase(args.db_path)
+ team_ratings = load_kenpom_team_data()
+ df = build_feature_dataset(db, team_ratings)
+
+ if len(df) < args.min_games:
+ logger.error(f"Insufficient data: {len(df)} games (min: {args.min_games})")
+ return
+
+ # Analyze different targets
+ targets = {
+ "home_won": "Win Prediction",
+ "actual_spread": "Spread Prediction",
+ "actual_total": "Total Prediction",
+ "actual_home_score": "Home Score Prediction",
+ "actual_away_score": "Away Score Prediction",
+ }
+
+ all_results = {}
+ for target, description in targets.items():
+ logger.info(f"\n{'=' * 60}")
+ logger.info(f"{description}")
+ logger.info("=" * 60)
+
+ importance_df = analyze_feature_importance(df, target, description)
+ all_results[target] = importance_df
+
+ # Save individual results
+ output_file = args.output_dir / f"kenpom_importance_{target}.csv"
+ args.output_dir.mkdir(parents=True, exist_ok=True)
+ importance_df.to_csv(output_file, index=False)
+ logger.info(f"\n Saved to {output_file}")
+
+ # Create summary report
+ logger.info("\n" + "=" * 60)
+ logger.info("SUMMARY: Top 5 Features by Target")
+ logger.info("=" * 60)
+
+ for target, description in targets.items():
+ logger.info(f"\n{description}:")
+ top5 = all_results[target].head(5)
+ for _, row in top5.iterrows():
+ logger.info(f" {row['feature']:30s} {row['importance_pct']:6.2f}%")
+
+ logger.info("\n[OK] Analysis complete!")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/analysis/analyze_kenpom_vs_odds.py b/scripts/analysis/analyze_kenpom_vs_odds.py
new file mode 100644
index 000000000..fcc6315de
--- /dev/null
+++ b/scripts/analysis/analyze_kenpom_vs_odds.py
@@ -0,0 +1,359 @@
+#!/usr/bin/env python3
+"""Analyze KenPom ratings vs Overtime odds to identify betting edges.
+
+Loads KenPom efficiency data and Overtime betting lines. Compares KenPom win
+probability (P(win game)) to market win probability derived from the spread
+magnitude (spread_to_win_prob). Both sides are win probabilities; do not use
+spread-cover implied prob (from juice), which would be invalid to compare.
+
+Output is written to data/analysis/ by default:
+ data/analysis/analysis_YYYY-MM-DD.csv
+
+Usage:
+ uv run python scripts/analysis/analyze_kenpom_vs_odds.py
+ uv run python scripts/analysis/analyze_kenpom_vs_odds.py --min-edge 3.0
+ uv run python scripts/analysis/analyze_kenpom_vs_odds.py --output \\
+ data/analysis/analysis_2026-01-31_calibrated.csv
+"""
+
+from __future__ import annotations
+
+import argparse
+from datetime import date
+from pathlib import Path
+
+import pyarrow.parquet as pq
+
+from sports_betting_edge.utils.team_matching import match_to_kenpom
+
+
+def american_to_implied_prob(odds: int) -> float:
+ """Convert American odds to implied probability.
+
+ Args:
+ odds: American odds (e.g., -110, +150)
+
+ Returns:
+ Implied probability as decimal (0-1)
+ """
+ if odds < 0:
+ return abs(odds) / (abs(odds) + 100)
+ else:
+ return 100 / (odds + 100)
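As a sanity check on the conversion above: at standard -110/-110 juice the two implied probabilities sum to more than 1, and the excess is the book's vig. The de-vig normalization shown here is not part of the script; it is a common follow-up step:

```python
def american_to_implied_prob(odds: int) -> float:
    # Same conversion as in the script.
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)

p_home = american_to_implied_prob(-110)
p_away = american_to_implied_prob(-110)
print(round(p_home, 4))           # 0.5238 -- each side above 50%
print(round(p_home + p_away, 4))  # 1.0476 -- overround (~4.76% vig)

# Remove the vig by normalizing so both sides sum to 1.
p_home_fair = p_home / (p_home + p_away)
print(p_home_fair)  # 0.5
```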
+
+
+def spread_to_win_prob(spread: float, is_favorite: bool) -> float:
+ """Rough conversion of spread to win probability using historical data.
+
+ Uses empirical relationship: P(win) ≈ 0.5 + (spread / 25) for favorites,
+ clamped to [0, 1] so spreads larger than 12.5 cannot yield invalid
+ probabilities.
+
+ Args:
+ spread: Spread magnitude (always positive)
+ is_favorite: True if team is favorite
+
+ Returns:
+ Estimated win probability (0-1)
+ """
+ if is_favorite:
+ prob = 0.50 + (spread / 25)
+ else:
+ prob = 0.50 - (spread / 25)
+ return min(max(prob, 0.0), 1.0)
+
+
+def kenpom_win_probability(team_rating: float, opp_rating: float) -> float:
+ """Calculate win probability from KenPom efficiency ratings.
+
+ Uses log5 method based on efficiency margin.
+
+ Args:
+ team_rating: Team's adjusted efficiency margin (AdjEM)
+ opp_rating: Opponent's adjusted efficiency margin (AdjEM)
+
+ Returns:
+ Win probability (0-1)
+ """
+ # Convert efficiency margins to win probability via pythagorean expectation
+ # Simplified: higher efficiency margin = better team
+ diff = team_rating - opp_rating
+
+ # Log5 approximation: P = 1 / (1 + 10^(-diff/19.5))
+ # Constant 19.5 calibrated from KenPom FanMatch data (MAE: 9.32%)
+ win_prob = 1 / (1 + 10 ** (-diff / 19.5))
+ return win_prob
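Plugging numbers into the logistic form above (calibration constant 19.5 taken from the script's own comment): equal ratings give a coin flip, and a +10 efficiency-margin gap maps to roughly 76-77%:

```python
def kenpom_win_probability(team_rating: float, opp_rating: float) -> float:
    # Same logistic form as the script: P = 1 / (1 + 10^(-diff / 19.5)).
    diff = team_rating - opp_rating
    return 1 / (1 + 10 ** (-diff / 19.5))

# Evenly matched teams: coin flip.
print(kenpom_win_probability(0.0, 0.0))  # 0.5

# +10 efficiency-margin edge (e.g. AdjEM +15 vs +5): roughly 0.765.
print(round(kenpom_win_probability(15.0, 5.0), 3))
```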
+
+
+def load_kenpom_ratings(kenpom_dir: Path, date_str: str) -> dict[str, dict[str, float]]:
+ """Load KenPom ratings for specified date.
+
+ Args:
+ kenpom_dir: Directory containing KenPom parquet files
+ date_str: Date string (YYYY-MM-DD)
+
+ Returns:
+ Dict mapping team name to ratings: {team: {AdjEM, AdjOE, AdjDE}}
+ """
+ parquet_path = kenpom_dir / "ratings" / f"{date_str}.parquet"
+ if not parquet_path.exists():
+ raise FileNotFoundError(f"KenPom ratings not found: {parquet_path}")
+
+ table = pq.read_table(parquet_path)
+ ratings = {}
+
+ for row in table.to_pylist():
+ ratings[row["TeamName"]] = {
+ "AdjEM": row["AdjEM"],
+ "AdjOE": row["AdjOE"],
+ "AdjDE": row["AdjDE"],
+ }
+
+ return ratings
+
+
+def load_overtime_odds(overtime_dir: Path, date_str: str) -> list[dict]:
+ """Load Overtime odds for specified date.
+
+ Args:
+ overtime_dir: Directory containing Overtime parquet files
+ date_str: Date string (YYYY-MM-DD)
+
+ Returns:
+ List of game line dicts
+ """
+ parquet_path = overtime_dir / f"{date_str}.parquet"
+ if not parquet_path.exists():
+ raise FileNotFoundError(f"Overtime odds not found: {parquet_path}")
+
+ table = pq.read_table(parquet_path)
+ return table.to_pylist()
+
+
+def fuzzy_match_team(team_name: str, kenpom_ratings: dict[str, dict]) -> str | None:
+ """Match Overtime team name to KenPom team name.
+
+ Uses manual mappings and fuzzy matching via team_matching utility.
+
+ Args:
+ team_name: Team name from Overtime (e.g., "Texas Tech", "Massachusetts")
+ kenpom_ratings: Dict of KenPom ratings keyed by team name
+
+ Returns:
+ Matched KenPom team name or None
+ """
+ kenpom_teams = list(kenpom_ratings.keys())
+ return match_to_kenpom(team_name, kenpom_teams, threshold=0.85)
+
+
+def analyze_edges(
+ kenpom_ratings: dict[str, dict[str, float]],
+ overtime_odds: list[dict],
+ min_edge: float = 2.0,
+) -> list[dict]:
+ """Analyze betting edges by comparing KenPom vs market odds.
+
+ Args:
+ kenpom_ratings: Dict of KenPom ratings keyed by team name
+ overtime_odds: List of game line dicts
+ min_edge: Minimum edge threshold (%) to report
+
+ Returns:
+ List of dicts with edge analysis, sorted by best_edge descending
+ """
+ results = []
+
+ for game in overtime_odds:
+ away_team = game["away_team"]
+ home_team = game["home_team"]
+
+ # Match teams to KenPom
+ away_kenpom = fuzzy_match_team(away_team, kenpom_ratings)
+ home_kenpom = fuzzy_match_team(home_team, kenpom_ratings)
+
+ if not away_kenpom or not home_kenpom:
+ continue
+
+ # Get KenPom ratings
+ away_rating = kenpom_ratings[away_kenpom]["AdjEM"]
+ home_rating = kenpom_ratings[home_kenpom]["AdjEM"]
+
+ # Calculate KenPom win probabilities (home court advantage ~3.5 points)
+ hca_adjustment = 3.5 # Home court advantage in efficiency points
+ away_kenpom_prob = kenpom_win_probability(away_rating, home_rating + hca_adjustment)
+ home_kenpom_prob = kenpom_win_probability(home_rating + hca_adjustment, away_rating)
+
+ # Get spread and derive market *win* probability (not cover probability)
+ spread_mag = game.get("spread_magnitude")
+ favorite_team = game.get("favorite_team")
+
+ if spread_mag is not None and favorite_team:
+ is_away_fav = favorite_team == away_team
+
+ # Market win probability from spread magnitude (P(win), not P(cover)).
+ # Spread prices (juice) imply P(cover) ~50% each; comparing those to
+ # KenPom P(win) would be invalid (e.g. 17-pt dog has ~5% P(win) but ~50% P(cover)).
+ away_market_prob = spread_to_win_prob(spread_mag, is_away_fav)
+ home_market_prob = spread_to_win_prob(spread_mag, not is_away_fav)
+
+ # Calculate edges (KenPom win prob - market win prob from spread)
+ away_edge = (away_kenpom_prob - away_market_prob) * 100
+ home_edge = (home_kenpom_prob - home_market_prob) * 100
+
+ # Report games with significant edges
+ if abs(away_edge) >= min_edge or abs(home_edge) >= min_edge:
+ results.append(
+ {
+ "away_team": away_team,
+ "home_team": home_team,
+ "game_time": game.get("game_time_str", ""),
+ "spread": f"{'-' if is_away_fav else '+'}{spread_mag}",
+ "favorite": favorite_team,
+ "away_kenpom_rating": round(away_rating, 2),
+ "home_kenpom_rating": round(home_rating, 2),
+ "away_kenpom_prob": round(away_kenpom_prob * 100, 1),
+ "home_kenpom_prob": round(home_kenpom_prob * 100, 1),
+ "away_market_prob": round(away_market_prob * 100, 1),
+ "home_market_prob": round(home_market_prob * 100, 1),
+ "away_edge": round(away_edge, 1),
+ "home_edge": round(home_edge, 1),
+ "best_bet": away_team if away_edge > home_edge else home_team,
+ "best_edge": round(max(away_edge, home_edge), 1),
+ }
+ )
+
+ # Sort by best_edge descending
+ results.sort(key=lambda x: x["best_edge"], reverse=True)
+ return results
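The edge arithmetic inside the loop above can be traced with one hypothetical game (illustrative numbers, not real lines): home favored by 6.5, with KenPom giving the home side 82% to win:

```python
def spread_to_win_prob(spread: float, is_favorite: bool) -> float:
    # The script's linear rule of thumb: P ≈ 0.5 ± spread / 25.
    return 0.50 + (spread / 25 if is_favorite else -spread / 25)

home_market_prob = spread_to_win_prob(6.5, is_favorite=True)  # 0.76
home_kenpom_prob = 0.82  # hypothetical model output

# Edge in percentage points: model win prob minus market win prob.
home_edge = (home_kenpom_prob - home_market_prob) * 100
print(round(home_edge, 1))  # 6.0 -- reported when >= min_edge
```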
+
+
+def format_table(edges: list[dict]) -> str:
+ """Format edges list as a simple table."""
+ if not edges:
+ return ""
+
+ # Define columns and widths
+ cols = [
+ ("away_team", 25),
+ ("home_team", 25),
+ ("game_time", 10),
+ ("spread", 8),
+ ("best_bet", 25),
+ ("best_edge", 10),
+ ("away_edge", 10),
+ ("home_edge", 10),
+ ]
+
+ # Build header
+ header = " ".join(f"{col[0]:{col[1]}}" for col in cols)
+ separator = "-" * len(header)
+
+ # Build rows
+ rows = []
+ for edge in edges:
+ row = " ".join(f"{str(edge.get(col[0], '')):{col[1]}}"[: col[1]] for col in cols)
+ rows.append(row)
+
+ return f"{header}\n{separator}\n" + "\n".join(rows)
+
+
+def main() -> int:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="Analyze KenPom ratings vs Overtime odds for betting edges"
+ )
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=date.today().isoformat(),
+ help="Date to analyze (YYYY-MM-DD, default: today)",
+ )
+ parser.add_argument(
+ "--kenpom-dir",
+ type=Path,
+ default=Path("./data/kenpom"),
+ help="KenPom data directory",
+ )
+ parser.add_argument(
+ "--overtime-dir",
+ type=Path,
+ default=Path("./data/overtime"),
+ help="Overtime data directory",
+ )
+ parser.add_argument(
+ "--min-edge",
+ type=float,
+ default=2.0,
+ help="Minimum edge threshold (percentage points)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=None,
+ help="Output CSV path (default: data/analysis/analysis_<date>.csv)",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=Path,
+ default=Path("./data/analysis"),
+ help="Output directory for analysis CSV (default: data/analysis)",
+ )
+
+ args = parser.parse_args()
+
+ # Default output path when --output not provided
+ output_path = args.output or (args.output_dir / f"analysis_{args.date}.csv")
+
+ try:
+ print(f"\n[OK] Loading data for {args.date}...")
+
+ kenpom_ratings = load_kenpom_ratings(args.kenpom_dir, args.date)
+ print(f"[OK] Loaded {len(kenpom_ratings)} KenPom team ratings")
+
+ overtime_odds = load_overtime_odds(args.overtime_dir, args.date)
+ print(f"[OK] Loaded {len(overtime_odds)} Overtime game lines")
+
+ print(f"\n[OK] Analyzing edges (min threshold: {args.min_edge}%)...\n")
+ edges = analyze_edges(kenpom_ratings, overtime_odds, args.min_edge)
+
+ if not edges:
+ print(f"[WARNING] No games found with edge >= {args.min_edge}%")
+ return 0
+
+ print(f"[OK] Found {len(edges)} games with significant edges:\n")
+ print(format_table(edges))
+
+ # Always write to standardized output path
+ import csv
+
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ with open(output_path, "w", newline="", encoding="utf-8") as f:
+ writer = csv.DictWriter(f, fieldnames=edges[0].keys())
+ writer.writeheader()
+ writer.writerows(edges)
+ print(f"\n[OK] Saved analysis to {output_path}")
+
+ # Summary statistics
+ print("\n--- Summary ---")
+ print(f"Total games analyzed: {len(overtime_odds)}")
+ print(f"Games with edges >= {args.min_edge}%: {len(edges)}")
+ avg_edge = sum(e["best_edge"] for e in edges) / len(edges)
+ max_edge = max(e["best_edge"] for e in edges)
+ print(f"Average edge: {avg_edge:.1f}%")
+ print(f"Max edge: {max_edge:.1f}%")
+
+ return 0
+
+ except FileNotFoundError as e:
+ print(f"[ERROR] {e}")
+ return 1
+ except Exception as e:
+ print(f"[ERROR] Analysis failed: {e}")
+ import traceback
+
+ traceback.print_exc()
+ return 1
+
+
+if __name__ == "__main__":
+ import sys
+
+ sys.exit(main())
diff --git a/scripts/analysis/analyze_missing_scores.py b/scripts/analysis/analyze_missing_scores.py
new file mode 100644
index 000000000..0d2c857eb
--- /dev/null
+++ b/scripts/analysis/analyze_missing_scores.py
@@ -0,0 +1,121 @@
+"""Analyze missing scores to understand gaps in data collection.
+
+Usage:
+ uv run python scripts/analysis/analyze_missing_scores.py
+"""
+
+from __future__ import annotations
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import write_csv
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def analyze_missing_scores(db_path: Path) -> None:
+ """Analyze which games have odds but no scores.
+
+ Args:
+ db_path: Path to Odds API database
+ """
+ db = OddsAPIDatabase(str(db_path))
+
+ # Get events with odds but no scores, that are in the past
+ query = """
+ SELECT
+ e.event_id,
+ e.home_team,
+ e.away_team,
+ e.commence_time,
+ DATE(e.commence_time) as game_date,
+ COUNT(DISTINCT o.book_key) as num_bookmakers,
+ COUNT(DISTINCT o.market_key) as num_markets
+ FROM events e
+ INNER JOIN observations o ON e.event_id = o.event_id
+ LEFT JOIN scores s ON e.event_id = s.event_id
+ WHERE s.event_id IS NULL
+ AND DATE(e.commence_time) < DATE('now')
+ GROUP BY e.event_id, e.home_team, e.away_team, e.commence_time
+ ORDER BY e.commence_time DESC
+ """
+
+ missing_df = pd.read_sql_query(query, db.conn)
+
+ logger.info(f"Total games with odds but no scores: {len(missing_df)}")
+
+ if len(missing_df) == 0:
+ logger.info("[OK] No missing scores!")
+ return
+
+ # Analyze by date
+ logger.info("\n" + "=" * 80)
+ logger.info("MISSING SCORES BY DATE")
+ logger.info("=" * 80)
+
+ by_date = missing_df.groupby("game_date").size().sort_index(ascending=False)
+ logger.info(f"\n{by_date.head(20).to_string()}")
+
+ # Date range
+ logger.info("\n" + "=" * 80)
+ logger.info("DATE RANGE")
+ logger.info("=" * 80)
+ logger.info(f"Earliest missing: {missing_df['game_date'].min()}")
+ logger.info(f"Latest missing: {missing_df['game_date'].max()}")
+ logger.info(f"Unique dates: {missing_df['game_date'].nunique()}")
+
+ # Sample of recent missing games
+ logger.info("\n" + "=" * 80)
+ logger.info("RECENT MISSING GAMES (last 10)")
+ logger.info("=" * 80)
+
+ recent = missing_df.head(10)
+ for _, row in recent.iterrows():
+ logger.info(
+ f"{row['game_date']}: {row['away_team']:30s} @ {row['home_team']:30s} "
+ f"({row['num_bookmakers']} books, {row['num_markets']} markets)"
+ )
+
+ # Export detailed list
+ output_path = Path("data/missing_scores_detail.csv")
+ write_csv(missing_df, output_path, index=False)
+ logger.info(f"\nDetailed list exported to: {output_path}")
+
+ # Check how many scores we DO have
+ scores_query = """
+ SELECT
+ DATE(e.commence_time) as game_date,
+ COUNT(*) as games_with_scores
+ FROM scores s
+ INNER JOIN events e ON s.event_id = e.event_id
+ WHERE s.completed = 1
+ GROUP BY DATE(e.commence_time)
+ ORDER BY game_date DESC
+ """
+
+ scores_df = pd.read_sql_query(scores_query, db.conn)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("SCORES WE DO HAVE (by date)")
+ logger.info("=" * 80)
+ logger.info(f"\n{scores_df.head(20).to_string()}")
+
+
+def main() -> None:
+ """Analyze missing scores."""
+ db_path = Path("data/odds_api/odds_api.sqlite3")
+
+ if not db_path.exists():
+ logger.error(f"Database not found: {db_path}")
+ return
+
+ analyze_missing_scores(db_path)
+
+
+if __name__ == "__main__":
+ main()
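Review note: the missing-scores query in `analyze_missing_scores` uses the classic LEFT JOIN anti-join pattern (`LEFT JOIN scores ... WHERE s.event_id IS NULL`) to find events that have odds but no score row. A minimal sketch of the same pattern against an in-memory SQLite database (table names mirror the script; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT PRIMARY KEY, home_team TEXT, away_team TEXT);
    CREATE TABLE scores (event_id TEXT PRIMARY KEY, home_score INT, away_score INT);
    INSERT INTO events VALUES ('g1', 'Duke', 'UNC'), ('g2', 'Kansas', 'Baylor');
    INSERT INTO scores VALUES ('g1', 80, 75);  -- g2 has no score row
    """
)

# Anti-join: keep events whose LEFT JOIN against scores matched nothing
missing = conn.execute(
    """
    SELECT e.event_id
    FROM events e
    LEFT JOIN scores s ON e.event_id = s.event_id
    WHERE s.event_id IS NULL
    """
).fetchall()

print(missing)  # [('g2',)]
```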
diff --git a/scripts/analysis/analyze_overtime_performance.py b/scripts/analysis/analyze_overtime_performance.py
new file mode 100644
index 000000000..feeb1cace
--- /dev/null
+++ b/scripts/analysis/analyze_overtime_performance.py
@@ -0,0 +1,237 @@
+"""Analyze overtime.ag betting performance.
+
+This script loads saved parquet data and generates performance reports.
+
+Usage:
+    uv run python scripts/analysis/analyze_overtime_performance.py
+"""
+
+import sys
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+# Add src to path (this script lives in scripts/analysis/, two levels below the repo root)
+sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src"))
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+
+pd.set_option("display.max_columns", None)
+pd.set_option("display.width", None)
+pd.set_option("display.max_colwidth", 50)
+
+
+def load_latest_data(data_dir: Path) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+ """Load the most recent data files.
+
+ Returns:
+ Tuple of (account_df, bets_df, figures_df)
+ """
+ account_files = sorted((data_dir / "account_balance").glob("*.parquet"))
+ bets_files = sorted((data_dir / "open_bets").glob("*.parquet"))
+ figures_files = sorted((data_dir / "daily_figures").glob("*.parquet"))
+
+ account_df = read_parquet_df(str(account_files[-1])) if account_files else pd.DataFrame()
+ bets_df = read_parquet_df(str(bets_files[-1])) if bets_files else pd.DataFrame()
+ figures_df = read_parquet_df(str(figures_files[-1])) if figures_files else pd.DataFrame()
+
+ return account_df, bets_df, figures_df
+
+
+def print_section(title: str):
+ """Print a section header."""
+ print("\n" + "=" * 80)
+ print(title.center(80))
+ print("=" * 80 + "\n")
+
+
+def analyze_account(account_df: pd.DataFrame):
+ """Analyze account balance."""
+ if account_df.empty:
+ print("[WARNING] No account data found")
+ return
+
+ latest = account_df.iloc[-1]
+
+ print(f"Current Balance: ${latest['balance']:,.2f}")
+ print(f"Credit Limit: ${latest['credit_limit']:,.2f}")
+ print(f"Pending Bets: ${latest['pending']:,.2f}")
+ print(f"Available Balance: ${latest['available_balance']:,.2f}")
+ print(f"Casino Balance: ${latest['casino_balance']:,.2f}")
+
+    # Calculate utilization (guard against a zero available balance)
+    if latest["available_balance"]:
+        utilization = (latest["pending"] / latest["available_balance"]) * 100
+        print(f"\nBankroll Utilization: {utilization:.1f}%")
+
+        if utilization > 20:
+            print("[WARNING] High bankroll utilization - consider reducing exposure")
+
+
+def analyze_open_bets(bets_df: pd.DataFrame):
+ """Analyze open bets."""
+ if bets_df.empty:
+ print("[INFO] No open bets")
+ return
+
+ print(f"Total Open Bets: {len(bets_df)}")
+ print(f"Total at Risk: ${bets_df['risk_amount'].sum():,.2f}")
+ print(f"Potential Win: ${bets_df['to_win_amount'].sum():,.2f}")
+
+ # Breakdown by bet type
+ print("\nBet Type Breakdown:")
+ type_summary = bets_df.groupby("bet_type").agg(
+ {
+ "risk_amount": ["count", "sum"],
+ "to_win_amount": "sum",
+ }
+ )
+ print(type_summary)
+
+ # Show individual bets
+ print("\nOpen Bets Details:")
+ display_cols = ["team", "line", "odds", "risk_amount", "to_win_amount"]
+ available_cols = [col for col in display_cols if col in bets_df.columns]
+
+ if available_cols:
+ print(bets_df[available_cols].to_string(index=False))
+ else:
+ print(
+ bets_df[["bet_type", "details", "risk_amount", "to_win_amount"]].to_string(index=False)
+ )
+
+
+def analyze_performance(figures_df: pd.DataFrame):
+ """Analyze betting performance."""
+ if figures_df.empty:
+ print("[WARNING] No performance data found")
+ return
+
+ # Sort by date descending
+ figures_df["date"] = pd.to_datetime(figures_df["date"])
+ figures_df = figures_df.sort_values("date", ascending=False)
+
+    # Current week analysis (guard against a missing current_week row)
+    current_week_rows = figures_df[figures_df["period"] == "current_week"]
+    if current_week_rows.empty:
+        print("[WARNING] No current-week row in figures data")
+        return
+    current_week = current_week_rows.iloc[0]
+ print("Current Week:")
+ print(f" Starting Balance: ${current_week['starting_balance']:,.2f}")
+ print(f" Week Total P&L: ${current_week['week_total']:,.2f}")
+ print(f" Ending Balance: ${current_week['ending_balance']:,.2f}")
+
+ if current_week["starting_balance"] != 0:
+ roi = (current_week["week_total"] / abs(current_week["starting_balance"])) * 100
+ print(f" ROI: {roi:+.2f}%")
+
+ # Last week analysis
+ last_week_data = figures_df[figures_df["period"] == "last_week"]
+ if not last_week_data.empty:
+ last_week = last_week_data.iloc[0]
+ print("\nLast Week:")
+ print(f" Week Total P&L: ${last_week['week_total']:,.2f}")
+ print(f" Ending Balance: ${last_week['ending_balance']:,.2f}")
+
+ if last_week["starting_balance"] != 0:
+ roi = (last_week["week_total"] / abs(last_week["starting_balance"])) * 100
+ print(f" ROI: {roi:+.2f}%")
+
+ # Overall performance
+ past_weeks = figures_df[figures_df["period"] == "past_week"]
+ if not past_weeks.empty:
+ print(f"\nHistorical Performance ({len(past_weeks)} weeks):")
+ total_pnl = past_weeks["week_total"].sum()
+ avg_weekly = past_weeks["week_total"].mean()
+ win_rate = (past_weeks["week_total"] > 0).sum() / len(past_weeks) * 100
+
+ print(f" Total P&L: ${total_pnl:,.2f}")
+ print(f" Avg Weekly P&L: ${avg_weekly:,.2f}")
+ print(f" Win Rate: {win_rate:.1f}%")
+
+ # Best and worst weeks
+ best_week = past_weeks.loc[past_weeks["week_total"].idxmax()]
+ worst_week = past_weeks.loc[past_weeks["week_total"].idxmin()]
+
+ print(f"\n Best Week: ${best_week['week_total']:,.2f} ({best_week['date'].date()})")
+ print(f" Worst Week: ${worst_week['week_total']:,.2f} ({worst_week['date'].date()})")
+
+
+def generate_recommendations(
+ account_df: pd.DataFrame, bets_df: pd.DataFrame, figures_df: pd.DataFrame
+):
+ """Generate betting recommendations based on performance."""
+ print_section("RECOMMENDATIONS")
+
+ if account_df.empty or figures_df.empty:
+ print("[INFO] Insufficient data for recommendations")
+ return
+
+    latest_account = account_df.iloc[-1]
+    current_week_rows = figures_df[figures_df["period"] == "current_week"]
+    if current_week_rows.empty:
+        print("[INFO] No current-week row in figures data")
+        return
+    current_week = current_week_rows.iloc[0]
+
+ # Check current week performance
+ if current_week["week_total"] < 0:
+ loss_pct = (current_week["week_total"] / abs(current_week["starting_balance"])) * 100
+ if loss_pct < -10:
+ print("[ALERT] Current week down >10% - consider reducing unit size")
+
+ # Check pending exposure
+ if not bets_df.empty:
+ pending_ratio = latest_account["pending"] / latest_account["available_balance"]
+ if pending_ratio > 0.25:
+ print("[ALERT] High pending exposure (>25% of bankroll) - avoid adding more bets")
+
+ # Check recent trend
+ past_weeks = figures_df[figures_df["period"] == "past_week"]
+ if len(past_weeks) >= 3:
+ recent_three = past_weeks.head(3)["week_total"]
+ if (recent_three < 0).all():
+ print("[ALERT] Three consecutive losing weeks - review betting strategy")
+
+ # Positive indicators
+ last_week = figures_df[figures_df["period"] == "last_week"]
+ if not last_week.empty and last_week.iloc[0]["week_total"] > 0:
+ print("[OK] Positive momentum from last week")
+
+ print("\n[TIP] Focus on Closing Line Value (CLV) rather than win rate")
+ print("[TIP] Track line movements to validate model predictions")
+
+
+def main():
+ """Main execution function."""
+    project_root = Path(__file__).parent.parent.parent
+ data_dir = project_root / "data" / "overtime"
+
+ if not data_dir.exists():
+ print(f"[ERROR] Data directory not found: {data_dir}")
+ print("Run the snapshot script first to collect data")
+ return
+
+ print_section("OVERTIME.AG PERFORMANCE ANALYSIS")
+ print(f"Data Directory: {data_dir}")
+ print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+
+ # Load data
+ account_df, bets_df, figures_df = load_latest_data(data_dir)
+
+ # Account analysis
+ print_section("ACCOUNT STATUS")
+ analyze_account(account_df)
+
+ # Open bets analysis
+ print_section("OPEN BETS")
+ analyze_open_bets(bets_df)
+
+ # Performance analysis
+ print_section("PERFORMANCE ANALYSIS")
+ analyze_performance(figures_df)
+
+ # Recommendations
+ generate_recommendations(account_df, bets_df, figures_df)
+
+ print("\n" + "=" * 80)
+
+
+if __name__ == "__main__":
+ main()
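Review note: `load_latest_data` picks `sorted(...)[-1]`, i.e. the lexicographically last filename, which is only the newest snapshot when filenames embed a zero-padded sortable timestamp. A filename-agnostic alternative by modification time, sketched with a throwaway temp directory (the helper name `latest_by_mtime` is hypothetical, not part of the codebase):

```python
import os
import tempfile
import time
from pathlib import Path


def latest_by_mtime(directory: Path, pattern: str = "*.parquet"):
    """Return the most recently modified file matching pattern, or None."""
    files = list(directory.glob(pattern))
    return max(files, key=lambda p: p.stat().st_mtime) if files else None


with tempfile.TemporaryDirectory() as d:
    older = Path(d) / "zzz.parquet"  # lexicographically last, but older
    newer = Path(d) / "aaa.parquet"
    older.write_bytes(b"")
    newer.write_bytes(b"")
    past = time.time() - 3600
    os.utime(older, (past, past))  # backdate the lexicographically-last file

    picked = latest_by_mtime(Path(d))
    print(picked.name)  # aaa.parquet
```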
diff --git a/scripts/analysis/analyze_shap_legacy.py b/scripts/analysis/analyze_shap_legacy.py
new file mode 100644
index 000000000..c31a7e637
--- /dev/null
+++ b/scripts/analysis/analyze_shap_legacy.py
@@ -0,0 +1,154 @@
+"""SHAP feature importance analysis for legacy dataset models.
+
+Analyzes which features are most important for spreads and totals prediction
+to identify candidates for feature pruning.
+
+Usage:
+ python scripts/analysis/analyze_shap_legacy.py
+"""
+
+from __future__ import annotations
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+def analyze_model(model_type: str) -> None:
+ """Analyze SHAP feature importance for a model type.
+
+ Args:
+ model_type: "spreads" or "totals"
+ """
+ logger.info(f"\n[OK] === Analyzing {model_type.upper()} Model ===\n")
+
+ # Load legacy complete dataset
+ df = read_parquet_df("data/staging/complete_dataset.parquet")
+
+ # Define features based on model type
+ if model_type == "spreads":
+ feature_cols = [
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "opening_spread_range",
+ "closing_spread_range",
+ "num_books_spread",
+ "spread_magnitude_movement",
+ "opening_home_implied_prob",
+ "closing_home_implied_prob",
+ "home_is_favorite",
+ ]
+ label_col = "home_covered_spread"
+ else: # totals
+ feature_cols = [
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "opening_total_range",
+ "closing_total_range",
+ "num_books_total",
+ "total_movement",
+ ]
+ label_col = "went_over"
+
+ # Filter to games with required features
+ required_cols = feature_cols + [label_col]
+ df_clean = df.dropna(subset=required_cols)
+
+ X = df_clean[feature_cols]
+ y = df_clean[label_col]
+
+    # Verify the seed-ensemble metadata exists (the model itself is retrained below)
+ model_path = Path(f"data/models/{model_type}_2026_seed_ensemble_legacy_metadata.json")
+ if not model_path.exists():
+ logger.error(f"Model metadata not found: {model_path}")
+ return
+
+ # Train a single model with best seed for SHAP analysis
+ import xgboost as xgb
+
+ X_train, X_val, y_train, y_val = train_test_split(
+ X, y, test_size=0.2, random_state=42, stratify=y
+ )
+
+ # Use simple parameters
+ params = {
+ "n_estimators": 300,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 5,
+ "gamma": 1.0,
+ "reg_alpha": 1.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "early_stopping_rounds": 20,
+ "random_state": 1024, # Best seed
+ }
+
+ model = xgb.XGBClassifier(**params)
+ model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
+
+ logger.info(f"Trained model on {len(X_train)} samples")
+ logger.info(f"Validation set: {len(X_val)} samples\n")
+
+ # Use XGBoost built-in feature importance (gain-based)
+ logger.info("Calculating feature importance (using XGBoost gain)...")
+ importance_dict = model.get_booster().get_score(importance_type="gain")
+
+ # Convert to pandas Series with feature names
+ importance_values = [importance_dict.get(f"f{i}", 0.0) for i in range(len(feature_cols))]
+ importance = pd.Series(importance_values, index=feature_cols).sort_values(ascending=False)
+
+ # Calculate percentage
+ total_importance = importance.sum()
+ importance_pct = (importance / total_importance * 100).round(2)
+
+ # Create summary dataframe
+ importance_df = pd.DataFrame(
+ {
+ "Feature": importance.index,
+ "Importance": importance.values.round(4),
+ "Percentage": importance_pct.values,
+ }
+ ).reset_index(drop=True)
+
+ logger.info(f"\n[OK] === {model_type.upper()} Feature Importance ===\n")
+ logger.info(importance_df.to_string(index=False))
+
+ # Identify weak features (<1% importance)
+ weak_features = importance_df[importance_df["Percentage"] < 1.0]
+ if len(weak_features) > 0:
+ logger.info("\n[WARNING] Weak features (<1% importance):")
+ for _, row in weak_features.iterrows():
+ logger.info(f" - {row['Feature']}: {row['Percentage']:.2f}%")
+ logger.info(f"\nConsider removing {len(weak_features)} weak features")
+ else:
+ logger.info("\n[OK] All features have >1% importance")
+
+ # Save results
+ output_dir = Path("data/analysis")
+ output_dir.mkdir(parents=True, exist_ok=True)
+ output_path = output_dir / f"shap_importance_{model_type}_legacy.csv"
+ importance_df.to_csv(output_path, index=False)
+ logger.info(f"\n[OK] Saved importance to {output_path}\n")
+
+
+def main() -> None:
+ """Run SHAP analysis for both model types."""
+ analyze_model("spreads")
+ analyze_model("totals")
+
+
+if __name__ == "__main__":
+ main()
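Review note: `get_score(importance_type="gain")` returns a dict keyed by positional names (`f0`, `f1`, ...) and silently omits features that never appear in a split, which is why the script defaults absent keys to `0.0`. A pure-Python sketch of that mapping plus the <1% pruning rule (feature names and gain values here are illustrative, not the real feature list):

```python
feature_cols = ["closing_spread", "opening_spread", "num_books"]  # illustrative
gain = {"f0": 120.0, "f2": 1.0}  # "f1" never used in a split, so it is absent

# Map positional keys back to names, defaulting absent features to 0.0
values = [gain.get(f"f{i}", 0.0) for i in range(len(feature_cols))]
total = sum(values)
pct = {name: 100 * v / total for name, v in zip(feature_cols, values)}

# Flag weak features below the 1% threshold used by the script
weak = [name for name, p in pct.items() if p < 1.0]

print(weak)  # ['opening_spread', 'num_books']
```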
diff --git a/scripts/analysis/backtest_2026_02_05.py b/scripts/analysis/backtest_2026_02_05.py
new file mode 100644
index 000000000..31c6c4e25
--- /dev/null
+++ b/scripts/analysis/backtest_2026_02_05.py
@@ -0,0 +1,273 @@
+"""Backtest 2026-02-05 games against trained models."""
+
+from __future__ import annotations
+
+import logging
+from pathlib import Path
+
+import joblib
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+logging.basicConfig(
+ level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Backtest 2026-02-05 games."""
+ # Note: Most games on 2026-02-05 UTC are actually 2026-02-04 Pacific time
+ target_date_utc = "2026-02-05"
+ target_date_pacific = "2026-02-04" # Most games
+ base_path = Path(__file__).parent.parent.parent
+
+ # Load models
+ logger.info("Loading trained models...")
+ spreads_model = joblib.load(base_path / "models" / "spreads_2026_optimized_v2.pkl")
+ totals_model = joblib.load(base_path / "models" / "totals_2026_optimized_v2.pkl")
+ home_score_model = joblib.load(base_path / "models" / "home_score_2026.pkl")
+ away_score_model = joblib.load(base_path / "models" / "away_score_2026.pkl")
+
+ # Get games with scores from database
+ logger.info(f"Fetching games for {target_date_utc} (UTC)...")
+ db = OddsAPIDatabase(str(base_path / "data" / "odds_api" / "odds_api.sqlite3"))
+
+ query = """
+ SELECT
+ e.event_id,
+ e.home_team,
+ e.away_team,
+ e.commence_time,
+ s.home_score,
+ s.away_score
+ FROM events e
+ JOIN scores s ON e.event_id = s.event_id
+ WHERE DATE(e.commence_time) = ?
+ ORDER BY e.commence_time
+ """
+
+ games_df = pd.read_sql_query(query, db.conn, params=(target_date_utc,))
+ logger.info(f"Found {len(games_df)} games with scores on {target_date_utc} UTC\n")
+
+ if len(games_df) == 0:
+ logger.warning(f"No games with scores found for {target_date_utc}")
+ return
+
+ # Print games
+ for _, game in games_df.iterrows():
+ logger.info(
+ f" {game['away_team']} @ {game['home_team']}: "
+ f"{game['away_score']}-{game['home_score']}"
+ )
+
+ # Build features using FeatureEngineer
+ # Note: Use Pacific timezone dates since staging uses Pacific time
+ logger.info(
+ f"\nBuilding features using staging layer "
+ f"(Pacific timezone: {target_date_pacific} to 2026-02-05)..."
+ )
+ engineer = FeatureEngineer(staging_path=str(base_path / "data" / "staging"))
+
+ # Build spreads features and targets (Pacific timezone date range)
+ X_spreads, y_spreads = engineer.build_spreads_dataset(
+ start_date=target_date_pacific, end_date="2026-02-05"
+ )
+
+ # Build totals features and targets (Pacific timezone date range)
+ X_totals, y_totals = engineer.build_totals_dataset(
+ start_date=target_date_pacific, end_date="2026-02-05"
+ )
+
+ logger.info(f"Spreads features: {len(X_spreads)} games, {len(X_spreads.columns)} features")
+ logger.info(f"Totals features: {len(X_totals)} games, {len(X_totals.columns)} features")
+
+ if len(X_spreads) == 0 and len(X_totals) == 0:
+ logger.warning(
+ "No features could be built. Check that staging files contain data for this date."
+ )
+ return
+
+ # Load event metadata from staging (once for both spreads and totals)
+ staging_events = engineer.load_staging_data(
+ start_date=target_date_pacific,
+ end_date="2026-02-05",
+ season=2026,
+ require_line_features=False,
+ use_home_away=True,
+ )
+
+ logger.info(f"Loaded {len(staging_events)} staging events with metadata\n")
+
+ # Generate predictions
+ results = []
+
+ # Spreads predictions
+ if len(X_spreads) > 0:
+ spreads_pred = spreads_model.predict_proba(X_spreads)[:, 1]
+
+ for idx in range(len(X_spreads)):
+ event_row = staging_events.iloc[idx]
+ actual_covered = y_spreads.iloc[idx]
+ pred_prob = spreads_pred[idx]
+ pred_outcome = pred_prob > 0.5
+
+ results.append(
+ {
+ "event_id": event_row["event_id"],
+ "home_team": event_row["home_team"],
+ "away_team": event_row["away_team"],
+ "favorite_team": event_row.get("favorite_team", "Unknown"),
+ "underdog_team": event_row.get("underdog_team", "Unknown"),
+ "market": "spread",
+ "prediction": "favorite" if pred_outcome else "underdog",
+ "prob_favorite": pred_prob,
+ "spread_points": event_row.get("closing_spread", 0.0),
+ "home_score": event_row["home_score"],
+ "away_score": event_row["away_score"],
+ "actual_covered_favorite": actual_covered,
+ "correct": pred_outcome == actual_covered,
+ }
+ )
+
+ # Totals predictions
+ if len(X_totals) > 0:
+ # Check if we need to add missing line features
+ # Models were trained with 31 features but we only have 28 without line features
+ if len(X_totals.columns) == 28:
+ logger.info("Adding missing line features (opening/closing totals) with defaults...")
+ X_totals["opening_total"] = X_totals["expected_total"]
+ X_totals["closing_total"] = X_totals["expected_total"]
+ X_totals["total_movement"] = 0.0
+
+ totals_pred = totals_model.predict_proba(X_totals)[:, 1]
+
+ # Score predictions
+ home_pred = home_score_model.predict(X_totals)
+ away_pred = away_score_model.predict(X_totals)
+
+ for idx in range(len(X_totals)):
+ event_row = staging_events.iloc[idx]
+ actual_went_over = y_totals.iloc[idx]
+ pred_prob = totals_pred[idx]
+ pred_outcome = pred_prob > 0.5
+
+ total_score = event_row["home_score"] + event_row["away_score"]
+
+ results.append(
+ {
+ "event_id": event_row["event_id"],
+ "home_team": event_row["home_team"],
+ "away_team": event_row["away_team"],
+ "market": "total",
+ "prediction": "over" if pred_outcome else "under",
+ "prob_over": pred_prob,
+ "total_line": event_row.get(
+ "closing_total", X_totals["expected_total"].iloc[idx]
+ ),
+ "home_score": event_row["home_score"],
+ "away_score": event_row["away_score"],
+ "actual_total": total_score,
+ "actual_went_over": actual_went_over,
+ "correct": pred_outcome == actual_went_over,
+ "pred_home_score": home_pred[idx],
+ "pred_away_score": away_pred[idx],
+ "pred_total": home_pred[idx] + away_pred[idx],
+ }
+ )
+
+ results_df = pd.DataFrame(results)
+
+ # Calculate metrics
+ logger.info("\n" + "=" * 80)
+ logger.info(f"BACKTEST RESULTS - {target_date_utc} (UTC) = {target_date_pacific} Pacific")
+ logger.info("=" * 80)
+
+ # Spreads performance
+ spreads_results = results_df[results_df["market"] == "spread"]
+ if len(spreads_results) > 0:
+ spreads_accuracy = spreads_results["correct"].mean()
+ logger.info("\nSPREADS MODEL:")
+ logger.info(f" Games: {len(spreads_results)}")
+ logger.info(f" Accuracy: {spreads_accuracy:.1%}")
+ logger.info(f" Correct: {spreads_results['correct'].sum()}")
+ logger.info(f" Incorrect: {(~spreads_results['correct']).sum()}")
+
+ # Show confidence breakdown
+ high_conf = spreads_results[
+ (spreads_results["prob_favorite"] > 0.6) | (spreads_results["prob_favorite"] < 0.4)
+ ]
+ if len(high_conf) > 0:
+ logger.info(
+ f" High confidence (>60% or <40%): {high_conf['correct'].mean():.1%} "
+ f"({len(high_conf)} games)"
+ )
+
+ # Totals performance
+ totals_results = results_df[results_df["market"] == "total"]
+ if len(totals_results) > 0:
+ totals_accuracy = totals_results["correct"].mean()
+ mae = (totals_results["actual_total"] - totals_results["pred_total"]).abs().mean()
+ rmse = ((totals_results["actual_total"] - totals_results["pred_total"]) ** 2).mean() ** 0.5
+
+ logger.info("\nTOTALS MODEL:")
+ logger.info(f" Games: {len(totals_results)}")
+ logger.info(f" Accuracy: {totals_accuracy:.1%}")
+ logger.info(f" Correct: {totals_results['correct'].sum()}")
+ logger.info(f" Incorrect: {(~totals_results['correct']).sum()}")
+ logger.info("\nSCORE PREDICTION:")
+ logger.info(f" MAE: {mae:.2f} points")
+ logger.info(f" RMSE: {rmse:.2f} points")
+
+ # Show confidence breakdown
+ high_conf = totals_results[
+ (totals_results["prob_over"] > 0.6) | (totals_results["prob_over"] < 0.4)
+ ]
+ if len(high_conf) > 0:
+ logger.info(
+ f" High confidence (>60% or <40%): {high_conf['correct'].mean():.1%} "
+ f"({len(high_conf)} games)"
+ )
+
+ # Save detailed results
+ output_path = base_path / "predictions" / f"{target_date_utc}_backtest.csv"
+ results_df.to_csv(output_path, index=False)
+ logger.info(f"\nDetailed results saved to: {output_path}")
+
+ # Show all predictions
+ logger.info("\n" + "=" * 80)
+ logger.info("DETAILED PREDICTIONS:")
+ logger.info("=" * 80)
+
+ # Spreads
+ spreads_results = results_df[results_df["market"] == "spread"]
+ if len(spreads_results) > 0:
+ logger.info("\nSPREADS:")
+ for _, row in spreads_results.iterrows():
+ status = "[OK]" if row["correct"] else "[WRONG]"
+ logger.info(
+ f" {status} {row['away_team']} @ {row['home_team']}: "
+ f"Pred={row['prediction']} ({row['prob_favorite']:.1%}), "
+ f"Spread={row['spread_points']:.1f}, "
+ f"Score={int(row['away_score'])}-{int(row['home_score'])}"
+ )
+
+ # Totals
+ totals_results = results_df[results_df["market"] == "total"]
+ if len(totals_results) > 0:
+ logger.info("\nTOTALS:")
+ for _, row in totals_results.iterrows():
+ status = "[OK]" if row["correct"] else "[WRONG]"
+ logger.info(
+ f" {status} {row['away_team']} @ {row['home_team']}: "
+ f"Pred={row['prediction']} ({row['prob_over']:.1%}), "
+ f"Line={row['total_line']:.1f}, "
+ f"Actual={int(row['actual_total'])} "
+ f"(pred={row['pred_total']:.1f})"
+ )
+
+
+if __name__ == "__main__":
+ main()
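Review note: the totals backtest reports MAE and RMSE of the predicted total score. The definitions used above, reduced to plain lists (the sample numbers are made up):

```python
import math

actual = [145, 152, 160]            # actual game totals
predicted = [150.0, 150.0, 150.0]   # model's predicted totals

errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error

print(round(mae, 2), round(rmse, 2))  # 5.67 6.56
```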
diff --git a/scripts/analysis/backtest_kenpom_fanmatch.py b/scripts/analysis/backtest_kenpom_fanmatch.py
new file mode 100644
index 000000000..c4910ec01
--- /dev/null
+++ b/scripts/analysis/backtest_kenpom_fanmatch.py
@@ -0,0 +1,354 @@
+"""Backtest KenPom FanMatch predictions against actual results.
+
+Evaluates KenPom FanMatch prediction accuracy for:
+- Predicted scores (MAE, RMSE)
+- Spread predictions (cover rate, ATS accuracy)
+- Total predictions (over/under accuracy)
+- Win probability calibration
+
+Usage:
+ python scripts/analysis/backtest_kenpom_fanmatch.py --start 2025-11-01 --end 2026-02-06
+ python scripts/analysis/backtest_kenpom_fanmatch.py --season 2026
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+from datetime import date, datetime, timedelta
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+import pandas as pd
+
+from sports_betting_edge.adapters.kenpom import KenPomAdapter
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+async def fetch_fanmatch_for_date_range(start_date: date, end_date: date) -> pd.DataFrame:
+ """Fetch KenPom FanMatch predictions for a date range.
+
+ Args:
+ start_date: Start date
+ end_date: End date (inclusive)
+
+ Returns:
+ DataFrame with FanMatch predictions
+ """
+ logger.info(f"Fetching FanMatch predictions from {start_date} to {end_date}...")
+
+ kenpom = KenPomAdapter()
+ all_predictions = []
+
+ try:
+ # Fetch predictions for each date
+ current_date = start_date
+ while current_date <= end_date:
+ date_str = current_date.isoformat()
+ try:
+ games = await kenpom.get_fanmatch(date_str)
+ logger.info(f" {date_str}: {len(games)} games")
+
+ for game in games:
+ home_pred = game.get("HomePred")
+ visitor_pred = game.get("VisitorPred")
+
+ all_predictions.append(
+ {
+ "game_date": date_str,
+ "kenpom_game_id": game.get("GameID"),
+ "kenpom_home": game.get("Home"),
+ "kenpom_visitor": game.get("Visitor"),
+ "kenpom_home_rank": game.get("HomeRank"),
+ "kenpom_visitor_rank": game.get("VisitorRank"),
+ "kenpom_home_pred": home_pred,
+ "kenpom_visitor_pred": visitor_pred,
+                            "kenpom_predicted_spread": (
+                                home_pred - visitor_pred
+                                if home_pred is not None and visitor_pred is not None
+                                else None
+                            ),
+                            "kenpom_predicted_total": (
+                                home_pred + visitor_pred
+                                if home_pred is not None and visitor_pred is not None
+                                else None
+                            ),
+ "kenpom_home_wp": game.get("HomeWP"),
+ "kenpom_pred_tempo": game.get("PredTempo"),
+ "kenpom_thrill_score": game.get("ThrillScore"),
+ }
+ )
+ except Exception as e:
+ logger.warning(f" Failed to fetch {date_str}: {e}")
+
+ current_date += timedelta(days=1)
+
+ finally:
+ await kenpom.close()
+
+ predictions_df = pd.DataFrame(all_predictions)
+ logger.info(f" Total predictions: {len(predictions_df)}")
+
+ return predictions_df
+
+
+def match_with_actual_results(fanmatch_df: pd.DataFrame, db: OddsAPIDatabase) -> pd.DataFrame:
+ """Match FanMatch predictions with actual game results.
+
+ Args:
+ fanmatch_df: DataFrame with FanMatch predictions
+ db: Database with actual results
+
+ Returns:
+ DataFrame with predictions and actual results merged
+ """
+ logger.info("Matching FanMatch predictions with actual results...")
+
+ # Get completed games with scores
+ query = """
+ SELECT
+ e.event_id,
+ e.home_team,
+ e.away_team,
+ e.commence_time,
+ s.home_score,
+ s.away_score,
+ DATE(e.commence_time) as game_date
+ FROM events e
+ INNER JOIN scores s ON e.event_id = s.event_id
+ WHERE s.home_score IS NOT NULL
+ AND s.away_score IS NOT NULL
+ AND s.completed = 1
+ ORDER BY e.commence_time
+ """
+
+ actual_df = pd.read_sql_query(query, db.conn)
+ logger.info(f" Found {len(actual_df)} completed games in database")
+
+ # Calculate actual spread and total
+ actual_df["actual_home_score"] = actual_df["home_score"]
+ actual_df["actual_away_score"] = actual_df["away_score"]
+ actual_df["actual_spread"] = actual_df["home_score"] - actual_df["away_score"]
+ actual_df["actual_total"] = actual_df["home_score"] + actual_df["away_score"]
+ actual_df["actual_home_won"] = (actual_df["actual_spread"] > 0).astype(int)
+
+ # Match by date and team names
+ matched = []
+ for _, fm in fanmatch_df.iterrows():
+ kp_home = fm["kenpom_home"]
+ kp_visitor = fm["kenpom_visitor"]
+ game_date = fm["game_date"]
+
+ if not kp_home or not kp_visitor:
+ continue
+
+ # Find matching game in actual results
+ date_games = actual_df[actual_df["game_date"] == game_date]
+
+ for _, actual in date_games.iterrows():
+ our_home = actual["home_team"]
+ our_away = actual["away_team"]
+
+ # Match if KenPom name is contained in our full name
+ home_match = kp_home in our_home or our_home in kp_home
+ away_match = kp_visitor in our_away or our_away in kp_visitor
+
+ if home_match and away_match:
+ matched.append(
+ {
+ **fm.to_dict(),
+ "event_id": actual["event_id"],
+ "our_home_team": our_home,
+ "our_away_team": our_away,
+ "actual_home_score": actual["actual_home_score"],
+ "actual_away_score": actual["actual_away_score"],
+ "actual_spread": actual["actual_spread"],
+ "actual_total": actual["actual_total"],
+ "actual_home_won": actual["actual_home_won"],
+ }
+ )
+ break
+
+ matched_df = pd.DataFrame(matched)
+ logger.info(f" Matched {len(matched_df)} games with actual results")
+
+ return matched_df
+
+
+def calculate_metrics(results_df: pd.DataFrame) -> dict:
+ """Calculate prediction accuracy metrics.
+
+ Args:
+ results_df: DataFrame with predictions and actuals
+
+ Returns:
+ Dictionary of metrics
+ """
+ logger.info("Calculating prediction accuracy metrics...")
+
+ # Score prediction accuracy
+ home_mae = abs(results_df["kenpom_home_pred"] - results_df["actual_home_score"]).mean()
+ away_mae = abs(results_df["kenpom_visitor_pred"] - results_df["actual_away_score"]).mean()
+
+ home_rmse = (
+ (results_df["kenpom_home_pred"] - results_df["actual_home_score"]) ** 2
+ ).mean() ** 0.5
+ away_rmse = (
+ (results_df["kenpom_visitor_pred"] - results_df["actual_away_score"]) ** 2
+ ).mean() ** 0.5
+
+ # Spread prediction accuracy
+ spread_mae = abs(results_df["kenpom_predicted_spread"] - results_df["actual_spread"]).mean()
+ spread_rmse = (
+ (results_df["kenpom_predicted_spread"] - results_df["actual_spread"]) ** 2
+ ).mean() ** 0.5
+
+ # Total prediction accuracy
+ total_mae = abs(results_df["kenpom_predicted_total"] - results_df["actual_total"]).mean()
+ total_rmse = (
+ (results_df["kenpom_predicted_total"] - results_df["actual_total"]) ** 2
+ ).mean() ** 0.5
+
+ # Win prediction accuracy
+ results_df["predicted_home_won"] = (results_df["kenpom_predicted_spread"] > 0).astype(int)
+ win_accuracy = (results_df["predicted_home_won"] == results_df["actual_home_won"]).mean()
+
+ # Win probability calibration (binned)
+ wp_bins = [0, 20, 40, 60, 80, 100]
+ results_df["wp_bin"] = pd.cut(results_df["kenpom_home_wp"], bins=wp_bins, labels=wp_bins[1:])
+ wp_calibration = results_df.groupby("wp_bin")["actual_home_won"].mean() * 100
+
+ metrics = {
+ "n_games": len(results_df),
+ "score_prediction": {
+ "home_mae": round(home_mae, 2),
+ "away_mae": round(away_mae, 2),
+ "home_rmse": round(home_rmse, 2),
+ "away_rmse": round(away_rmse, 2),
+ },
+ "spread_prediction": {
+ "mae": round(spread_mae, 2),
+ "rmse": round(spread_rmse, 2),
+ },
+ "total_prediction": {
+ "mae": round(total_mae, 2),
+ "rmse": round(total_rmse, 2),
+ },
+ "win_prediction": {"accuracy": round(win_accuracy, 4)},
+ "win_probability_calibration": wp_calibration.to_dict(),
+ }
+
+ return metrics
+
+
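Review note: the calibration check above buckets games by predicted home win probability (`pd.cut` with right-closed bins) and compares each bucket against the realized home-win rate. The same idea without pandas, on made-up numbers:

```python
from collections import defaultdict


def calibration(win_probs, outcomes, edges=(0, 20, 40, 60, 80, 100)):
    """Bucket by predicted win % and return the realized win % per bucket."""
    buckets = defaultdict(list)
    for wp, won in zip(win_probs, outcomes):
        for lo, hi in zip(edges, edges[1:]):
            if lo < wp <= hi:  # right-closed bins, like pd.cut's default
                buckets[hi].append(won)
                break
    return {hi: 100 * sum(v) / len(v) for hi, v in sorted(buckets.items())}


cal = calibration([10, 15, 75, 85, 90], [0, 0, 1, 1, 0])
print(cal)  # {20: 0.0, 80: 100.0, 100: 50.0}
```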
+async def main_async(args: argparse.Namespace) -> None:
+ """Main async function."""
+ logger.info("[OK] === KenPom FanMatch Backtesting ===\n")
+
+ # Determine date range
+ if args.season:
+ # Season runs roughly Nov - Apr
+ start_date = date(args.season - 1, 11, 1)
+ end_date = date.today()
+ logger.info(f"Using season {args.season}: {start_date} to {end_date}")
+ else:
+ start_date = datetime.strptime(args.start, "%Y-%m-%d").date()
+ end_date = datetime.strptime(args.end, "%Y-%m-%d").date()
+ logger.info(f"Using date range: {start_date} to {end_date}")
+
+ # Fetch FanMatch predictions
+ fanmatch_df = await fetch_fanmatch_for_date_range(start_date, end_date)
+
+ if len(fanmatch_df) == 0:
+ logger.warning("No FanMatch predictions found for date range")
+ return
+
+ # Match with actual results
+ db = OddsAPIDatabase(args.db_path)
+ results_df = match_with_actual_results(fanmatch_df, db)
+
+ if len(results_df) == 0:
+ logger.warning("No matches found between predictions and actual results")
+ return
+
+ # Calculate metrics
+ metrics = calculate_metrics(results_df)
+
+ # Display results
+ logger.info("\n=== Backtest Results ===")
+ logger.info(f"Games analyzed: {metrics['n_games']}")
+ logger.info("\nScore Prediction Accuracy:")
+ logger.info(
+ f" Home: MAE={metrics['score_prediction']['home_mae']}, "
+ f"RMSE={metrics['score_prediction']['home_rmse']}"
+ )
+ logger.info(
+ f" Away: MAE={metrics['score_prediction']['away_mae']}, "
+ f"RMSE={metrics['score_prediction']['away_rmse']}"
+ )
+ logger.info("\nSpread Prediction Accuracy:")
+ logger.info(
+ f" MAE={metrics['spread_prediction']['mae']}, RMSE={metrics['spread_prediction']['rmse']}"
+ )
+ logger.info("\nTotal Prediction Accuracy:")
+ logger.info(
+ f" MAE={metrics['total_prediction']['mae']}, RMSE={metrics['total_prediction']['rmse']}"
+ )
+ logger.info(f"\nWin Prediction Accuracy: {metrics['win_prediction']['accuracy']:.1%}")
+ logger.info("\nWin Probability Calibration:")
+ for wp_bin, actual_rate in metrics["win_probability_calibration"].items():
+ logger.info(f" {wp_bin}% predicted → {actual_rate:.1f}% actual")
+
+ # Save detailed results
+ if args.output:
+ results_df.to_csv(args.output, index=False)
+ logger.info(f"\n[OK] Saved detailed results to {args.output}")
+
+ # Save metrics summary
+ metrics_path = args.output.with_suffix(".json")
+ import json
+
+ with open(metrics_path, "w") as f:
+ json.dump(metrics, f, indent=2, default=str)
+ logger.info(f"[OK] Saved metrics to {metrics_path}")
+
+
+def main() -> None:
+ """Entry point."""
+ parser = argparse.ArgumentParser(description="Backtest KenPom FanMatch predictions")
+ parser.add_argument("--start", type=str, help="Start date (YYYY-MM-DD)", default=None)
+ parser.add_argument("--end", type=str, help="End date (YYYY-MM-DD)", default=None)
+ parser.add_argument(
+ "--season",
+ type=int,
+ help="Season year (e.g., 2026 for 2025-26 season)",
+ default=None,
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("analysis/kenpom_fanmatch_backtest.csv"),
+ help="Output CSV path for detailed results",
+ )
+
+ args = parser.parse_args()
+
+ # Validate inputs
+ if args.season is None and (args.start is None or args.end is None):
+ parser.error("Either --season or both --start and --end are required")
+
+ # Run async main
+ asyncio.run(main_async(args))
+
+
+if __name__ == "__main__":
+ main()
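The MAE/RMSE arithmetic the backtest relies on can be sketched standalone (hypothetical scores, pure stdlib, no pandas):

```python
# Hypothetical predictions vs. actual scores; mirrors the backtest's MAE/RMSE math.
preds = [72.0, 68.5, 80.0]
actuals = [70.0, 71.0, 77.0]

errors = [p - a for p, a in zip(preds, actuals)]
mae = sum(abs(e) for e in errors) / len(errors)           # mean absolute error
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5  # root mean squared error

print(round(mae, 2), round(rmse, 2))  # 2.5 2.53
```

RMSE is always ≥ MAE and penalizes large misses more heavily, which is why the script reports both.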
diff --git a/scripts/analysis/check_coverage.py b/scripts/analysis/check_coverage.py
new file mode 100644
index 000000000..bb9aa3e47
--- /dev/null
+++ b/scripts/analysis/check_coverage.py
@@ -0,0 +1,15 @@
+"""Quick script to check team mapping coverage."""
+
+from __future__ import annotations
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+df = read_parquet_df("data/staging/mappings/team_mapping.parquet")
+
+odds_api_coverage = (df["odds_api_name"] != "").sum()
+espn_coverage = (df["espn_name"] != "").sum()
+
+print("Team Mapping Coverage:")
+print(f" Total teams: {len(df)}")
+print(f" Odds API: {odds_api_coverage}/{len(df)} ({odds_api_coverage / len(df) * 100:.1f}%)")
+print(f" ESPN: {espn_coverage}/{len(df)} ({espn_coverage / len(df) * 100:.1f}%)")
diff --git a/scripts/analysis/check_date_ranges.py b/scripts/analysis/check_date_ranges.py
new file mode 100644
index 000000000..a9c7535da
--- /dev/null
+++ b/scripts/analysis/check_date_ranges.py
@@ -0,0 +1,59 @@
+#!/usr/bin/env python3
+"""Check date ranges of events with odds vs scores."""
+
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+db = OddsAPIDatabase("data/odds_api/odds_api.sqlite3")
+
+# Get all events
+events = pd.read_sql_query("SELECT event_id, commence_time, has_odds FROM events", db.conn)
+scored = pd.read_sql_query("SELECT event_id, completed FROM scores", db.conn)
+
+events["commence_date"] = pd.to_datetime(events["commence_time"], format="ISO8601").dt.date
+
+print("Events with odds:")
+odds_events = events[events["has_odds"] == 1]
+print(f" Date range: {odds_events['commence_date'].min()} to {odds_events['commence_date'].max()}")
+print(f" Count: {len(odds_events)}")
+
+print("")
+print("Events with scores:")
+scored_ids = set(scored["event_id"])
+scored_events = events[events["event_id"].isin(scored_ids)]
+date_min = scored_events["commence_date"].min()
+date_max = scored_events["commence_date"].max()
+print(f" Date range: {date_min} to {date_max}")
+print(f" Count: {len(scored_events)}")
+
+print("")
+print("Overlap:")
+overlap = odds_events[odds_events["event_id"].isin(scored_ids)]
+print(f" Events with BOTH: {len(overlap)}")
+if len(overlap) > 0:
+ print(f" Date range: {overlap['commence_date'].min()} to {overlap['commence_date'].max()}")
+else:
+ print(" [ISSUE] No overlap - odds and scores are for different games!")
+ print("")
+ print("This explains why line_features.parquet is empty:")
+ print("- Scores are for PAST games (already played)")
+ print("- Odds are for FUTURE games (not yet played)")
+ print("- The Odds API only provides odds for upcoming games (3-day lookback)")
+ print("")
+ print("SOLUTION: Need historical odds from odds_snapshots table")
+
+# Check odds_snapshots
+print("")
+print("Checking odds_snapshots table:")
+snapshots = pd.read_sql_query("SELECT COUNT(*) as cnt FROM odds_snapshots", db.conn)
+print(f" Snapshot records: {snapshots.iloc[0, 0]}")
+if snapshots.iloc[0, 0] > 0:
+ snapshot_dates = pd.read_sql_query(
+ "SELECT MIN(snapshot_date) as min_date, MAX(snapshot_date) as max_date FROM odds_snapshots",
+ db.conn,
+ )
+ print(f" Date range: {snapshot_dates.iloc[0, 0]} to {snapshot_dates.iloc[0, 1]}")
+else:
+ print(" [EMPTY] No historical snapshots available yet")
+ print(" Need to run archive_daily_odds.py to build historical data")
diff --git a/scripts/analysis/compare_opening_closing_lines.py b/scripts/analysis/compare_opening_closing_lines.py
new file mode 100644
index 000000000..5bb29430f
--- /dev/null
+++ b/scripts/analysis/compare_opening_closing_lines.py
@@ -0,0 +1,407 @@
+#!/usr/bin/env python3
+"""Compare opening lines vs closing lines for line movement analysis.
+
+Calculates CLV (Closing Line Value) metrics:
+- Spread movement (opening to closing)
+- Total movement (opening to closing)
+- Juice changes
+- Steam moves (significant line movements)
+
+Usage:
+    uv run python scripts/analysis/compare_opening_closing_lines.py --date 2026-02-05
+    uv run python scripts/analysis/compare_opening_closing_lines.py --date 2026-02-05 \\
+        --output data/clv_analysis.csv
+    uv run python scripts/analysis/compare_opening_closing_lines.py --date 2026-02-05 \\
+        --min-movement 1.0
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+from sqlalchemy import select
+
+from sports_betting_edge.adapters.database import (
+ OvertimeLineSnapshotDB,
+ OvertimeOpeningLineDB,
+ create_database_engine,
+ make_session_factory,
+)
+from sports_betting_edge.adapters.filesystem import write_csv
+
+logger = logging.getLogger(__name__)
+
+DEFAULT_DB_PATH = Path("data/source/overtime/overtime_lines.db")
+
+
+def get_opening_lines_for_date(
+ session, target_date: str
+) -> list[tuple[str, OvertimeOpeningLineDB]]:
+ """Get opening lines for games on a specific date.
+
+ Args:
+ session: SQLAlchemy session
+ target_date: Date string in format 'YYYY-MM-DD' or 'Thu Feb 5'
+
+ Returns:
+ List of (game_id, opening_line) tuples
+ """
+ # Convert various date formats to search pattern
+ # Examples: '2026-02-05', 'Thu Feb 5', 'Feb 5'
+ if len(target_date) == 10 and target_date.count("-") == 2:
+        # Format: YYYY-MM-DD -> abbreviated-month pattern, e.g. 'Feb 5'
+ dt = datetime.strptime(target_date, "%Y-%m-%d")
+ # Remove leading zero from day (Windows compatible)
+ day = str(dt.day) # e.g., '5' not '05'
+ month = dt.strftime("%b") # e.g., 'Feb'
+ date_pattern = f"%{month} {day}%" # e.g., 'Feb 5'
+ else:
+ date_pattern = f"%{target_date}%"
+
+ stmt = select(OvertimeOpeningLineDB).where(
+ OvertimeOpeningLineDB.game_date_str.like(date_pattern)
+ )
+ openings = session.execute(stmt).scalars().all()
+
+ return [(opening.game_id, opening) for opening in openings]
+
+
+def get_latest_snapshot(
+    session, game_id: str, total_threshold: float = 100.0
+) -> OvertimeLineSnapshotDB | None:
+    """Get the most recent PRE-GAME snapshot (closing line) for a game.
+
+    Filters out in-game live betting lines: snapshots with a spread below
+    0.5 or a total below total_threshold are treated as live lines and
+    skipped (live totals shrink toward 0 as a game progresses).
+
+    Args:
+        session: SQLAlchemy session
+        game_id: Unique game identifier
+        total_threshold: Min total to consider a valid pre-game line (default: 100.0)
+
+    Returns:
+        Latest pre-game snapshot or None if no valid snapshots exist
+    """
+ # Get all snapshots ordered by most recent first
+ stmt = (
+ select(OvertimeLineSnapshotDB)
+ .where(OvertimeLineSnapshotDB.game_id == game_id)
+ .order_by(OvertimeLineSnapshotDB.captured_at.desc())
+ )
+ snapshots = session.execute(stmt).scalars().all()
+
+ # Find first snapshot with realistic pre-game values
+ # Live betting lines typically have very low spreads/totals as game progresses
+ for snapshot in snapshots:
+ # Check if this looks like a pre-game line (not live betting)
+ spread_ok = (
+ snapshot.spread_magnitude is None or snapshot.spread_magnitude >= 0.5
+ ) # Allow any spread >= 0.5
+ total_ok = snapshot.total_points is None or snapshot.total_points >= total_threshold
+
+ if spread_ok and total_ok:
+ return snapshot
+
+ # If no valid snapshots, return None
+ return None
+
+
+def calculate_line_movement(
+ opening: OvertimeOpeningLineDB, closing: OvertimeLineSnapshotDB | None
+) -> dict:
+ """Calculate line movement from opening to closing.
+
+ Args:
+ opening: Opening line
+ closing: Closing line (latest snapshot) or None
+
+ Returns:
+ Dict with movement metrics
+ """
+ if closing is None:
+ return {
+ "game_id": opening.game_id,
+ "category": opening.category,
+ "away_team": opening.away_team,
+ "home_team": opening.home_team,
+ "game_date": opening.game_date_str,
+ "game_time": opening.game_time_str,
+ # Opening lines
+ "open_spread": opening.spread_magnitude,
+ "open_favorite": opening.favorite_team,
+ "open_fav_price": opening.spread_favorite_price,
+ "open_dog_price": opening.spread_underdog_price,
+ "open_total": opening.total_points,
+ "open_over_price": opening.total_over_price,
+ "open_under_price": opening.total_under_price,
+ # Closing lines
+ "close_spread": None,
+ "close_favorite": None,
+ "close_fav_price": None,
+ "close_dog_price": None,
+ "close_total": None,
+ "close_over_price": None,
+ "close_under_price": None,
+ # Movement
+ "spread_movement": None,
+ "total_movement": None,
+ "fav_juice_change": None,
+ "dog_juice_change": None,
+ "has_snapshots": False,
+ "opened_at": opening.opened_at,
+ "closed_at": None,
+ }
+
+    # Calculate movements (explicit None checks so a 0.0 pick'em line isn't dropped)
+    spread_move = None
+    if opening.spread_magnitude is not None and closing.spread_magnitude is not None:
+        spread_move = closing.spread_magnitude - opening.spread_magnitude
+
+    total_move = None
+    if opening.total_points is not None and closing.total_points is not None:
+        total_move = closing.total_points - opening.total_points
+
+    fav_juice_change = None
+    if opening.spread_favorite_price is not None and closing.spread_favorite_price is not None:
+        fav_juice_change = closing.spread_favorite_price - opening.spread_favorite_price
+
+    dog_juice_change = None
+    if opening.spread_underdog_price is not None and closing.spread_underdog_price is not None:
+        dog_juice_change = closing.spread_underdog_price - opening.spread_underdog_price
+
+ return {
+ "game_id": opening.game_id,
+ "category": opening.category,
+ "away_team": opening.away_team,
+ "home_team": opening.home_team,
+ "game_date": opening.game_date_str,
+ "game_time": opening.game_time_str,
+ # Opening lines
+ "open_spread": opening.spread_magnitude,
+ "open_favorite": opening.favorite_team,
+ "open_fav_price": opening.spread_favorite_price,
+ "open_dog_price": opening.spread_underdog_price,
+ "open_total": opening.total_points,
+ "open_over_price": opening.total_over_price,
+ "open_under_price": opening.total_under_price,
+ # Closing lines
+ "close_spread": closing.spread_magnitude,
+ "close_favorite": closing.favorite_team,
+ "close_fav_price": closing.spread_favorite_price,
+ "close_dog_price": closing.spread_underdog_price,
+ "close_total": closing.total_points,
+ "close_over_price": closing.total_over_price,
+ "close_under_price": closing.total_under_price,
+ # Movement
+ "spread_movement": spread_move,
+ "total_movement": total_move,
+ "fav_juice_change": fav_juice_change,
+ "dog_juice_change": dog_juice_change,
+ "has_snapshots": True,
+ "opened_at": opening.opened_at,
+ "closed_at": closing.captured_at,
+ }
+
+
+def compare_lines(db_path: Path | str, target_date: str, min_movement: float = 0.0) -> pd.DataFrame:
+ """Compare opening vs closing lines for a specific date.
+
+ Args:
+ db_path: Path to SQLite database
+ target_date: Date string (e.g., '2026-02-05' or 'Thu Feb 5')
+ min_movement: Minimum spread/total movement to include (default: 0.0 = all)
+
+ Returns:
+ DataFrame with line movement analysis
+ """
+ db_url = f"sqlite:///{db_path}"
+ engine = create_database_engine(db_url)
+ SessionFactory = make_session_factory(engine)
+
+ with SessionFactory() as session:
+ # Get all opening lines for the date
+ openings = get_opening_lines_for_date(session, target_date)
+ logger.info("Found %d opening lines for %s", len(openings), target_date)
+
+ if len(openings) == 0:
+ logger.warning("No opening lines found for date: %s", target_date)
+ return pd.DataFrame()
+
+ # Compare each opening to its closing line
+ movements = []
+ for game_id, opening in openings:
+ closing = get_latest_snapshot(session, game_id)
+ movement = calculate_line_movement(opening, closing)
+ movements.append(movement)
+
+ df = pd.DataFrame(movements)
+
+ # Filter by minimum movement if specified
+ if min_movement > 0:
+ df = df[
+ (df["spread_movement"].abs() >= min_movement)
+ | (df["total_movement"].abs() >= min_movement)
+ ].copy()
+
+ # Sort by biggest movements
+ df["total_abs_movement"] = (
+ df["spread_movement"].fillna(0).abs() + df["total_movement"].fillna(0).abs()
+ )
+ df = df.sort_values("total_abs_movement", ascending=False)
+ df = df.drop(columns=["total_abs_movement"])
+
+ logger.info("Calculated line movements for %d games", len(df))
+ return df
+
+
+def print_movement_summary(df: pd.DataFrame) -> None:
+ """Print summary statistics for line movements.
+
+ Args:
+ df: DataFrame with line movement data
+ """
+ if len(df) == 0:
+ print("\n[WARNING] No line movements to analyze")
+ return
+
+ print("\n=== Line Movement Summary ===")
+ print(f"Total games analyzed: {len(df)}")
+ print()
+
+ # Games with/without snapshots
+ has_snapshots = df["has_snapshots"].sum()
+ no_snapshots = len(df) - has_snapshots
+ print(f"Games with closing lines: {has_snapshots}")
+ print(f"Games without closing lines: {no_snapshots}")
+ print()
+
+ if has_snapshots == 0:
+ print("[WARNING] No closing lines available for comparison")
+ return
+
+ # Filter to games with snapshots for stats
+ moved = df[df["has_snapshots"]].copy()
+
+ # Spread movement stats
+ spread_moves = moved[moved["spread_movement"].notna()]
+ if len(spread_moves) > 0:
+ print("Spread Movement:")
+ print(f" Average: {spread_moves['spread_movement'].mean():+.2f} points")
+ print(f" Median: {spread_moves['spread_movement'].median():+.2f} points")
+ min_move = spread_moves["spread_movement"].min()
+ max_move = spread_moves["spread_movement"].max()
+ print(f" Range: {min_move:+.2f} to {max_move:+.2f}")
+        n_moved = (spread_moves["spread_movement"] != 0).sum()
+        print(f"  Games with movement: {n_moved} / {len(spread_moves)}")
+ print()
+
+ # Total movement stats
+ total_moves = moved[moved["total_movement"].notna()]
+ if len(total_moves) > 0:
+ print("Total Movement:")
+ print(f" Average: {total_moves['total_movement'].mean():+.2f} points")
+ print(f" Median: {total_moves['total_movement'].median():+.2f} points")
+ min_move = total_moves["total_movement"].min()
+ max_move = total_moves["total_movement"].max()
+ print(f" Range: {min_move:+.2f} to {max_move:+.2f}")
+        n_moved = (total_moves["total_movement"] != 0).sum()
+        print(f"  Games with movement: {n_moved} / {len(total_moves)}")
+ print()
+
+ # Biggest movers
+ print("Top 10 Spread Movements:")
+ print("-" * 100)
+    # Rank by absolute movement so large moves toward the underdog also surface
+    top_spread = moved.reindex(
+        moved["spread_movement"].abs().sort_values(ascending=False).index
+    ).head(10)[
+        [
+            "away_team",
+            "home_team",
+            "open_spread",
+            "close_spread",
+            "spread_movement",
+            "open_favorite",
+        ]
+    ]
+ for _, row in top_spread.iterrows():
+ teams = f"{row['away_team']:30s} @ {row['home_team']:30s}"
+ lines = f"{row['open_spread']:5.1f} -> {row['close_spread']:5.1f}"
+ move = f"({row['spread_movement']:+.1f})"
+ print(f" {teams} | {lines} {move}")
+
+ print()
+ print("Top 10 Total Movements:")
+ print("-" * 100)
+    top_total = moved.reindex(
+        moved["total_movement"].abs().sort_values(ascending=False).index
+    ).head(10)[["away_team", "home_team", "open_total", "close_total", "total_movement"]]
+ for _, row in top_total.iterrows():
+ print(
+ f" {row['away_team']:30s} @ {row['home_team']:30s} | "
+ f"{row['open_total']:5.1f} -> {row['close_total']:5.1f} ({row['total_movement']:+.1f})"
+ )
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="Compare opening vs closing lines for CLV analysis"
+ )
+ parser.add_argument(
+ "--date",
+ required=True,
+ help="Target date (e.g., '2026-02-05', 'Thu Feb 5', or 'Feb 5')",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=DEFAULT_DB_PATH,
+ help="Path to SQLite database (default: data/source/overtime/overtime_lines.db)",
+ )
+ parser.add_argument(
+ "--output",
+ "-o",
+ type=Path,
+ help="Output CSV file path (default: print summary only)",
+ )
+ parser.add_argument(
+ "--min-movement",
+ type=float,
+ default=0.0,
+ help="Minimum spread/total movement to include (default: 0.0 = all games)",
+ )
+ parser.add_argument("--verbose", "-v", action="store_true", help="Enable debug logging")
+
+ args = parser.parse_args()
+
+ # Configure logging
+ logging.basicConfig(
+ level=logging.DEBUG if args.verbose else logging.INFO,
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+ )
+
+ # Compare lines
+ df = compare_lines(
+ db_path=args.db_path,
+ target_date=args.date,
+ min_movement=args.min_movement,
+ )
+
+ if len(df) == 0:
+ logger.warning("No games found for date: %s", args.date)
+ return
+
+ # Output results
+ if args.output:
+ write_csv(df, args.output, index=False)
+ logger.info("Wrote %d line movements to %s", len(df), args.output)
+
+ # Print summary
+ print_movement_summary(df)
+
+
+if __name__ == "__main__":
+ main()
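The closing-line selection in `get_latest_snapshot` can be sketched without SQLAlchemy; the snapshots and threshold below are hypothetical:

```python
# Hypothetical snapshots, newest first. Mirrors the filtering idea: skip live
# in-game lines (implausibly small totals) and take the first plausible
# pre-game snapshot as the closing line.
snapshots = [
    {"captured_at": "2026-02-05T19:05", "total_points": 62.5},   # live, game in progress
    {"captured_at": "2026-02-05T18:55", "total_points": 143.5},  # pre-game closing line
    {"captured_at": "2026-02-05T09:00", "total_points": 145.0},  # earlier pre-game line
]
TOTAL_THRESHOLD = 100.0

closing = next(
    (
        s
        for s in snapshots
        if s["total_points"] is None or s["total_points"] >= TOTAL_THRESHOLD
    ),
    None,
)
print(closing["captured_at"] if closing else "no pre-game snapshot")
```

Because snapshots are already ordered newest-first, the first one that passes the filter is the latest pre-game line.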
diff --git a/scripts/analysis/compare_overtime_vs_kenpom.py b/scripts/analysis/compare_overtime_vs_kenpom.py
new file mode 100644
index 000000000..7fbf3cfe1
--- /dev/null
+++ b/scripts/analysis/compare_overtime_vs_kenpom.py
@@ -0,0 +1,673 @@
+"""Compare Overtime.ag odds vs KenPom FanMatch predictions.
+
+Loads Overtime closing lines and KenPom FanMatch predictions for a given
+date, matches them by team names, and compares predicted spreads/totals
+against each other and actual results.
+
+Usage:
+ uv run python scripts/analysis/compare_overtime_vs_kenpom.py --date 2026-02-08
+ uv run python scripts/analysis/compare_overtime_vs_kenpom.py --date 2026-02-09
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import glob
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.kenpom import KenPomAdapter
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+# ── Team name normalization for cross-source matching ──────────────────
+
+TEAM_ALIASES: dict[str, str] = {
+ # KenPom name -> Overtime/canonical name
+ "South Fla.": "South Florida",
+ "Penn St.": "Penn State",
+ "Ohio St.": "Ohio State",
+ "UNC Greensboro": "NC Greensboro",
+ "UNCG": "NC Greensboro",
+ "Wichita St.": "Wichita State",
+ "Charlotte": "Charlotte U",
+ "UCF": "Central Florida",
+ "Mich. St.": "Michigan State",
+ "N.C. State": "NC State",
+ "Ill.": "Illinois",
+}
+
+
+def normalize_team(name: str) -> str:
+ """Normalize team name for cross-source matching."""
+ if name in TEAM_ALIASES:
+ return TEAM_ALIASES[name]
+ # Strip common suffixes
+ for suffix in [
+ " Bulldogs",
+ " Tigers",
+ " Bears",
+ " Wolverines",
+ " Buckeyes",
+ " Red Raiders",
+ " Mountaineers",
+ " Paladins",
+ " Spartans",
+ " Shockers",
+ " Green Wave",
+ " 49ers",
+ " Bearcats",
+ " Knights",
+ " Golden Gophers",
+ " Terrapins",
+ " Blazers",
+ " Owls",
+ " Hawkeyes",
+ " Wildcats",
+ " Bulls",
+ " Golden Hurricane",
+ " Nittany Lions",
+ " Trojans",
+ " Rainbow Warriors",
+ " Tritons",
+ " Cougars",
+ " Gaels",
+ " Dons",
+ ]:
+ if name.endswith(suffix):
+ return name[: -len(suffix)].strip()
+ return name
+
+
+def fuzzy_match(name1: str, name2: str) -> bool:
+ """Check if two team names likely refer to the same team."""
+ n1 = normalize_team(name1).lower()
+ n2 = normalize_team(name2).lower()
+ # Exact match after normalization
+ if n1 == n2:
+ return True
+ # One contains the other
+ if n1 in n2 or n2 in n1:
+ return True
+ # Check key words overlap
+ words1 = set(n1.split())
+ words2 = set(n2.split())
+ overlap = words1 & words2
+ # If significant word overlap (excluding common words)
+ common = {"st", "state", "u", "university"}
+ meaningful = overlap - common
+ return len(meaningful) >= 1
+
+
+# ── Data loading ───────────────────────────────────────────────────────
+
+
+def load_overtime_closing_lines(target_date: str) -> pd.DataFrame:
+ """Load the last Overtime snapshot for each game on the target date.
+
+ For each game, finds the most recent snapshot that still has a
+ full-game line (before the game went live/in-progress). This gives
+ the closing line for each game.
+
+ Args:
+ target_date: Date in YYYY-MM-DD format
+
+ Returns:
+ DataFrame with one row per game (full-game closing lines)
+ """
+ pattern = (
+ f"data/source/overtime/api/college_basketball/college_basketball_{target_date}_*.parquet"
+ )
+ files = sorted(glob.glob(pattern))
+ if not files:
+ logger.warning(f"No Overtime files found for {target_date}")
+ return pd.DataFrame()
+
+ logger.info(f"Found {len(files)} Overtime snapshots for {target_date}")
+
+ # Read ALL snapshots and find last full-game line per game
+ closing_lines: dict[str, pd.Series] = {} # game_key -> last row
+
+ for f in files:
+ df = pd.read_parquet(f)
+ full_game = df[df["period_description"] == "Game"]
+ for _, row in full_game.iterrows():
+ game_key = f"{row['team1_id']}_{row['team2_id']}"
+ closing_lines[game_key] = row
+
+ if not closing_lines:
+ logger.warning("No full-game lines found in any snapshot")
+ return pd.DataFrame()
+
+ full_game = pd.DataFrame(closing_lines.values())
+ logger.info(f" Closing lines: {len(full_game)} games (per-game latest)")
+
+ # Rename for clarity
+ full_game = full_game.rename(
+ columns={
+ "team1_id": "ot_away_team",
+ "team2_id": "ot_home_team",
+ "spread_magnitude": "ot_spread_magnitude",
+ "favorite_team": "ot_favorite",
+ "total_points": "ot_total",
+ }
+ )
+
+ # Compute Overtime predicted margin (home perspective)
+ # Positive = home wins, negative = away wins
+ # If home is favored by 5, predicted margin = +5
+ # If away is favored by 5, predicted margin = -5
+ full_game["ot_predicted_margin"] = full_game.apply(
+ lambda r: r["ot_spread_magnitude"]
+ if fuzzy_match(r["ot_favorite"], r["ot_home_team"])
+ else -r["ot_spread_magnitude"],
+ axis=1,
+ )
+
+ return full_game[
+ [
+ "ot_away_team",
+ "ot_home_team",
+ "ot_spread_magnitude",
+ "ot_favorite",
+ "ot_total",
+ "ot_predicted_margin",
+ "game_datetime",
+ "captured_at",
+ ]
+ ]
+
+
+async def fetch_kenpom_fanmatch(target_date: str) -> pd.DataFrame:
+ """Fetch KenPom FanMatch predictions for a date.
+
+ Args:
+ target_date: Date in YYYY-MM-DD format
+
+ Returns:
+ DataFrame with KenPom predictions per game
+ """
+ kenpom = KenPomAdapter()
+ try:
+ games = await kenpom.get_fanmatch(target_date)
+ logger.info(f"KenPom FanMatch: {len(games)} games for {target_date}")
+ finally:
+ await kenpom.close()
+
+ if not games:
+ return pd.DataFrame()
+
+ rows = []
+ for g in games:
+ home = g.get("Home") or g.get("Team2")
+ visitor = g.get("Visitor") or g.get("Team1")
+ home_pred = g.get("HomePred")
+ visitor_pred = g.get("VisitorPred")
+
+ # Some API responses use different field names
+ if home_pred is None:
+ # Try alternate field names
+ predicted_score = g.get("PredictedScore", "")
+ if predicted_score and "-" in str(predicted_score):
+ parts = str(predicted_score).split("-")
+ try:
+ s1, s2 = float(parts[0]), float(parts[1])
+ winner = g.get("PredictedWinner", "")
+ if winner and fuzzy_match(winner, home or ""):
+ home_pred, visitor_pred = s1, s2
+ else:
+ home_pred, visitor_pred = s2, s1
+ except (ValueError, IndexError):
+ pass
+
+ kp_spread = None
+ kp_total = None
+ if home_pred is not None and visitor_pred is not None:
+ kp_spread = float(home_pred) - float(visitor_pred)
+ kp_total = float(home_pred) + float(visitor_pred)
+
+ rows.append(
+ {
+ "kp_home": home,
+ "kp_visitor": visitor,
+ "kp_home_pred": home_pred,
+ "kp_visitor_pred": visitor_pred,
+ "kp_home_spread": kp_spread,
+ "kp_total": kp_total,
+ "kp_home_wp": g.get("HomeWP"),
+ "kp_pred_tempo": g.get("PredTempo"),
+ "kp_thrill_score": g.get("ThrillScore"),
+ "kp_predicted_winner": g.get("PredictedWinner"),
+ "kp_predicted_mov": g.get("PredictedMOV"),
+ }
+ )
+
+ return pd.DataFrame(rows)
+
+
+def load_actual_results(
+ target_date: str, db_path: str = "data/odds_api/odds_api.sqlite3"
+) -> pd.DataFrame:
+ """Load actual game results from Odds API database.
+
+ Args:
+ target_date: Date in YYYY-MM-DD format
+ db_path: Path to Odds API SQLite database
+
+ Returns:
+ DataFrame with actual scores
+ """
+ db = OddsAPIDatabase(db_path)
+ query = """
+ SELECT
+ e.event_id,
+ e.home_team,
+ e.away_team,
+ e.commence_time,
+ s.home_score,
+ s.away_score
+ FROM events e
+ INNER JOIN scores s ON e.event_id = s.event_id
+ WHERE DATE(e.commence_time) = ?
+ AND s.completed = 1
+ AND s.home_score IS NOT NULL
+ AND e.sport_key = 'basketball_ncaab'
+ ORDER BY e.commence_time
+ """
+ df = pd.read_sql_query(query, db.conn, params=[target_date])
+ logger.info(f"Actual results: {len(df)} completed games for {target_date}")
+
+ df["actual_home_score"] = df["home_score"].astype(float)
+ df["actual_away_score"] = df["away_score"].astype(float)
+ df["actual_margin"] = df["actual_home_score"] - df["actual_away_score"]
+ df["actual_total"] = df["actual_home_score"] + df["actual_away_score"]
+
+ return df
+
+
+# ── Matching logic ─────────────────────────────────────────────────────
+
+
+def match_games(
+ overtime_df: pd.DataFrame,
+ kenpom_df: pd.DataFrame,
+ actuals_df: pd.DataFrame,
+) -> pd.DataFrame:
+ """Match games across Overtime, KenPom, and actual results.
+
+ Uses fuzzy team name matching to join the three sources.
+
+ Returns:
+ Merged DataFrame with all available data per game
+ """
+ matched = []
+
+ for _, ot in overtime_df.iterrows():
+ ot_home = ot["ot_home_team"]
+ ot_away = ot["ot_away_team"]
+
+ row: dict = {**ot.to_dict()}
+
+ # Match KenPom
+ kp_match = None
+ for _, kp in kenpom_df.iterrows():
+ if fuzzy_match(kp["kp_home"], ot_home) and fuzzy_match(kp["kp_visitor"], ot_away):
+ kp_match = kp
+ break
+            # Try reversed (sometimes home/away differ between sources)
+            if fuzzy_match(kp["kp_home"], ot_away) and fuzzy_match(kp["kp_visitor"], ot_home):
+                kp_match = kp
+                row["kp_home_flipped"] = True
+                break
+
+        if kp_match is not None:
+            for col in kenpom_df.columns:
+                row[col] = kp_match[col]
+            # Home/away were reversed between sources: flip the home-perspective
+            # spread and swap the score predictions (the total is symmetric)
+            if row.get("kp_home_flipped"):
+                if row.get("kp_home_spread") is not None:
+                    row["kp_home_spread"] = -row["kp_home_spread"]
+                row["kp_home_pred"], row["kp_visitor_pred"] = (
+                    row.get("kp_visitor_pred"),
+                    row.get("kp_home_pred"),
+                )
+ else:
+ logger.warning(f" No KenPom match for: {ot_away} @ {ot_home}")
+
+ # Match actual results
+ act_match = None
+ for _, act in actuals_df.iterrows():
+ espn_home = normalize_team(act["home_team"])
+ espn_away = normalize_team(act["away_team"])
+ if fuzzy_match(espn_home, ot_home) and fuzzy_match(espn_away, ot_away):
+ act_match = act
+ break
+
+ if act_match is not None:
+ row["actual_home_score"] = act_match["actual_home_score"]
+ row["actual_away_score"] = act_match["actual_away_score"]
+ row["actual_margin"] = act_match["actual_margin"]
+ row["actual_total"] = act_match["actual_total"]
+ else:
+ logger.warning(f" No actual result for: {ot_away} @ {ot_home}")
+
+ matched.append(row)
+
+ return pd.DataFrame(matched)
+
+
+# ── Analysis ───────────────────────────────────────────────────────────
+
+
+def analyze_comparison(df: pd.DataFrame) -> None:
+ """Print comparison analysis between Overtime and KenPom.
+
+ All spread/margin values use score-difference convention:
+ Positive = home team wins/favored
+ Negative = away team wins/favored
+ """
+ has_kp = df["kp_home_spread"].notna()
+ has_actual = df["actual_margin"].notna()
+ both = has_kp & has_actual
+
+ print("\n" + "=" * 110)
+ print("OVERTIME vs KENPOM COMPARISON")
+ print("=" * 110)
+ print("(Spread = absolute pts, favorite team shown. Totals = predicted combined score.)")
+
+ # ── Helper: format spread as "5.0 (TeamName)" ──
+ def fmt_spread(margin: float | None, home: str, away: str) -> str:
+ if margin is None or pd.isna(margin):
+ return "N/A"
+ magnitude = abs(margin)
+ fav = home if margin > 0 else away
+ # Shorten team name for display
+ short_fav = fav.split()[-1] if len(fav) > 10 else fav
+ return f"{magnitude:.1f} ({short_fav})"
+
+ def fmt_winner(margin: float | None, home: str, away: str) -> str:
+ if margin is None or pd.isna(margin):
+ return "N/A"
+ winner = home if margin > 0 else away
+ short = winner.split()[-1] if len(winner) > 10 else winner
+ return f"{abs(margin):.0f} ({short})"
+
+ # ── Game-by-game comparison ──
+ print(
+ f"\n{'Game':<30} {'OT Spread':<18} {'KP Spread':<18} "
+ f"{'OT Total':>9} {'KP Total':>9} {'Result':<18}"
+ )
+ print("-" * 110)
+
+ for _, r in df.iterrows():
+ home = r["ot_home_team"]
+ away = r["ot_away_team"]
+ game_label = f"{away} @ {home}"
+ ot_spread = fmt_spread(r.get("ot_predicted_margin"), home, away)
+ kp_spread = fmt_spread(r.get("kp_home_spread"), home, away)
+ ot_total = f"{r['ot_total']:.1f}" if pd.notna(r.get("ot_total")) else "N/A"
+ kp_total = f"{r['kp_total']:.1f}" if pd.notna(r.get("kp_total")) else "N/A"
+ actual = ""
+ if pd.notna(r.get("actual_margin")):
+ actual = fmt_winner(r["actual_margin"], home, away)
+ actual += f" [{r['actual_total']:.0f}]"
+ print(
+ f"{game_label:<30} {ot_spread:<18} {kp_spread:<18} "
+ f"{ot_total:>9} {kp_total:>9} {actual:<18}"
+ )
+
+ if not both.any():
+ print("\n[WARNING] No games with both KenPom predictions and actual results")
+ return
+
+ matched = df[both].copy()
+ n = len(matched)
+
+ # ── Spread/Margin comparison ──
+ print(f"\n{'=' * 80}")
+ print(f"MARGIN PREDICTION ACCURACY ({n} games)")
+ print(f"{'=' * 80}")
+
+ ot_margin_err = (matched["ot_predicted_margin"] - matched["actual_margin"]).abs()
+ kp_margin_err = (matched["kp_home_spread"] - matched["actual_margin"]).abs()
+
+ print(f"\n Overtime margin MAE: {ot_margin_err.mean():.2f} pts")
+ print(f" KenPom margin MAE: {kp_margin_err.mean():.2f} pts")
+ diff = ot_margin_err.mean() - kp_margin_err.mean()
+ winner = "KenPom" if diff > 0 else "Overtime"
+ print(f" -> {winner} closer by {abs(diff):.2f} pts on average")
+
+ # Bias (do they systematically over/under predict?)
+ ot_bias = (matched["ot_predicted_margin"] - matched["actual_margin"]).mean()
+ kp_bias = (matched["kp_home_spread"] - matched["actual_margin"]).mean()
+ print(f"\n Overtime margin bias: {ot_bias:+.2f} pts (predicted - actual)")
+ print(f" KenPom margin bias: {kp_bias:+.2f} pts (predicted - actual)")
+
+ # Who picked the right side (winner)?
+ ot_right_side = ((matched["ot_predicted_margin"] > 0) == (matched["actual_margin"] > 0)).mean()
+ kp_right_side = ((matched["kp_home_spread"] > 0) == (matched["actual_margin"] > 0)).mean()
+ print(f"\n Overtime correct winner: {ot_right_side:.1%}")
+ print(f" KenPom correct winner: {kp_right_side:.1%}")
+
+ # ATS: did the actual result cover the Overtime line?
+ # Cover = actual_margin - ot_predicted_margin
+ # Positive = home outperformed the line
+ cover = matched["actual_margin"] - matched["ot_predicted_margin"]
+ print("\n Games vs Overtime spread:")
+ print(f" Home covers: {(cover > 0).sum()} / {n}")
+ print(f" Away covers: {(cover < 0).sum()} / {n}")
+ print(f" Push: {(cover == 0).sum()} / {n}")
+
+ # Did KenPom predict the cover direction?
+ # If KP margin > OT margin, KP says home side; if < OT, KP says away side
+ kp_edge = matched["kp_home_spread"] - matched["ot_predicted_margin"]
+ kp_ats_correct = ((kp_edge > 0) & (cover > 0)) | ((kp_edge < 0) & (cover < 0))
+ print(f"\n KenPom ATS accuracy vs OT line: {kp_ats_correct.mean():.1%}")
+ print(" (When KP disagrees with OT, does actual follow KP's direction?)")
+
+ # ── Total comparison ──
+ print(f"\n{'=' * 80}")
+ print(f"TOTALS PREDICTION ACCURACY ({n} games)")
+ print(f"{'=' * 80}")
+
+ ot_total_error = (matched["ot_total"] - matched["actual_total"]).abs()
+ kp_total_error = (matched["kp_total"] - matched["actual_total"]).abs()
+
+ print(f"\n Overtime total MAE: {ot_total_error.mean():.2f} pts")
+ print(f" KenPom total MAE: {kp_total_error.mean():.2f} pts")
+ diff = ot_total_error.mean() - kp_total_error.mean()
+ winner = "KenPom" if diff > 0 else "Overtime"
+ print(f" -> {winner} closer by {abs(diff):.2f} pts on average")
+
+ ot_total_bias = (matched["ot_total"] - matched["actual_total"]).mean()
+ kp_total_bias = (matched["kp_total"] - matched["actual_total"]).mean()
+ print(f"\n Overtime total bias: {ot_total_bias:+.2f} pts (predicted - actual)")
+ print(f" KenPom total bias: {kp_total_bias:+.2f} pts (predicted - actual)")
+
+ # Over/Under accuracy
+ overs = (matched["actual_total"] > matched["ot_total"]).sum()
+ unders = (matched["actual_total"] < matched["ot_total"]).sum()
+ print(f"\n Actual vs OT line: {overs} overs, {unders} unders")
+
+ kp_predicted_over = (matched["kp_total"] > matched["ot_total"]).sum()
+ kp_predicted_under = (matched["kp_total"] < matched["ot_total"]).sum()
+ print(f" KenPom vs OT line: {kp_predicted_over} overs, {kp_predicted_under} unders")
+
+ # KenPom O/U accuracy
+ kp_ou_correct = (
+ (matched["kp_total"] > matched["ot_total"])
+ & (matched["actual_total"] > matched["ot_total"])
+ ) | (
+ (matched["kp_total"] < matched["ot_total"])
+ & (matched["actual_total"] < matched["ot_total"])
+ )
+ print(f" KenPom O/U accuracy vs OT: {kp_ou_correct.mean():.1%}")
+
+ # ── Disagreement analysis ──
+ print(f"\n{'=' * 80}")
+ print("DISAGREEMENT ANALYSIS")
+ print(f"{'=' * 80}")
+
+ margin_diff = matched["kp_home_spread"] - matched["ot_predicted_margin"]
+ total_diff = matched["kp_total"] - matched["ot_total"]
+
+ print(f"\n Avg margin disagreement: {margin_diff.abs().mean():.2f} pts")
+ print(f" Max margin disagreement: {margin_diff.abs().max():.2f} pts")
+ print(f" Avg total disagreement: {total_diff.abs().mean():.2f} pts")
+ print(f" Max total disagreement: {total_diff.abs().max():.2f} pts")
+
+ # Game-by-game comparison where they disagree
+ disagree_mask = margin_diff.abs() > 1.0
+ if disagree_mask.any():
+ disagree = matched[disagree_mask]
+ print(f"\n Games where KP & OT disagree by >1 pt ({disagree_mask.sum()}):")
+ for _, r in disagree.iterrows():
+ home = r["ot_home_team"]
+ away = r["ot_away_team"]
+ game = f"{away} @ {home}"
+ ot_mag = abs(r["ot_predicted_margin"])
+ ot_fav = home if r["ot_predicted_margin"] > 0 else away
+ kp_mag = abs(r["kp_home_spread"])
+ kp_fav = home if r["kp_home_spread"] > 0 else away
+ act_mag = abs(r["actual_margin"])
+ act_won = home if r["actual_margin"] > 0 else away
+ ot_err = abs(r["ot_predicted_margin"] - r["actual_margin"])
+ kp_err = abs(r["kp_home_spread"] - r["actual_margin"])
+ closer = "OT" if ot_err < kp_err else "KP"
+ print(
+ f" {game:<28} "
+ f"OT: {ot_mag:.1f} ({ot_fav[:8]}) "
+ f"KP: {kp_mag:.1f} ({kp_fav[:8]}) "
+ f"Result: {act_mag:.0f} ({act_won[:8]}) "
+ f"-> {closer} closer"
+ )
+
+ # ── Edge detection ──
+ print(f"\n{'=' * 80}")
+ print("EDGE DETECTION: Can disagreement predict accuracy?")
+ print(f"{'=' * 80}")
+
+ matched = matched.copy()
+ matched["margin_disagree"] = margin_diff.abs()
+ matched["total_disagree"] = total_diff.abs()
+ matched["ot_margin_err"] = ot_margin_err
+ matched["kp_margin_err"] = kp_margin_err
+ matched["ot_total_err"] = ot_total_error
+ matched["kp_total_err"] = kp_total_error
+
+ if n >= 4:
+ median_disagree = matched["margin_disagree"].median()
+ high = matched[matched["margin_disagree"] >= median_disagree]
+ low = matched[matched["margin_disagree"] < median_disagree]
+
+ print(f"\n High margin disagreement (>={median_disagree:.1f} pts): {len(high)} games")
+ if len(high) > 0:
+ print(f" OT margin MAE: {high['ot_margin_err'].mean():.2f}")
+ print(f" KP margin MAE: {high['kp_margin_err'].mean():.2f}")
+ kp_better = (high["kp_margin_err"] < high["ot_margin_err"]).mean()
+ print(f" KP closer: {kp_better:.1%}")
+
+ print(f"\n Low margin disagreement (<{median_disagree:.1f} pts): {len(low)} games")
+ if len(low) > 0:
+ print(f" OT margin MAE: {low['ot_margin_err'].mean():.2f}")
+ print(f" KP margin MAE: {low['kp_margin_err'].mean():.2f}")
+ kp_better = (low["kp_margin_err"] < low["ot_margin_err"]).mean()
+ print(f" KP closer: {kp_better:.1%}")
+
+ # ── Blended prediction ──
+ print(f"\n{'=' * 80}")
+ print("BLENDED PREDICTION (50/50 OT + KP)")
+ print(f"{'=' * 80}")
+
+ blend_margin = (matched["ot_predicted_margin"] + matched["kp_home_spread"]) / 2
+ blend_total = (matched["ot_total"] + matched["kp_total"]) / 2
+ blend_margin_err = (blend_margin - matched["actual_margin"]).abs()
+ blend_total_err = (blend_total - matched["actual_total"]).abs()
+
+ print(f"\n Blended margin MAE: {blend_margin_err.mean():.2f} pts")
+ print(f" vs Overtime alone: {ot_margin_err.mean():.2f}")
+ print(f" vs KenPom alone: {kp_margin_err.mean():.2f}")
+ print(f"\n Blended total MAE: {blend_total_err.mean():.2f} pts")
+ print(f" vs Overtime alone: {ot_total_error.mean():.2f}")
+ print(f" vs KenPom alone: {kp_total_error.mean():.2f}")
+
+ # ── Summary ──
+ print(f"\n{'=' * 80}")
+ print("SUMMARY")
+ print(f"{'=' * 80}")
+ print(f"\n Games analyzed: {n}")
+ print(f" {'Source':<20} {'Margin MAE':>12} {'Total MAE':>12} {'Winner %':>10}")
+ print(f" {'-' * 56}")
+ print(
+ f" {'Overtime (market)':<20} {ot_margin_err.mean():>12.2f} "
+ f"{ot_total_error.mean():>12.2f} {ot_right_side:>10.1%}"
+ )
+ print(
+ f" {'KenPom (model)':<20} {kp_margin_err.mean():>12.2f} "
+ f"{kp_total_error.mean():>12.2f} {kp_right_side:>10.1%}"
+ )
+ print(
+ f" {'Blend (50/50)':<20} {blend_margin_err.mean():>12.2f} {blend_total_err.mean():>12.2f}"
+ )
+ print()
+
+
+# ── Main ───────────────────────────────────────────────────────────────
+
+
+async def main_async(target_date: str, output_path: Path | None) -> None:
+ """Run the comparison analysis."""
+ logger.info(f"[OK] Comparing Overtime vs KenPom for {target_date}\n")
+
+ # Load data from all three sources
+ overtime_df = load_overtime_closing_lines(target_date)
+ if overtime_df.empty:
+ logger.error("No Overtime data found. Exiting.")
+ return
+
+ kenpom_df = await fetch_kenpom_fanmatch(target_date)
+ if kenpom_df.empty:
+ logger.warning("No KenPom FanMatch data found.")
+
+ actuals_df = load_actual_results(target_date)
+
+ # Match and merge
+ merged = match_games(overtime_df, kenpom_df, actuals_df)
+
+ # Print analysis
+ analyze_comparison(merged)
+
+ # Save output
+ if output_path:
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ merged.to_csv(output_path, index=False)
+ logger.info(f"[OK] Saved comparison to {output_path}")
+
+
+def main() -> None:
+ """Entry point."""
+ parser = argparse.ArgumentParser(
+ description="Compare Overtime.ag odds vs KenPom FanMatch predictions"
+ )
+ parser.add_argument(
+ "--date",
+ type=str,
+ required=True,
+ help="Target date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=None,
+        help="Output CSV path (default: data/reports/ot_vs_kp_<DATE>.csv)",
+ )
+ args = parser.parse_args()
+
+ if args.output is None:
+ args.output = Path(f"data/reports/ot_vs_kp_{args.date}.csv")
+
+ asyncio.run(main_async(args.date, args.output))
+
+
+if __name__ == "__main__":
+ main()
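The margin-MAE and winner-accuracy math in `analyze_comparison` above can be sketched standalone on toy data (hypothetical margins; home-positive sign convention as in the script):

```python
import pandas as pd

# Toy frame mirroring the matched columns in analyze_comparison;
# positive margins mean the home team is favored (or won).
df = pd.DataFrame(
    {
        "ot_predicted_margin": [3.5, -2.0, 7.0],
        "kp_home_spread": [5.0, -1.0, 4.0],
        "actual_margin": [6.0, -4.0, 2.0],
    }
)

# Mean absolute error of each source's predicted margin.
ot_mae = (df["ot_predicted_margin"] - df["actual_margin"]).abs().mean()
kp_mae = (df["kp_home_spread"] - df["actual_margin"]).abs().mean()

# Winner accuracy: did the predicted sign match the actual sign?
ot_right = ((df["ot_predicted_margin"] > 0) == (df["actual_margin"] > 0)).mean()

print(f"OT MAE {ot_mae:.2f} | KP MAE {kp_mae:.2f} | OT winner {ot_right:.0%}")
```

Comparing the two boolean Series elementwise and taking `.mean()` is what makes the "correct winner" percentage a one-liner in the report code.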
diff --git a/scripts/analysis/compare_predictions.py b/scripts/analysis/compare_predictions.py
new file mode 100644
index 000000000..f3b665546
--- /dev/null
+++ b/scripts/analysis/compare_predictions.py
@@ -0,0 +1,214 @@
+"""Compare KenPom and ML model predictions."""
+
+from __future__ import annotations
+
+import numpy as np
+import pandas as pd
+
+
+def main() -> None:
+ """Compare KenPom score predictions with ML model predictions."""
+ # Load both prediction sets
+ kenpom = pd.read_csv("data/outputs/predictions/scores_2026-02-05.csv")
+ ml_model = pd.read_csv("data/outputs/predictions/2026-02-05_retrained.csv")
+
+ print("=" * 100)
+ print("KENPOM vs ML MODEL COMPARISON")
+ print("=" * 100)
+ print()
+
+ # Merge on home_team and away_team
+ merged = ml_model.merge(
+ kenpom[
+ [
+ "home_team",
+ "away_team",
+ "predicted_home_score",
+ "predicted_away_score",
+ "predicted_total",
+ "predicted_margin",
+ "agrees_with_spread",
+ ]
+ ],
+ on=["home_team", "away_team"],
+ how="inner",
+ suffixes=("_ml", "_kenpom"),
+ )
+
+ print(f"Games in comparison: {len(merged)}")
+ print()
+
+ # 1. SPREAD AGREEMENT
+ print("=" * 100)
+ print("1. SPREAD PREDICTION AGREEMENT")
+ print("=" * 100)
+ print()
+
+ # ML model uses favorite_cover_prob > 0.5 as prediction
+ merged["ml_predicts_favorite"] = merged["favorite_cover_prob"] > 0.5
+ # KenPom column already named correctly after merge
+ merged["kenpom_predicts_favorite"] = merged["agrees_with_spread"]
+
+ ml_agree_count = merged["ml_predicts_favorite"].sum()
+ kenpom_agree_count = merged["kenpom_predicts_favorite"].sum()
+
+ ml_pct = ml_agree_count / len(merged) * 100
+ kp_pct = kenpom_agree_count / len(merged) * 100
+ print(f"ML Model agrees with spread: {ml_agree_count} / {len(merged)} ({ml_pct:.1f}%)")
+ print(f"KenPom agrees with spread: {kenpom_agree_count} / {len(merged)} ({kp_pct:.1f}%)")
+ print()
+
+ # Games where they disagree
+ disagreements = merged[merged["ml_predicts_favorite"] != merged["kenpom_predicts_favorite"]]
+ print(f"Games where ML and KenPom DISAGREE on winner: {len(disagreements)}")
+ if len(disagreements) > 0:
+ print()
+ for _, game in disagreements.iterrows():
+ print(f" {game['away_team']} @ {game['home_team']}")
+ print(f" Favorite: {game['favorite_team']} (-{game['spread_magnitude']})")
+ print(f" ML Model: Favorite cover prob = {game['favorite_cover_prob']:.1%}")
+ print(f" KenPom: Agrees = {game['kenpom_predicts_favorite']}")
+ print()
+
+ # 2. TOTAL PREDICTIONS
+ print("=" * 100)
+ print("2. TOTAL PREDICTIONS COMPARISON")
+ print("=" * 100)
+ print()
+
+ print("Average Predicted Totals:")
+ print(f" Betting Lines: {merged['total_points'].mean():.1f} points")
+ print(f" KenPom: {merged['predicted_total'].mean():.1f} points")
+ diff_mean = (merged["predicted_total"] - merged["total_points"]).mean()
+ print(f" Difference (KenPom - Betting Line): {diff_mean:.1f} points")
+ print()
+
+ # Show games with biggest total disagreement
+ merged["total_diff_models"] = merged["predicted_total"] - merged["total_points"]
+ print("Biggest Disagreements (KenPom vs Betting Line):")
+ print()
+ print("KenPom predicts HIGHER totals:")
+ top_higher = merged.nlargest(5, "total_diff_models")
+ for _, game in top_higher.iterrows():
+ print(f" {game['away_team']} @ {game['home_team']}")
+ ln, kp, diff = game["total_points"], game["predicted_total"], game["total_diff_models"]
+ print(f" Betting Line: {ln:.1f} | KenPom: {kp:.1f} | Diff: +{diff:.1f}")
+
+ print()
+ print("KenPom predicts LOWER totals:")
+ top_lower = merged.nsmallest(5, "total_diff_models")
+ for _, game in top_lower.iterrows():
+ print(f" {game['away_team']} @ {game['home_team']}")
+ ln, kp, diff = game["total_points"], game["predicted_total"], game["total_diff_models"]
+ print(f" Betting Line: {ln:.1f} | KenPom: {kp:.1f} | Diff: {diff:.1f}")
+
+ print()
+
+ # 3. OVER/UNDER PREDICTIONS
+ print("=" * 100)
+ print("3. OVER/UNDER PREDICTIONS")
+ print("=" * 100)
+ print()
+
+ # ML model predicts over if over_prob > 0.5
+ merged["ml_predicts_over"] = merged["over_prob"] > 0.5
+ merged["kenpom_predicts_over"] = merged["predicted_total"] > merged["total_points"]
+
+ ml_over_count = merged["ml_predicts_over"].sum()
+ kenpom_over_count = merged["kenpom_predicts_over"].sum()
+
+ ml_over_pct = ml_over_count / len(merged) * 100
+ kp_over_pct = kenpom_over_count / len(merged) * 100
+ print(f"ML Model predicts OVER: {ml_over_count} / {len(merged)} ({ml_over_pct:.1f}%)")
+ print(f"KenPom predicts OVER: {kenpom_over_count} / {len(merged)} ({kp_over_pct:.1f}%)")
+ print()
+
+ # Agreement on over/under
+ ou_agreement = (merged["ml_predicts_over"] == merged["kenpom_predicts_over"]).sum()
+ ou_pct = ou_agreement / len(merged) * 100
+ print(f"Models agree on Over/Under: {ou_agreement} / {len(merged)} ({ou_pct:.1f}%)")
+ print()
+
+ # Games where they disagree on O/U
+ ou_disagreements = merged[merged["ml_predicts_over"] != merged["kenpom_predicts_over"]]
+ print(f"Games where models DISAGREE on Over/Under: {len(ou_disagreements)}")
+ if len(ou_disagreements) > 0:
+ print()
+ for _, game in ou_disagreements.head(10).iterrows():
+ print(f" {game['away_team']} @ {game['home_team']}")
+ print(f" Betting Total: {game['total_points']:.1f}")
+ print(f" ML Model: Over prob = {game['over_prob']:.1%}")
+ print(f" KenPom Total: {game['predicted_total']:.1f}")
+ ml_pick = "OVER" if game["ml_predicts_over"] else "UNDER"
+ kenpom_pick = "OVER" if game["kenpom_predicts_over"] else "UNDER"
+ print(f" ML picks {ml_pick}, KenPom picks {kenpom_pick}")
+ print()
+
+ # 4. MODEL CONFIDENCE
+ print("=" * 100)
+ print("4. MODEL CONFIDENCE ANALYSIS")
+ print("=" * 100)
+ print()
+
+ print("ML Model Confidence (probability distributions):")
+ print(f" Avg Favorite Cover Prob: {merged['favorite_cover_prob'].mean():.1%}")
+ print(f" Avg Over Prob: {merged['over_prob'].mean():.1%}")
+ print(f" Std Dev Favorite Prob: {merged['favorite_cover_prob'].std():.3f}")
+ print(f" Most Confident Pick: {merged['favorite_cover_prob'].max():.1%}")
+ print(f" Least Confident Pick: {merged['favorite_cover_prob'].min():.1%}")
+ print()
+
+ print("KenPom Model (deterministic):")
+ print(f" Avg Margin: {merged['predicted_margin'].abs().mean():.1f} points")
+ print(f" Std Dev Margin: {merged['predicted_margin'].std():.1f}")
+ print(f" Largest Margin: {merged['predicted_margin'].abs().max():.1f} points")
+ print()
+
+ # 5. RECOMMENDED PICKS
+ print("=" * 100)
+ print("5. RECOMMENDED PICKS (Where Both Models Agree)")
+ print("=" * 100)
+ print()
+
+ # High confidence ML picks that KenPom also agrees with
+ high_conf_spread = merged[
+ (merged["ml_predicts_favorite"] == merged["kenpom_predicts_favorite"])
+ & (merged["favorite_cover_prob"] > 0.65)
+ ].sort_values("favorite_cover_prob", ascending=False)
+
+ print("SPREAD PICKS (Both models agree, ML confidence > 65%):")
+ if len(high_conf_spread) == 0:
+ print(" None found")
+ else:
+ for _, game in high_conf_spread.head(5).iterrows():
+ pick = game["favorite_team"] if game["ml_predicts_favorite"] else game["underdog_team"]
+ print(f" {game['away_team']} @ {game['home_team']}")
+ print(f" Pick: {pick}")
+ print(f" ML Confidence: {game['favorite_cover_prob']:.1%}")
+ kp_m = abs(game["predicted_margin"])
+ sp = game["spread_magnitude"]
+ print(f" KenPom Margin: {kp_m:.1f} vs Spread: {sp:.1f}")
+ print()
+
+ # Over/Under picks where both agree
+ ou_consensus = merged[merged["ml_predicts_over"] == merged["kenpom_predicts_over"]].copy()
+ ou_consensus["ml_ou_confidence"] = np.maximum(
+ ou_consensus["over_prob"], ou_consensus["under_prob"]
+ )
+
+ print("OVER/UNDER PICKS (Both models agree):")
+ for _, game in ou_consensus.nlargest(5, "ml_ou_confidence").iterrows():
+ pick = "OVER" if game["ml_predicts_over"] else "UNDER"
+ print(f" {game['away_team']} @ {game['home_team']}")
+ print(f" Pick: {pick} {game['total_points']:.1f}")
+ print(f" ML Confidence: {game['ml_ou_confidence']:.1%}")
+ kp_tot = game["predicted_total"]
+ line = game["total_points"]
+ print(f" KenPom Total: {kp_tot:.1f} vs Line: {line:.1f}")
+ print()
+
+ print("=" * 100)
+
+
+if __name__ == "__main__":
+ main()
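The consensus-pick filter at the end of `compare_predictions.py` reduces to a boolean-equality mask plus `np.maximum`; a minimal sketch with made-up probabilities:

```python
import numpy as np
import pandas as pd

# Hypothetical model probabilities for three games.
df = pd.DataFrame(
    {
        "over_prob": [0.70, 0.40, 0.55],
        "under_prob": [0.30, 0.60, 0.45],
        "kenpom_predicts_over": [True, False, False],
    }
)
df["ml_predicts_over"] = df["over_prob"] > 0.5

# Keep games where both models pick the same O/U side, then rank by
# the ML model's confidence on its chosen side.
consensus = df[df["ml_predicts_over"] == df["kenpom_predicts_over"]].copy()
consensus["ml_ou_confidence"] = np.maximum(
    consensus["over_prob"], consensus["under_prob"]
)
print(consensus["ml_ou_confidence"].tolist())  # [0.7, 0.6]
```

Only the third game is dropped: the ML model leans OVER (0.55) while the toy KenPom flag says UNDER.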
diff --git a/scripts/analysis/diagnose_score_models.py b/scripts/analysis/diagnose_score_models.py
new file mode 100644
index 000000000..d2ecbedc4
--- /dev/null
+++ b/scripts/analysis/diagnose_score_models.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python3
+"""Diagnostic script for score model performance investigation.
+
+Checks:
+1. Data availability (events, line features, team ratings)
+2. Feature coverage and completeness
+3. Model file existence and sizes
+4. Training data date ranges
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+
+def main() -> None:
+ """Run diagnostics."""
+ print("=" * 80)
+ print("SCORE MODEL DIAGNOSTICS")
+ print("=" * 80)
+
+ # Check staging data
+ print("\n1. STAGING DATA STATUS")
+ print("-" * 80)
+
+ events = read_parquet_df("data/staging/events.parquet")
+ line_features = read_parquet_df("data/staging/line_features.parquet")
+ team_ratings = read_parquet_df("data/staging/team_ratings.parquet")
+
+ print(f"Events: {len(events):5d} games")
+ print(f"With scores: {events['home_score'].notna().sum():5d} games")
+ print(f"Date range: {events['game_date'].min()} to {events['game_date'].max()}")
+ print(f"\nLine features: {len(line_features):5d} games")
+ print(f"Team ratings: {len(team_ratings):5d} teams")
+
+ if len(line_features) == 0:
+ print("\n[ERROR] Line features table is EMPTY!")
+ print("This prevents models from training properly.")
+ print("Fix: Run 'uv run python scripts/processing/consolidate_staging.py --force'")
+
+ # Check feature completeness
+ print("\n2. FEATURE COVERAGE")
+ print("-" * 80)
+
+ if len(line_features) > 0:
+ # Merge to check coverage
+ merged = events.merge(line_features, on="event_id", how="inner")
+ coverage = len(merged) / len(events) * 100
+ print(f"Games with both scores AND line features: {len(merged)} ({coverage:.1f}%)")
+
+ if coverage < 80:
+ print(f"\n[WARNING] Only {coverage:.1f}% of games have complete features")
+ print("Recommend: Collect more line data or use subset with complete data")
+ else:
+ print("Cannot compute coverage - line features table empty")
+
+ # Check model files
+ print("\n3. MODEL FILES")
+ print("-" * 80)
+
+ model_dir = Path("models")
+ home_model = model_dir / "home_score_2026.pkl"
+ away_model = model_dir / "away_score_2026.pkl"
+ features_file = model_dir / "score_features.txt"
+
+ if home_model.exists():
+ size_mb = home_model.stat().st_size / 1024 / 1024
+ print(f"Home model: {home_model} ({size_mb:.2f} MB)")
+ else:
+ print("[ERROR] Home model not found!")
+
+ if away_model.exists():
+ size_mb = away_model.stat().st_size / 1024 / 1024
+ print(f"Away model: {away_model} ({size_mb:.2f} MB)")
+ else:
+ print("[ERROR] Away model not found!")
+
+ if features_file.exists():
+ features = features_file.read_text().strip().split("\n")
+ print(f"Features: {len(features)} features")
+
+ # Check for line features
+ line_feats = [f for f in features if "total" in f.lower() or "line" in f.lower()]
+ if line_feats:
+ print(f"Line-related: {line_feats}")
+ else:
+ print("[WARNING] No line-related features found in feature set")
+ else:
+ print("[ERROR] Features file not found!")
+
+ # Summary
+ print("\n" + "=" * 80)
+ print("SUMMARY")
+ print("=" * 80)
+
+ issues = []
+ if len(line_features) == 0:
+ issues.append("Line features table is EMPTY (CRITICAL)")
+ if len(team_ratings) == 0:
+ issues.append("Team ratings table is EMPTY (CRITICAL)")
+ if len(events) < 600:
+ issues.append(f"Only {len(events)} games (recommend 600+)")
+
+ if issues:
+ print("\n[ISSUES FOUND]")
+ for i, issue in enumerate(issues, 1):
+ print(f"{i}. {issue}")
+ else:
+ print("\n[OK] All checks passed")
+
+ print("\nNext steps:")
+ if len(line_features) == 0:
+ print(" 1. Run: uv run python scripts/processing/consolidate_staging.py --force")
+ print(
+ ' 2. Verify line features: uv run python -c "from '
+ "sports_betting_edge.adapters.filesystem import read_parquet_df; "
+ "print(len(read_parquet_df('data/staging/line_features.parquet')))\""
+ )
+ print(" 3. Retrain models with complete data")
+
+
+if __name__ == "__main__":
+ main()
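The coverage check in section 2 of the diagnostics hinges on an inner merge dropping events without line features; a toy illustration (hypothetical `event_id` values):

```python
import pandas as pd

# Hypothetical staging tables: three events, line features for two of them.
events = pd.DataFrame({"event_id": [1, 2, 3], "home_score": [70, 81, 65]})
line_features = pd.DataFrame({"event_id": [1, 3], "total_line": [140.5, 150.0]})

# An inner merge keeps only events that also have line features,
# so its row count directly measures feature coverage.
merged = events.merge(line_features, on="event_id", how="inner")
coverage = len(merged) / len(events) * 100
print(f"{len(merged)}/{len(events)} events covered ({coverage:.1f}%)")
```

With an empty `line_features` frame the merge yields zero rows, which is why the script special-cases that before dividing.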
diff --git a/scripts/analysis/game_day_outlook.py b/scripts/analysis/game_day_outlook.py
new file mode 100644
index 000000000..dcf011de9
--- /dev/null
+++ b/scripts/analysis/game_day_outlook.py
@@ -0,0 +1,274 @@
+"""Game day outlook: merge Overtime.ag live odds with model predictions.
+
+Usage:
+ uv run python scripts/analysis/game_day_outlook.py --date 2026-02-09
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+import pandas as pd
+
+
+def load_overtime_odds(date_str: str) -> pd.DataFrame:
+ """Load latest OT snapshots for both College Basketball and College Extra."""
+ frames = []
+ for sub in ["college_basketball", "college_extra"]:
+ d = Path(f"data/source/overtime/api/{sub}")
+ files = sorted(d.glob(f"{sub}_{date_str}*.parquet"))
+ if files:
+ df = pd.read_parquet(files[-1])
+ frames.append(df[df["period_number"] == 0])
+
+ if not frames:
+ return pd.DataFrame()
+ return pd.concat(frames, ignore_index=True)
+
+
+ALIASES: dict[str, str] = {
+ "coll of charleston": "charleston",
+ "st johns": "st john",
+ "nc wilmington": "unc wilmington",
+ "st francis pa": "st francis",
+ "ark pine bluff": "arkansas pine bluff",
+ "prairie view a&m": "prairie view",
+ "bethune cookman": "bethune cookman",
+ "se louisiana": "se louisiana",
+ "mississippi valley state": "miss valley",
+ "mcneese state": "mcneese",
+ "nicholls state": "nicholls",
+ "texas a&m corpus": "texas am cc",
+ "east texas a&m": "east texas",
+ "northwestern state": "northwestern st",
+ "stephen f austin": "stephen f austin",
+ "murray state": "murray st",
+ "southern illinois": "southern illinois",
+ "indiana state": "indiana st",
+ "illinois state": "illinois st",
+ "jackson state": "jackson st",
+ "delaware state": "delaware st",
+ "alabama a&m": "alabama am",
+ "alabama state": "alabama st",
+ "north carolina central": "north carolina central",
+ "northern iowa": "northern iowa",
+ "incarnate word": "incarnate word",
+ "houston christian": "houston christian",
+ "ut rio grande valley": "ut rio grande valley",
+}
+
+
+def norm(name: str) -> str:
+ n = name.lower().replace(".", "").replace("'", "").replace("-", " ").strip()
+ for old, new in ALIASES.items():
+ if old in n:
+ return new
+ return n
+
+
+def merge_odds_and_predictions(ot: pd.DataFrame, preds: pd.DataFrame) -> list[dict]:
+ """Match OT odds to predictions by home team name."""
+ ot_lookup: dict[str, pd.Series] = {}
+ for _, row in ot.iterrows():
+ ot_lookup[norm(row["team2_id"])] = row
+
+ results = []
+ for _, p in preds.iterrows():
+ pn = norm(p["home_team"])
+ found = None
+ for ok, ov in ot_lookup.items():
+ if pn == ok or (len(pn) > 4 and len(ok) > 4 and (pn[:6] in ok or ok[:6] in pn)):
+ found = ov
+ break
+
+ pm = float(p["predicted_margin"])
+ fp = int(str(p["favorite_cover_prob"]).replace("%", ""))
+ dp = int(str(p["underdog_cover_prob"]).replace("%", ""))
+ op = int(str(p["over_prob"]).replace("%", ""))
+ up = int(str(p["under_prob"]).replace("%", ""))
+
+ r: dict = {
+ "home": p["home_team"],
+ "away": p["away_team"],
+ "pred_margin": pm,
+ "pred_total": float(p["predicted_total"]),
+ "fav": p["favorite_team"],
+ "mkt_spread": float(p["spread_magnitude"]),
+ "mkt_total": float(p["total_points"]),
+ "fp": fp,
+ "dp": dp,
+ "op": op,
+ "up": up,
+ "ot_spread": None,
+ "ot_fav": None,
+ "ot_total": None,
+ "ot_ml_home": None,
+ "ot_ml_away": None,
+ "game_time": None,
+ }
+
+ if found is not None:
+ r["ot_spread"] = found["spread_magnitude"]
+ r["ot_fav"] = found["favorite_team"]
+ r["ot_total"] = found["total_points"]
+ if pd.notna(found.get("moneyline2")):
+ r["ot_ml_home"] = int(found["moneyline2"])
+ if pd.notna(found.get("moneyline1")):
+ r["ot_ml_away"] = int(found["moneyline1"])
+ if pd.notna(found.get("game_datetime")):
+ r["game_time"] = str(found["game_datetime"])[:16]
+
+ results.append(r)
+
+ results.sort(key=lambda x: x["game_time"] or "9999")
+ return results
+
+
+def print_report(results: list[dict], date_str: str) -> None:
+ """Print formatted outlook report."""
+ print()
+ print("=" * 125)
+ print(f" {date_str} CBB GAME DAY OUTLOOK (Overtime.ag Live Odds + Model Predictions)")
+ print("=" * 125)
+
+ # Strong edges
+ strong_ats = []
+ strong_ou = []
+
+ for r in results:
+ ats_conf = max(r["fp"], r["dp"])
+ ou_conf = max(r["op"], r["up"])
+
+ if ats_conf >= 62:
+ if r["dp"] > r["fp"]:
+ pick = f"{r['away'][:20]} +{r['mkt_spread']}"
+ else:
+ pick = f"{r['fav'][:20]} cover"
+ strong_ats.append((r, pick, ats_conf))
+
+ if ou_conf >= 62:
+            line = r["ot_total"] if pd.notna(r["ot_total"]) else r["mkt_total"]
+ pick = f"OVER {line}" if r["op"] > r["up"] else f"UNDER {line}"
+ strong_ou.append((r, pick, ou_conf))
+
+ print()
+ print(f" STRONG SPREAD EDGES (>=62%) - {len(strong_ats)} picks")
+ print(" " + "-" * 95)
+ for r, pick, conf in sorted(strong_ats, key=lambda x: -x[2]):
+ ot_sp = ""
+ if r["ot_spread"] is not None and r["ot_fav"]:
+ ot_sp = f"{r['ot_fav'][:12]} -{r['ot_spread']:.1f}"
+ home_s = r["home"][:18]
+ away_s = r["away"][:18]
+ ml_str = ""
+ if r["dp"] > r["fp"] and r["ot_ml_away"]:
+ ml_str = f" (ML {r['ot_ml_away']:+d})"
+ elif r["fp"] > r["dp"] and r["ot_ml_home"]:
+ ml_str = f" (ML {r['ot_ml_home']:+d})"
+ print(
+ f" {home_s:<18s} vs {away_s:<18s}"
+ f" | OT: {ot_sp:<18s}"
+ f" | PICK: {pick:<28s} {conf}%{ml_str}"
+ )
+
+ print()
+ print(f" STRONG TOTALS EDGES (>=62%) - {len(strong_ou)} picks")
+ print(" " + "-" * 95)
+ for r, pick, conf in sorted(strong_ou, key=lambda x: -x[2]):
+ pred_t = r["pred_total"]
+        line = r["ot_total"] if pd.notna(r["ot_total"]) else r["mkt_total"]
+ diff = pred_t - line
+ home_s = r["home"][:18]
+ away_s = r["away"][:18]
+ print(
+ f" {home_s:<18s} vs {away_s:<18s}"
+ f" | Line: {line:<6.1f} Model: {pred_t:<6.1f} ({diff:+.1f})"
+ f" | PICK: {pick:<18s} {conf}%"
+ )
+
+ # Full game list
+ print()
+ print(" ALL GAMES")
+ print(" " + "-" * 118)
+ print(
+ f" {'Game':<42s} {'OT Spread':>14s} {'OT Tot':>7s}"
+ f" {'ML':>12s} {'Model':>7s} {'ATS':>12s} {'O/U':>12s}"
+ )
+ print(" " + "-" * 118)
+
+ for r in results:
+ home_s = r["home"][:18]
+ away_s = r["away"][:18]
+ game = f"{home_s:<18s} vs {away_s:<18s}"
+
+ # OT spread
+ if r["ot_spread"] is not None and r["ot_fav"]:
+ fav_is_home = r["ot_fav"].lower()[:5] in r["home"].lower()[:8]
+ ot_sp = f"-{r['ot_spread']:.1f}" if fav_is_home else f"+{r['ot_spread']:.1f}"
+ else:
+ fav_is_home = r["fav"] == r["home"]
+ ot_sp = f"-{r['mkt_spread']:.1f}" if fav_is_home else f"+{r['mkt_spread']:.1f}"
+
+        ot_tot = (
+            f"{r['ot_total']:.1f}" if pd.notna(r["ot_total"]) else f"{r['mkt_total']:.1f}"
+        )
+
+ # ML
+ if r["ot_ml_home"] and r["ot_ml_away"]:
+ ml = f"{r['ot_ml_home']:+d}/{r['ot_ml_away']:+d}"
+ else:
+ ml = "--"
+
+ # ATS
+ ats_conf = max(r["fp"], r["dp"])
+ ats_label = "DOG" if r["dp"] > r["fp"] else "FAV"
+ ats = f"{ats_label} {ats_conf}%"
+ if ats_conf >= 62:
+ ats += " **"
+
+ # O/U
+ ou_conf = max(r["op"], r["up"])
+ ou_label = "OVER" if r["op"] > r["up"] else "UNDR"
+ ou = f"{ou_label} {ou_conf}%"
+ if ou_conf >= 62:
+ ou += " **"
+
+ print(
+ f" {game:<42s} {ot_sp:>14s} {ot_tot:>7s}"
+ f" {ml:>12s} {r['pred_margin']:>+7.1f} {ats:>12s} {ou:>12s}"
+ )
+
+ print(" " + "-" * 118)
+ print()
+ print(" ** = Strong edge (>=62% confidence)")
+ now_str = pd.Timestamp.now().strftime("%I:%M %p PT")
+ print(f" OT odds polled at {now_str} | Model: score_model_v3 (48 features)")
+ print()
+
+
+def main() -> int:
+ parser = argparse.ArgumentParser(description="Game day outlook")
+ parser.add_argument("--date", required=True, help="Date (YYYY-MM-DD)")
+ args = parser.parse_args()
+
+ pred_file = Path(f"predictions/{args.date}.csv")
+ if not pred_file.exists():
+ print(f"[ERROR] No predictions: {pred_file}")
+ return 1
+
+ preds = pd.read_csv(pred_file)
+ print(f"[OK] {len(preds)} predictions loaded")
+
+ ot = load_overtime_odds(args.date)
+ print(f"[OK] {len(ot)} OT game lines loaded")
+
+ results = merge_odds_and_predictions(ot, preds)
+ matched = sum(1 for r in results if r["ot_spread"] is not None)
+ print(f"[OK] {matched}/{len(results)} games matched to OT odds")
+
+ print_report(results, args.date)
+ return 0
+
+
+if __name__ == "__main__":
+ sys.exit(main())
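The fuzzy team matching in `game_day_outlook.py` (normalize, then accept an exact match or a shared 6-character prefix) can be isolated as a standalone sketch:

```python
# Hypothetical standalone version of the matching rule: lowercase, strip
# punctuation, then accept an exact match or a shared 6-character prefix.
def norm(name: str) -> str:
    return name.lower().replace(".", "").replace("'", "").replace("-", " ").strip()


def loose_match(a: str, b: str) -> bool:
    a, b = norm(a), norm(b)
    return a == b or (len(a) > 4 and len(b) > 4 and (a[:6] in b or b[:6] in a))


print(loose_match("St. John's", "st johns"))     # True (exact after norm)
print(loose_match("Murray State", "Murray St"))  # True (prefix match)
print(loose_match("Indiana", "Illinois"))        # False
```

The `len(...) > 4` guard keeps very short normalized names from prefix-matching spuriously, at the cost of occasionally missing a valid pairing, hence the explicit `ALIASES` table for known troublesome names.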
diff --git a/scripts/analysis/grade_fanmatch.py b/scripts/analysis/grade_fanmatch.py
new file mode 100644
index 000000000..ec39dd588
--- /dev/null
+++ b/scripts/analysis/grade_fanmatch.py
@@ -0,0 +1,242 @@
+"""Grade KenPom FanMatch predictions against actual results and market lines.
+
+Usage:
+ uv run python scripts/analysis/grade_fanmatch.py --date 2026-02-09
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import os
+import sys
+from datetime import date, timedelta
+from typing import Any
+
+import httpx
+import pandas as pd
+from dotenv import load_dotenv
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.kenpom import KenPomAdapter
+from sports_betting_edge.config.settings import settings
+
+
+def fetch_completed_scores(target: str) -> dict[tuple[str, str], tuple[int, int]]:
+ """Fetch completed scores from Odds API for a given date."""
+ load_dotenv()
+ key = os.getenv("ODDS_API_KEY")
+ if not key:
+ print("[ERROR] ODDS_API_KEY not set")
+ sys.exit(1)
+
+ url = "https://api.the-odds-api.com/v4/sports/basketball_ncaab/scores"
+    resp = httpx.get(url, params={"apiKey": key, "daysFrom": 3}, timeout=30)
+    resp.raise_for_status()
+    events = resp.json()
+ credits = resp.headers.get("x-requests-remaining", "?")
+ print(f"[OK] API credits remaining: {credits}")
+
+ scores: dict[tuple[str, str], tuple[int, int]] = {}
+ for ev in events:
+ if not ev.get("completed"):
+ continue
+ ct = ev.get("commence_time", "")
+        if target not in ct:
+            # Evening ET tips roll into the next UTC date: accept games that
+            # start before 08:00 UTC when the previous UTC day is the target.
+            parts = ct.split("T")
+            if len(parts) != 2 or parts[1][:2] >= "08":
+                continue
+            d = date.fromisoformat(parts[0])
+            if (d - timedelta(days=1)).isoformat() != target:
+                continue
+ hs = as_ = None
+ for s in ev.get("scores", []):
+ if s["name"] == ev["home_team"]:
+ hs = int(s["score"])
+ elif s["name"] == ev["away_team"]:
+ as_ = int(s["score"])
+ if hs is not None and as_ is not None:
+ scores[(ev["home_team"], ev["away_team"])] = (hs, as_)
+
+ print(f"[OK] {len(scores)} completed games for {target}")
+ return scores
+
+
+async def main(target_date: str) -> None:
+ """Grade FanMatch predictions for a date."""
+ # 1) Fetch FanMatch for target date AND previous day (UTC/ET boundary)
+ kp = KenPomAdapter()
+ try:
+ all_fm: list[dict[str, Any]] = []
+ target_dt = date.fromisoformat(target_date)
+ prev_date = target_dt - timedelta(days=1)
+ for d in [prev_date, target_dt]:
+ try:
+ batch = await kp.get_fanmatch(d.isoformat())
+ print(f"[OK] FanMatch {d}: {len(batch)} games")
+ all_fm.extend(batch)
+ except Exception as e:
+ print(f"[--] FanMatch {d}: {e}")
+ finally:
+ await kp.close()
+ fm_df = pd.DataFrame(all_fm)
+ print(f"[OK] {len(fm_df)} total FanMatch predictions")
+
+ # 2) Completed scores from live API
+ scores = fetch_completed_scores(target_date)
+
+ # 3) Closing odds from predictions CSV (already has spread/total from pipeline)
+ pred_path = settings.predictions_dir / f"{target_date}.csv"
+ if not pred_path.exists():
+ # Try previous day's predictions (evening games filed under next UTC day)
+ prev_pred = (
+ settings.predictions_dir
+ / f"{(date.fromisoformat(target_date) - timedelta(days=1)).isoformat()}.csv"
+ )
+ if prev_pred.exists():
+ pred_path = prev_pred
+ if pred_path.exists():
+ preds = pd.read_csv(pred_path)
+ else:
+ print(f"[WARNING] No predictions file found for {target_date}")
+ return
+
+ # 4) Team mapping
+ mapping = read_parquet_df(str(settings.staging_dir / "mappings" / "team_mapping.parquet"))
+ kp_to_odds: dict[str, str] = dict(
+ zip(mapping["kenpom_name"], mapping["odds_api_name"], strict=False)
+ )
+
+ # 5) Match FanMatch -> scores -> predictions (for market lines)
+    # Build lookup from predictions: (home_team, away_team) -> (home_spread, total, favorite)
+ pred_lookup: dict[tuple[str, str], tuple[float, float, str]] = {}
+ for _, p in preds.iterrows():
+ fav = p["favorite_team"]
+ mag = float(p["spread_magnitude"])
+ home_spread = -mag if fav == p["home_team"] else mag
+ pred_lookup[(p["home_team"], p["away_team"])] = (
+ home_spread,
+ float(p["total_points"]),
+ fav,
+ )
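+    # Worked example of the canonical convention (illustrative numbers): with
+    # favorite_team == home_team and spread_magnitude == 6.5, home_spread is
+    # -6.5 (home lays 6.5) and the market-implied home margin is +6.5.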
+
+ results = []
+ for _, fm_row in fm_df.iterrows():
+ home_kp = fm_row.get("Home", "")
+ away_kp = fm_row.get("Visitor", "")
+ home_odds = kp_to_odds.get(home_kp)
+ away_odds = kp_to_odds.get(away_kp)
+ if not home_odds or not away_odds:
+ continue
+ if (home_odds, away_odds) not in scores:
+ continue
+ hs, as_ = scores[(home_odds, away_odds)]
+
+ if (home_odds, away_odds) not in pred_lookup:
+ continue
+        home_spread, mkt_total, fav = pred_lookup[(home_odds, away_odds)]
+        if pd.isna(fm_row.get("HomePred")) or pd.isna(fm_row.get("VisitorPred")):
+            continue
+
+ results.append(
+ {
+ "home": home_kp,
+ "away": away_kp,
+ "home_spread": home_spread,
+ "spread_mag": abs(home_spread),
+ "favorite": fav,
+ "fm_margin": fm_row["HomePred"] - fm_row["VisitorPred"],
+ "fm_total": fm_row["HomePred"] + fm_row["VisitorPred"],
+ "mkt_total": mkt_total,
+ "actual_margin": hs - as_,
+ "actual_total": hs + as_,
+ }
+ )
+
+ rdf = pd.DataFrame(results)
+ if len(rdf) == 0:
+ print("[WARNING] No games matched FanMatch + scores + odds")
+ return
+
+    # Grade ATS (nullable boolean for pushes). home_spread is the home-perspective
+    # line (negative when home is favored), so a side covers when margin + line > 0.
+    rdf["fm_ats"] = pd.array(
+        (rdf["fm_margin"] + rdf["home_spread"] > 0)
+        == (rdf["actual_margin"] + rdf["home_spread"] > 0),
+        dtype=pd.BooleanDtype(),
+    )
+    rdf.loc[rdf["actual_margin"] + rdf["home_spread"] == 0, "fm_ats"] = pd.NA
+
+ # Grade O/U
+ rdf["fm_ou"] = pd.array(
+ (rdf["fm_total"] > rdf["mkt_total"]) == (rdf["actual_total"] > rdf["mkt_total"]),
+ dtype=pd.BooleanDtype(),
+ )
+ rdf.loc[rdf["actual_total"] == rdf["mkt_total"], "fm_ou"] = pd.NA
+
+    # Errors (the market's implied home margin is -home_spread)
+    rdf["fm_margin_err"] = abs(rdf["fm_margin"] - rdf["actual_margin"])
+    rdf["mkt_margin_err"] = abs(rdf["home_spread"] + rdf["actual_margin"])
+ rdf["fm_total_err"] = abs(rdf["fm_total"] - rdf["actual_total"])
+ rdf["mkt_total_err"] = abs(rdf["mkt_total"] - rdf["actual_total"])
+
+ # Print report
+ print(
+ f"\n{'=' * 100}\n"
+ f" FANMATCH GRADING REPORT - {target_date} ({len(rdf)} games)\n"
+ f"{'=' * 100}\n"
+ )
+ hdr = (
+ f"{'Game':<42} {'Line':>6} {'FM':>6} {'Act':>6} {'ATS':>5}"
+ f" {'Line':>6} {'FM':>6} {'Act':>6} {'O/U':>5}"
+ )
+ print(f"{'':42} {'--- SPREAD ---':^24} {'--- TOTALS ---':^24}")
+ print(hdr)
+ print("-" * len(hdr))
+
+    for _, r in rdf.iterrows():
+        # Use pd.isna, not identity checks: BooleanDtype elements come back as
+        # numpy bools, so `is True` would misreport every row as a push.
+        ats_str = "PUSH" if pd.isna(r["fm_ats"]) else ("HIT" if r["fm_ats"] else "MISS")
+        ou_str = "PUSH" if pd.isna(r["fm_ou"]) else ("HIT" if r["fm_ou"] else "MISS")
+ label = f"{r['away']} @ {r['home']}"
+ print(
+ f"{label:<42} {r['home_spread']:>+6.1f} {r['fm_margin']:>+6.0f}"
+ f" {r['actual_margin']:>+6.0f} {ats_str:>5}"
+ f" {r['mkt_total']:>6.1f} {r['fm_total']:>6.0f}"
+ f" {r['actual_total']:>6.0f} {ou_str:>5}"
+ )
+ print("-" * len(hdr))
+
+ ats_v = rdf["fm_ats"].dropna()
+ ou_v = rdf["fm_ou"].dropna()
+
+ print("\n--- SPREAD PERFORMANCE ---")
+ print(
+ f"FM ATS Record: "
+ f"{int(ats_v.sum())}-{int(len(ats_v) - ats_v.sum())}"
+ f" ({ats_v.mean() * 100:.0f}%)"
+ )
+ print(f"FM Margin MAE: {rdf['fm_margin_err'].mean():.1f} pts")
+ print(f"Market Spread MAE: {rdf['mkt_margin_err'].mean():.1f} pts")
+ delta = rdf["fm_margin_err"].mean() - rdf["mkt_margin_err"].mean()
+ print(f"FM vs Market: {delta:+.1f} pts")
+
+ print("\n--- TOTALS PERFORMANCE ---")
+ print(
+ f"FM O/U Record: "
+ f"{int(ou_v.sum())}-{int(len(ou_v) - ou_v.sum())}"
+ f" ({ou_v.mean() * 100:.0f}%)"
+ )
+ print(f"FM Total MAE: {rdf['fm_total_err'].mean():.1f} pts")
+ print(f"Market Total MAE: {rdf['mkt_total_err'].mean():.1f} pts")
+ delta_t = rdf["fm_total_err"].mean() - rdf["mkt_total_err"].mean()
+ print(f"FM vs Market: {delta_t:+.1f} pts")
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Grade FanMatch predictions")
+ parser.add_argument(
+ "--date",
+ default=date.today().isoformat(),
+ help="Date to grade (YYYY-MM-DD)",
+ )
+ args = parser.parse_args()
+ asyncio.run(main(args.date))
diff --git a/scripts/analysis/grade_predictions.py b/scripts/analysis/grade_predictions.py
new file mode 100644
index 000000000..99d0b3148
--- /dev/null
+++ b/scripts/analysis/grade_predictions.py
@@ -0,0 +1,300 @@
+"""Grade predictions against actual scores from Odds API.
+
+Usage:
+ uv run python scripts/analysis/grade_predictions.py --date 2026-02-08
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+import os
+import sys
+from pathlib import Path
+
+import httpx
+from dotenv import load_dotenv
+
+
+def fetch_scores(target_date: str) -> dict[str, tuple[int, int]]:
+ """Fetch completed scores from Odds API for games on target_date."""
+ load_dotenv()
+ key = os.getenv("ODDS_API_KEY")
+ if not key:
+ print("[ERROR] ODDS_API_KEY not set")
+ sys.exit(1)
+
+ url = "https://api.the-odds-api.com/v4/sports/basketball_ncaab/scores"
+ params = {"apiKey": key, "daysFrom": 3}
+    resp = httpx.get(url, params=params, timeout=30)
+    resp.raise_for_status()
+    data = resp.json()
+
+ scores: dict[str, tuple[int, int]] = {}
+ for event in data:
+ if not event.get("completed"):
+ continue
+ ct = event.get("commence_time", "")
+        # Match date (late games may roll over to the next UTC day)
+        if target_date not in ct:
+            # Keep games commencing before 08:00 UTC as the previous US date
+            from datetime import date, timedelta
+
+            parts = ct.split("T")
+            if len(parts) != 2 or parts[1][:2] >= "08":
+                continue
+            prev = (date.fromisoformat(parts[0]) - timedelta(days=1)).isoformat()
+            if prev != target_date:
+                continue
+
+ home_score = away_score = None
+ for s in event.get("scores", []):
+ if s["name"] == event["home_team"]:
+ home_score = int(s["score"])
+ elif s["name"] == event["away_team"]:
+ away_score = int(s["score"])
+
+        if home_score is not None and away_score is not None:
+            # Key by the same normalize() used for prediction lookups
+            scores[normalize(event["home_team"])] = (home_score, away_score)
+
+ credits = resp.headers.get("x-requests-remaining", "unknown")
+ print(f"[OK] Fetched {len(scores)} completed games (API credits: {credits})")
+ return scores
+
+
+def load_predictions(pred_file: Path) -> list[dict[str, str]]:
+ """Load predictions CSV."""
+ preds = []
+ with open(pred_file) as f:
+ reader = csv.DictReader(f)
+ for row in reader:
+ preds.append(row)
+ return preds
+
+
+def normalize(name: str) -> str:
+    """Lowercase and drop periods/apostrophes, e.g. "St. John's " -> "st johns"."""
+    return name.lower().replace(".", "").replace("'", "").strip()
+
+
+def match_predictions(
+ preds: list[dict[str, str]], scores: dict[str, tuple[int, int]]
+) -> list[dict]:
+ """Match predictions to actual scores."""
+ results = []
+ for pred in preds:
+ home = pred["home_team"].strip()
+ nh = normalize(home)
+
+ # Exact match
+ matched_key = None
+ if nh in scores:
+ matched_key = nh
+ else:
+ # Partial match (first 10 chars)
+ for sk in scores:
+ if nh[:10] in sk or sk[:10] in nh:
+ matched_key = sk
+ break
+
+ if matched_key is None:
+ continue
+
+ hs, as_ = scores[matched_key]
+ results.append(
+ {
+ "home": home,
+ "away": pred["away_team"].strip(),
+ "pred_home": float(pred["predicted_home_score"]),
+ "pred_away": float(pred["predicted_away_score"]),
+ "pred_margin": float(pred["predicted_margin"]),
+ "pred_total": float(pred["predicted_total"]),
+ "actual_home": hs,
+ "actual_away": as_,
+ "actual_margin": hs - as_,
+ "actual_total": hs + as_,
+ "favorite": pred["favorite_team"],
+ "spread_mag": float(pred["spread_magnitude"]),
+ "total_line": float(pred["total_points"]),
+ "fav_cover_pct": int(pred["favorite_cover_prob"].replace("%", "")),
+ "dog_cover_pct": int(pred["underdog_cover_prob"].replace("%", "")),
+ "over_pct": int(pred["over_prob"].replace("%", "")),
+ "under_pct": int(pred["under_prob"].replace("%", "")),
+ }
+ )
+
+ return results
+
+
+def grade(results: list[dict]) -> None:
+ """Print grading report."""
+ print()
+ print("=" * 130)
+ print("PREDICTION GRADING REPORT")
+ print("=" * 130)
+ print()
+
+ header = (
+ f"{'Game':<48s} {'Spread':>7s} {'Pred':>6s} {'Act':>5s} "
+ f"{'ATS':>5s} {'Line':>6s} {'Pred':>6s} {'Act':>5s} {'O/U':>5s}"
+ )
+ print(header)
+ print("-" * 130)
+
+ spread_hits = 0
+ spread_total = 0
+ total_hits = 0
+ total_total = 0
+ margin_errors = []
+ total_errors = []
+ market_total_errors = []
+ strong_spread_hits = 0
+ strong_spread_total = 0
+ strong_total_hits = 0
+ strong_total_total = 0
+
+ for r in results:
+ game = f"{r['home'][:22]:<22s} vs {r['away'][:22]:<22s}"
+
+ # --- Spread grading ---
+ am = r["actual_margin"]
+ sm = r["spread_mag"]
+ fav = r["favorite"]
+
+ # Favorite margin (positive = favorite won by this much)
+ fav_margin = am if fav == r["home"] else -am
+
+ margin_errors.append(abs(r["pred_margin"] - am))
+
+ # Did favorite cover?
+ if fav_margin == sm:
+ spread_result = "PUSH"
+ elif fav_margin > sm:
+ fav_covered = True
+ spread_result = "FAV"
+ else:
+ fav_covered = False
+ spread_result = "DOG"
+
+ # Model pick
+ model_pick_fav = r["fav_cover_pct"] > r["dog_cover_pct"]
+
+ if spread_result == "PUSH":
+ ats_str = "PUSH"
+ else:
+ model_correct = (model_pick_fav and fav_covered) or (
+ not model_pick_fav and not fav_covered
+ )
+ ats_str = "HIT" if model_correct else "MISS"
+ spread_total += 1
+ if model_correct:
+ spread_hits += 1
+
+ # Strong picks (>= 62% confidence)
+ confidence = max(r["fav_cover_pct"], r["dog_cover_pct"])
+ if confidence >= 62:
+ strong_spread_total += 1
+ if model_correct:
+ strong_spread_hits += 1
+
+ # Display spread
+ spread_disp = f"-{sm}" if fav == r["home"] else f"+{sm}"
+
+ # --- Totals grading ---
+ at = r["actual_total"]
+ tl = r["total_line"]
+ pt = r["pred_total"]
+
+ total_errors.append(abs(pt - at))
+ market_total_errors.append(abs(tl - at))
+
+        if at == tl:
+            ou_str = "PUSH"
+        else:
+            actual_over = at > tl
+            model_said_over = r["over_pct"] > 50
+            total_correct = (model_said_over and actual_over) or (
+                not model_said_over and not actual_over
+            )
+            ou_str = "HIT" if total_correct else "MISS"
+            total_total += 1
+            if total_correct:
+                total_hits += 1
+
+            # Strong totals picks (>= 62%); pushes are excluded above
+            confidence = max(r["over_pct"], r["under_pct"])
+            if confidence >= 62:
+                strong_total_total += 1
+                if total_correct:
+                    strong_total_hits += 1
+
+ print(
+ f"{game:<48s} {spread_disp:>7s} {r['pred_margin']:>+6.1f} "
+ f"{am:>+5d} {ats_str:>5s} {tl:>6.1f} {pt:>6.1f} "
+ f"{at:>5d} {ou_str:>5s}"
+ )
+
+ print("-" * 130)
+ print()
+
+ # Summary
+ print("=== SPREAD PERFORMANCE ===")
+ if spread_total > 0:
+ pct = spread_hits / spread_total * 100
+ print(f"All picks: {spread_hits}-{spread_total - spread_hits} ({pct:.0f}%)")
+ if strong_spread_total > 0:
+ spct = strong_spread_hits / strong_spread_total * 100
+ print(
+ f"Strong picks (>=62%): {strong_spread_hits}-"
+ f"{strong_spread_total - strong_spread_hits} ({spct:.0f}%)"
+ )
+ print(f"Margin MAE: {sum(margin_errors) / len(margin_errors):.1f} pts")
+ print()
+
+ print("=== TOTALS PERFORMANCE ===")
+ if total_total > 0:
+ tpct = total_hits / total_total * 100
+ print(f"All picks: {total_hits}-{total_total - total_hits} ({tpct:.0f}%)")
+ if strong_total_total > 0:
+ stpct = strong_total_hits / strong_total_total * 100
+ print(
+ f"Strong picks (>=62%): {strong_total_hits}-"
+ f"{strong_total_total - strong_total_hits} ({stpct:.0f}%)"
+ )
+ print(f"Model Total MAE: {sum(total_errors) / len(total_errors):.1f} pts")
+ print(f"Market Total MAE: {sum(market_total_errors) / len(market_total_errors):.1f} pts")
+    model_mae = sum(total_errors) / len(total_errors)
+    market_mae = sum(market_total_errors) / len(market_total_errors)
+    edge = market_mae - model_mae  # positive = model beat the closing total
+    print(f"Model vs Market: {edge:+.1f} pts")
+ print()
+
+
+def main() -> int:
+ parser = argparse.ArgumentParser(description="Grade predictions vs actuals")
+ parser.add_argument("--date", required=True, help="Date to grade (YYYY-MM-DD)")
+ args = parser.parse_args()
+
+ pred_file = Path(f"predictions/{args.date}.csv")
+ if not pred_file.exists():
+ print(f"[ERROR] No predictions file: {pred_file}")
+ return 1
+
+ preds = load_predictions(pred_file)
+ print(f"[OK] Loaded {len(preds)} predictions from {pred_file}")
+
+ scores = fetch_scores(args.date)
+ results = match_predictions(preds, scores)
+ print(f"[OK] Matched {len(results)} of {len(preds)} predicted games")
+
+ if not results:
+ print("[ERROR] No matches found")
+ return 1
+
+ grade(results)
+ return 0
+
+
+if __name__ == "__main__":
+ sys.exit(main())
diff --git a/scripts/analysis/market_vs_kenpom_season.py b/scripts/analysis/market_vs_kenpom_season.py
new file mode 100644
index 000000000..abb6405f1
--- /dev/null
+++ b/scripts/analysis/market_vs_kenpom_season.py
@@ -0,0 +1,676 @@
+"""Season-long comparison: Market closing lines vs KenPom FanMatch.
+
+Uses Odds API closing lines (proxy for Overtime/sharp market) as the
+benchmark and compares KenPom FanMatch predictions against them across
+the full season.
+
+Usage:
+ uv run python scripts/analysis/market_vs_kenpom_season.py --season 2026
+ uv run python scripts/analysis/market_vs_kenpom_season.py --start 2025-11-04 --end 2026-02-08
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+from datetime import date, datetime, timedelta
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.kenpom import KenPomAdapter
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+# ── Team name normalization ────────────────────────────────────────────
+
+KENPOM_TO_CANONICAL: dict[str, str] = {
+ "South Fla.": "South Florida",
+ "Penn St.": "Penn State",
+ "Ohio St.": "Ohio State",
+ "Mich. St.": "Michigan State",
+ "Michigan St.": "Michigan State",
+ "Miss. St.": "Mississippi State",
+ "N.C. State": "NC State",
+ "UNC Greensboro": "UNC Greensboro",
+ "UNCG": "UNC Greensboro",
+ "Wichita St.": "Wichita State",
+ "Boise St.": "Boise State",
+ "Fresno St.": "Fresno State",
+ "San Diego St.": "San Diego State",
+ "Colorado St.": "Colorado State",
+ "Utah St.": "Utah State",
+ "Kansas St.": "Kansas State",
+ "Iowa St.": "Iowa State",
+ "Oklahoma St.": "Oklahoma State",
+ "Oregon St.": "Oregon State",
+ "Washington St.": "Washington State",
+ "Arizona St.": "Arizona State",
+ "San Jose St.": "San Jose State",
+ "Jacksonville St.": "Jacksonville State",
+ "Kennesaw St.": "Kennesaw State",
+ "Morehead St.": "Morehead State",
+ "Murray St.": "Murray State",
+ "Appalachian St.": "Appalachian State",
+ "Portland St.": "Portland State",
+ "Sacramento St.": "Sacramento State",
+ "Weber St.": "Weber State",
+ "Montana St.": "Montana State",
+ "Ill.": "Illinois",
+ "Ind.": "Indiana",
+ "La.": "Louisiana",
+ "Ark.": "Arkansas",
+ "Col.": "Colorado",
+ "Conn.": "Connecticut",
+ "Del.": "Delaware",
+ "Fla.": "Florida",
+ "Ga.": "Georgia",
+ "Ky.": "Kentucky",
+ "Md.": "Maryland",
+ "Mass.": "Massachusetts",
+ "Mich.": "Michigan",
+ "Minn.": "Minnesota",
+ "Miss.": "Mississippi",
+ "Mo.": "Missouri",
+ "Neb.": "Nebraska",
+ "Nev.": "Nevada",
+ "Ore.": "Oregon",
+ "Tenn.": "Tennessee",
+ "Tex.": "Texas",
+ "Va.": "Virginia",
+ "Wash.": "Washington",
+ "Wis.": "Wisconsin",
+ "Wyo.": "Wyoming",
+ "UCF": "UCF",
+ "UAB": "UAB",
+ "UTEP": "UTEP",
+ "SMU": "SMU",
+ "LSU": "LSU",
+ "USC": "USC",
+ "BYU": "BYU",
+ "VCU": "VCU",
+ "UConn": "Connecticut",
+ "St. John's": "St. John's",
+}
+
+
+def normalize_kenpom_name(name: str) -> str:
+    """Normalize KenPom team name to match Odds API naming."""
+    return KENPOM_TO_CANONICAL.get(name, name)
+
+
+def fuzzy_match(name1: str, name2: str) -> bool:
+ """Check if two team names likely refer to the same team."""
+ n1 = normalize_kenpom_name(name1).lower().strip()
+ n2 = name2.lower().strip()
+
+ if n1 == n2:
+ return True
+
+ # One contains the other
+ if n1 in n2 or n2 in n1:
+ return True
+
+ # Strip common suffixes for matching
+ suffixes = [
+ "bulldogs",
+ "tigers",
+ "bears",
+ "wolverines",
+ "buckeyes",
+ "red raiders",
+ "mountaineers",
+ "paladins",
+ "spartans",
+ "shockers",
+ "green wave",
+ "49ers",
+ "bearcats",
+ "knights",
+ "golden gophers",
+ "terrapins",
+ "blazers",
+ "owls",
+ "hawkeyes",
+ "wildcats",
+ "bulls",
+ "golden hurricane",
+ "nittany lions",
+ "trojans",
+ "cougars",
+ "gaels",
+ "dons",
+ "aggies",
+ "bruins",
+ "huskies",
+ "ducks",
+ "beavers",
+ "cardinals",
+ "buffaloes",
+ "sun devils",
+ "utes",
+ "rams",
+ "falcons",
+ "aztecs",
+ "rebels",
+ "wolf pack",
+ "broncos",
+ "cowboys",
+ "lobos",
+ "thunderbirds",
+ "texans",
+ "lancers",
+ "bobcats",
+ "eagles",
+ "lumberjacks",
+ "vikings",
+ "pilots",
+ "redhawks",
+ "toreros",
+ "waves",
+ "lions",
+ "dolphins",
+ "peacocks",
+ "red foxes",
+ "stags",
+ "jaspers",
+ "colonels",
+ "purple aces",
+ "bluejays",
+ "musketeers",
+ "friars",
+ "hoyas",
+ "pirates",
+ "boilermakers",
+ "fighting illini",
+ "badgers",
+ "hawkeyes",
+ "cornhuskers",
+ "gophers",
+ "blue devils",
+ "tar heels",
+ "cavaliers",
+ "hokies",
+ "demon deacons",
+ "orange",
+ "yellow jackets",
+ "hurricanes",
+ "seminoles",
+ "panthers",
+ "cardinals",
+ "volunteers",
+ "crimson tide",
+ "razorbacks",
+ "tigers",
+ "gamecocks",
+ "gators",
+ "commodores",
+ "jayhawks",
+ "sooners",
+ "longhorns",
+ "horned frogs",
+ "cyclones",
+ "red raiders",
+ "mountaineers",
+ "bears",
+ ]
+ n1_stripped = n1
+ n2_stripped = n2
+ for s in suffixes:
+ if n1_stripped.endswith(s):
+ n1_stripped = n1_stripped[: -len(s)].strip()
+ if n2_stripped.endswith(s):
+ n2_stripped = n2_stripped[: -len(s)].strip()
+
+ if n1_stripped and n2_stripped:
+ if n1_stripped == n2_stripped:
+ return True
+ if n1_stripped in n2_stripped or n2_stripped in n1_stripped:
+ return True
+
+ return False
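+
+# Illustrative checks: fuzzy_match("Mich. St.", "Michigan State Spartans") is
+# True via the abbreviation map plus containment; bare containment can
+# over-match (e.g. "Kansas" vs "Kansas State Wildcats"), which is why exact
+# KENPOM_TO_CANONICAL entries are preferred where names are ambiguous.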
+
+
+# ── Data loading ───────────────────────────────────────────────────────
+
+
+def load_staging_data() -> pd.DataFrame:
+ """Load events + line features from staging.
+
+ Returns merged DataFrame with game results and closing market lines.
+ """
+ events = pd.read_parquet("data/staging/events.parquet")
+ lines = pd.read_parquet("data/staging/line_features.parquet")
+
+ # Drop duplicate line entries (some events have 2 rows for fav/dog)
+ lines_deduped = lines.drop_duplicates(subset=["event_id"], keep="first")
+
+ merged = events.merge(lines_deduped, on="event_id", how="inner")
+ logger.info(
+ f"Staging data: {len(events)} events, {len(lines_deduped)} with lines, {len(merged)} merged"
+ )
+
+ # Compute market predicted margin (positive = home wins)
+ # closing_spread is magnitude, favorite_team tells direction
+ def market_margin(row: pd.Series) -> float | None:
+ if pd.isna(row["closing_spread"]) or pd.isna(row["favorite_team"]):
+ return None
+ mag = float(row["closing_spread"])
+ if fuzzy_match(row["favorite_team"], row["home_team"]):
+ return mag # Home favored -> positive margin
+ else:
+ return -mag # Away favored -> negative margin
+
+ merged["market_predicted_margin"] = merged.apply(market_margin, axis=1)
+ merged["actual_margin"] = merged["home_score"] - merged["away_score"]
+ merged["actual_total"] = merged["home_score"] + merged["away_score"]
+
+ return merged
+
+
+async def fetch_kenpom_season(start_date: date, end_date: date) -> pd.DataFrame:
+ """Fetch KenPom FanMatch for entire season.
+
+ Args:
+ start_date: Season start
+ end_date: Season end (inclusive)
+
+ Returns:
+ DataFrame with all KenPom predictions
+ """
+ kenpom = KenPomAdapter()
+ all_rows: list[dict] = []
+ current = start_date
+ n_dates = (end_date - start_date).days + 1
+ fetched = 0
+
+ try:
+ while current <= end_date:
+ date_str = current.isoformat()
+ try:
+ games = await kenpom.get_fanmatch(date_str)
+ if games:
+ for g in games:
+ home = g.get("Home", "")
+ visitor = g.get("Visitor", "")
+ home_pred = g.get("HomePred")
+ visitor_pred = g.get("VisitorPred")
+
+ kp_margin = None
+ kp_total = None
+ if home_pred is not None and visitor_pred is not None:
+ kp_margin = float(home_pred) - float(visitor_pred)
+ kp_total = float(home_pred) + float(visitor_pred)
+
+ all_rows.append(
+ {
+ "game_date": date_str,
+ "kp_home": home,
+ "kp_visitor": visitor,
+ "kp_home_pred": home_pred,
+ "kp_visitor_pred": visitor_pred,
+ "kp_predicted_margin": kp_margin,
+ "kp_predicted_total": kp_total,
+ "kp_home_wp": g.get("HomeWP"),
+ "kp_predicted_winner": g.get("PredictedWinner"),
+ }
+ )
+ fetched += 1
+ if fetched % 20 == 0:
+ logger.info(
+ f" Fetched {fetched}/{n_dates} dates ({len(all_rows)} games)..."
+ )
+ except Exception as e:
+ logger.debug(f" No data for {date_str}: {e}")
+
+ current += timedelta(days=1)
+ finally:
+ await kenpom.close()
+
+ df = pd.DataFrame(all_rows)
+ logger.info(f"KenPom FanMatch: {len(df)} predictions across {fetched} dates")
+ return df
+
+
+def match_kenpom_to_market(market_df: pd.DataFrame, kenpom_df: pd.DataFrame) -> pd.DataFrame:
+ """Match KenPom predictions to market games by date + team names.
+
+ Returns:
+ Merged DataFrame with market, KenPom, and actual data
+ """
+ matched_count = 0
+ kp_cols = [
+ "kp_home",
+ "kp_visitor",
+ "kp_home_pred",
+ "kp_visitor_pred",
+ "kp_predicted_margin",
+ "kp_predicted_total",
+ "kp_home_wp",
+ ]
+
+ # Initialize KenPom columns
+ for col in kp_cols:
+ market_df[col] = None
+
+ for idx, mkt in market_df.iterrows():
+ game_date = str(mkt["game_date"])
+ home = mkt["home_team"]
+ away = mkt["away_team"]
+
+ # Find matching KenPom game
+ date_games = kenpom_df[kenpom_df["game_date"] == game_date]
+
+ for _, kp in date_games.iterrows():
+ if fuzzy_match(kp["kp_home"], home) and fuzzy_match(kp["kp_visitor"], away):
+ for col in kp_cols:
+ market_df.at[idx, col] = kp[col]
+ matched_count += 1
+ break
+
+    match_rate = matched_count / len(market_df) if len(market_df) else 0.0
+    logger.info(f"Matched {matched_count}/{len(market_df)} games ({match_rate:.1%})")
+ return market_df
+
+
+# ── Analysis ───────────────────────────────────────────────────────────
+
+
+def run_analysis(df: pd.DataFrame) -> dict:
+ """Run comprehensive accuracy analysis.
+
+ Returns dict of metrics for downstream use.
+ """
+ # Filter to games with both market and KenPom predictions
+ has_market = df["market_predicted_margin"].notna()
+ has_kp = df["kp_predicted_margin"].notna()
+ has_actual = df["actual_margin"].notna()
+ complete = has_market & has_kp & has_actual
+ m = df[complete].copy()
+ n = len(m)
+
+ if n == 0:
+ print("[ERROR] No games with all three data sources")
+ return {}
+
+ print(f"\n{'=' * 90}")
+ print(f"SEASON COMPARISON: MARKET vs KENPOM ({n} games)")
+ print(f"{'=' * 90}")
+ print(f"Date range: {m['game_date'].min()} to {m['game_date'].max()}")
+
+ # ── Margin accuracy ──
+ m["mkt_margin_err"] = (m["market_predicted_margin"] - m["actual_margin"]).abs()
+ m["kp_margin_err"] = (m["kp_predicted_margin"] - m["actual_margin"]).abs()
+
+ mkt_mae = m["mkt_margin_err"].mean()
+ kp_mae = m["kp_margin_err"].mean()
+
+ print("\n--- SPREAD/MARGIN ACCURACY ---")
+ print(f" Market margin MAE: {mkt_mae:.2f} pts")
+ print(f" KenPom margin MAE: {kp_mae:.2f} pts")
+ diff = mkt_mae - kp_mae
+ winner = "KenPom" if diff > 0 else "Market"
+ print(f" -> {winner} closer by {abs(diff):.2f} pts")
+
+ # Bias
+ mkt_bias = (m["market_predicted_margin"] - m["actual_margin"]).mean()
+ kp_bias = (m["kp_predicted_margin"] - m["actual_margin"]).mean()
+ print(f"\n Market margin bias: {mkt_bias:+.2f} (predicted - actual)")
+ print(f" KenPom margin bias: {kp_bias:+.2f} (predicted - actual)")
+
+ # Correct winner
+ mkt_winner_correct = ((m["market_predicted_margin"] > 0) == (m["actual_margin"] > 0)).mean()
+ kp_winner_correct = ((m["kp_predicted_margin"] > 0) == (m["actual_margin"] > 0)).mean()
+ print(f"\n Market correct winner: {mkt_winner_correct:.1%}")
+ print(f" KenPom correct winner: {kp_winner_correct:.1%}")
+
+ # Who was closer more often?
+ kp_closer = (m["kp_margin_err"] < m["mkt_margin_err"]).mean()
+ print(f"\n KenPom closer than market: {kp_closer:.1%} of games")
+
+ # ATS: KenPom vs market spread
+ cover = m["actual_margin"] - m["market_predicted_margin"]
+ kp_edge = m["kp_predicted_margin"] - m["market_predicted_margin"]
+ kp_ats = ((kp_edge > 0) & (cover > 0)) | ((kp_edge < 0) & (cover < 0))
+ # Exclude games where KP agrees with market (no edge)
+ has_edge = kp_edge.abs() > 0.5
+ if has_edge.any():
+ kp_ats_rate = kp_ats[has_edge].mean()
+ print(
+ f"\n KenPom ATS vs market: {kp_ats_rate:.1%} "
+ f"({has_edge.sum()} games where KP disagrees)"
+ )
+
+ # ── Total accuracy ──
+ has_total = m["closing_total"].notna() & m["kp_predicted_total"].notna()
+ mt = m[has_total].copy()
+ nt = len(mt)
+
+ if nt > 0:
+ mt["mkt_total_err"] = (mt["closing_total"] - mt["actual_total"]).abs()
+ mt["kp_total_err"] = (mt["kp_predicted_total"] - mt["actual_total"]).abs()
+
+ print(f"\n--- TOTALS ACCURACY ({nt} games) ---")
+ mkt_total_mae = mt["mkt_total_err"].mean()
+ kp_total_mae = mt["kp_total_err"].mean()
+ print(f" Market total MAE: {mkt_total_mae:.2f} pts")
+ print(f" KenPom total MAE: {kp_total_mae:.2f} pts")
+ diff = mkt_total_mae - kp_total_mae
+ winner = "KenPom" if diff > 0 else "Market"
+ print(f" -> {winner} closer by {abs(diff):.2f} pts")
+
+ mkt_total_bias = (mt["closing_total"] - mt["actual_total"]).mean()
+ kp_total_bias = (mt["kp_predicted_total"] - mt["actual_total"]).mean()
+ print(f"\n Market total bias: {mkt_total_bias:+.2f}")
+ print(f" KenPom total bias: {kp_total_bias:+.2f}")
+
+ # O/U accuracy when KenPom disagrees with market
+ ou_edge = mt["kp_predicted_total"] - mt["closing_total"]
+ actual_ou = mt["actual_total"] - mt["closing_total"]
+ kp_ou = ((ou_edge > 0) & (actual_ou > 0)) | ((ou_edge < 0) & (actual_ou < 0))
+ has_ou_edge = ou_edge.abs() > 0.5
+ if has_ou_edge.any():
+ print(
+ f"\n KenPom O/U vs market: {kp_ou[has_ou_edge].mean():.1%} "
+ f"({has_ou_edge.sum()} games where KP disagrees)"
+ )
+
+ # KenPom closer more often on totals?
+ kp_total_closer = (mt["kp_total_err"] < mt["mkt_total_err"]).mean()
+ print(f" KenPom closer on totals: {kp_total_closer:.1%}")
+
+ # ── Disagreement analysis ──
+ m["margin_disagree"] = (m["kp_predicted_margin"] - m["market_predicted_margin"]).abs()
+
+ print("\n--- DISAGREEMENT ANALYSIS ---")
+ print(f" Avg margin disagreement: {m['margin_disagree'].mean():.2f} pts")
+ print(f" Median: {m['margin_disagree'].median():.2f} pts")
+ print(f" 90th percentile: {m['margin_disagree'].quantile(0.9):.2f} pts")
+
+ # Bucket by disagreement level
+ for threshold in [2.0, 3.0, 5.0, 7.0]:
+ big_disagree = m[m["margin_disagree"] >= threshold]
+ if len(big_disagree) >= 5:
+ kp_better_pct = (big_disagree["kp_margin_err"] < big_disagree["mkt_margin_err"]).mean()
+ # ATS when KP disagrees significantly
+ cover_sub = big_disagree["actual_margin"] - big_disagree["market_predicted_margin"]
+ kp_dir = big_disagree["kp_predicted_margin"] - big_disagree["market_predicted_margin"]
+ ats_hit = ((kp_dir > 0) & (cover_sub > 0)) | ((kp_dir < 0) & (cover_sub < 0))
+ print(f"\n Disagreement >= {threshold:.0f} pts ({len(big_disagree)} games):")
+ print(f" KenPom closer: {kp_better_pct:.1%}")
+ print(f" KenPom ATS: {ats_hit.mean():.1%}")
+ print(f" Market MAE: {big_disagree['mkt_margin_err'].mean():.2f}")
+ print(f" KenPom MAE: {big_disagree['kp_margin_err'].mean():.2f}")
+
+ # ── Blended predictions ──
+ print("\n--- BLENDED PREDICTIONS ---")
+ for w in [0.25, 0.50, 0.75]:
+ blend = w * m["kp_predicted_margin"] + (1 - w) * m["market_predicted_margin"]
+ blend_mae = (blend - m["actual_margin"]).abs().mean()
+ print(f" {w:.0%} KP + {1 - w:.0%} Market: MAE = {blend_mae:.2f}")
+
+ # ── Monthly breakdown ──
+ m["month"] = pd.to_datetime(m["game_date"]).dt.to_period("M")
+ print("\n--- MONTHLY BREAKDOWN ---")
+ print(
+ f" {'Month':<10} {'N':>5} {'Mkt MAE':>10} {'KP MAE':>10} {'KP closer':>10} {'KP ATS':>8}"
+ )
+ print(f" {'-' * 55}")
+
+ for month, grp in m.groupby("month"):
+ mkt_m = grp["mkt_margin_err"].mean()
+ kp_m = grp["kp_margin_err"].mean()
+ kp_cl = (grp["kp_margin_err"] < grp["mkt_margin_err"]).mean()
+ cover_g = grp["actual_margin"] - grp["market_predicted_margin"]
+ kp_dir_g = grp["kp_predicted_margin"] - grp["market_predicted_margin"]
+ ats_g = ((kp_dir_g > 0) & (cover_g > 0)) | ((kp_dir_g < 0) & (cover_g < 0))
+ has_e = kp_dir_g.abs() > 0.5
+ ats_rate = ats_g[has_e].mean() if has_e.any() else float("nan")
+ print(
+ f" {str(month):<10} {len(grp):>5} {mkt_m:>10.2f} {kp_m:>10.2f} "
+ f"{kp_cl:>10.1%} {ats_rate:>8.1%}"
+ )
+
+ # ── Spread magnitude buckets ──
+ m["spread_bucket"] = pd.cut(
+ m["market_predicted_margin"].abs(),
+ bins=[0, 3, 7, 12, 50],
+ labels=["<3", "3-7", "7-12", "12+"],
+ )
+ print("\n--- BY SPREAD SIZE ---")
+ print(f" {'Bucket':<10} {'N':>5} {'Mkt MAE':>10} {'KP MAE':>10} {'KP closer':>10}")
+ print(f" {'-' * 47}")
+
+ for bucket, grp in m.groupby("spread_bucket", observed=True):
+ if len(grp) < 3:
+ continue
+ mkt_b = grp["mkt_margin_err"].mean()
+ kp_b = grp["kp_margin_err"].mean()
+ kp_cl = (grp["kp_margin_err"] < grp["mkt_margin_err"]).mean()
+ print(f" {str(bucket):<10} {len(grp):>5} {mkt_b:>10.2f} {kp_b:>10.2f} {kp_cl:>10.1%}")
+
+ # ── Summary table ──
+ print(f"\n{'=' * 90}")
+ print("FINAL SUMMARY")
+ print(f"{'=' * 90}")
+ print(f"\n Games analyzed: {n}")
+ print(f" Date range: {m['game_date'].min()} to {m['game_date'].max()}")
+ print(f"\n {'Metric':<30} {'Market':>12} {'KenPom':>12} {'Better':>10}")
+ print(f" {'-' * 66}")
+ print(
+ f" {'Margin MAE':<30} {mkt_mae:>12.2f} {kp_mae:>12.2f} "
+ f"{'KenPom' if kp_mae < mkt_mae else 'Market':>10}"
+ )
+ print(
+ f" {'Margin Bias':<30} {mkt_bias:>+12.2f} {kp_bias:>+12.2f} "
+ f"{'KenPom' if abs(kp_bias) < abs(mkt_bias) else 'Market':>10}"
+ )
+ print(
+ f" {'Correct Winner':<30} {mkt_winner_correct:>12.1%} "
+ f"{kp_winner_correct:>12.1%} "
+ f"{'KenPom' if kp_winner_correct > mkt_winner_correct else 'Market':>10}"
+ )
+ if nt > 0:
+ print(
+ f" {'Total MAE':<30} {mkt_total_mae:>12.2f} {kp_total_mae:>12.2f} "
+ f"{'KenPom' if kp_total_mae < mkt_total_mae else 'Market':>10}"
+ )
+ print(
+ f" {'Total Bias':<30} {mkt_total_bias:>+12.2f} {kp_total_bias:>+12.2f} "
+ f"{'KenPom' if abs(kp_total_bias) < abs(mkt_total_bias) else 'Market':>10}"
+ )
+ print()
+
+ return {
+ "n_games": n,
+ "market_margin_mae": mkt_mae,
+ "kenpom_margin_mae": kp_mae,
+ "market_margin_bias": mkt_bias,
+ "kenpom_margin_bias": kp_bias,
+ "market_winner_pct": mkt_winner_correct,
+ "kenpom_winner_pct": kp_winner_correct,
+ "kenpom_closer_pct": kp_closer,
+ }
+
+
+# ── Main ───────────────────────────────────────────────────────────────
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Run the season-long comparison."""
+ logger.info("[OK] Season-long Market vs KenPom Comparison\n")
+
+ # Determine date range
+ if args.season:
+ start_date = date(args.season - 1, 11, 1)
+ end_date = date.today() - timedelta(days=1)
+ else:
+ start_date = datetime.strptime(args.start, "%Y-%m-%d").date()
+ end_date = datetime.strptime(args.end, "%Y-%m-%d").date()
+
+ logger.info(f"Date range: {start_date} to {end_date}")
+
+ # Load market data
+ market_df = load_staging_data()
+
+ # Filter to date range
+ market_df["game_date"] = market_df["game_date"].astype(str)
+ market_df = market_df[
+ (market_df["game_date"] >= str(start_date)) & (market_df["game_date"] <= str(end_date))
+ ].copy()
+ logger.info(f"Market games in range: {len(market_df)}")
+
+ # Fetch KenPom
+ kenpom_df = await fetch_kenpom_season(start_date, end_date)
+
+ if kenpom_df.empty:
+ logger.error("No KenPom data fetched")
+ return
+
+ # Cache KenPom data
+ cache_path = Path(f"data/kenpom/fanmatch/fanmatch_season_{args.season or 'custom'}.parquet")
+ cache_path.parent.mkdir(parents=True, exist_ok=True)
+ kenpom_df.to_parquet(cache_path)
+ logger.info(f"Cached KenPom data to {cache_path}")
+
+ # Match
+ merged = match_kenpom_to_market(market_df, kenpom_df)
+
+ # Analyze
+ run_analysis(merged)
+
+ # Save detailed results
+ if args.output:
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+ merged.to_parquet(args.output)
+ logger.info(f"[OK] Saved detailed results to {args.output}")
+
+
+def main() -> None:
+ """Entry point."""
+ parser = argparse.ArgumentParser(description="Season-long Market vs KenPom comparison")
+ parser.add_argument("--season", type=int, default=None, help="Season year")
+ parser.add_argument("--start", type=str, default=None, help="Start date")
+ parser.add_argument("--end", type=str, default=None, help="End date")
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("data/reports/market_vs_kenpom_season.parquet"),
+ )
+ args = parser.parse_args()
+
+ if args.season is None and (args.start is None or args.end is None):
+ parser.error("Either --season or both --start/--end required")
+
+ asyncio.run(main_async(args))
+
+
+if __name__ == "__main__":
+ main()
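The summary metrics in the comparison script reduce to a few vector operations on the merged frame. A toy sketch with made-up margins (column names mirror the script's; the data is purely illustrative):

```python
import pandas as pd

# Toy merged frame: actual home margin, market-implied margin,
# and KenPom-predicted margin for four hypothetical games.
m = pd.DataFrame(
    {
        "actual_margin": [7, -3, 12, 1],
        "market_margin": [5, -1, 10, -2],
        "kenpom_margin": [8, -4, 15, 2],
    }
)

# Signed prediction errors for each source.
m["mkt_margin_err"] = m["market_margin"] - m["actual_margin"]
m["kp_margin_err"] = m["kenpom_margin"] - m["actual_margin"]

mkt_mae = m["mkt_margin_err"].abs().mean()   # mean absolute error
kp_mae = m["kp_margin_err"].abs().mean()
mkt_bias = m["mkt_margin_err"].mean()        # signed bias
kp_bias = m["kp_margin_err"].mean()

# Share of games where KenPom's margin was closer than the market's.
kp_closer = (m["kp_margin_err"].abs() < m["mkt_margin_err"].abs()).mean()
```

Note that MAE and bias answer different questions: a source can be unbiased on average while still missing badly game to game, which is why the summary table reports both.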
diff --git a/scripts/analysis/performance_dashboard.py b/scripts/analysis/performance_dashboard.py
new file mode 100644
index 000000000..d088729d4
--- /dev/null
+++ b/scripts/analysis/performance_dashboard.py
@@ -0,0 +1,89 @@
+"""CLI script for prediction performance dashboard.
+
+Displays cumulative performance, rolling metrics, calibration,
+CLV, and model health in a formatted terminal report.
+
+Usage:
+ uv run python scripts/analysis/performance_dashboard.py
+ uv run python scripts/analysis/performance_dashboard.py --as-of 2026-02-09
+ uv run python scripts/analysis/performance_dashboard.py --json data/reports/dashboard.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from datetime import date
+from pathlib import Path
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.services.performance_dashboard import (
+ PerformanceDashboard,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Entry point for performance dashboard CLI."""
+ parser = argparse.ArgumentParser(description="Prediction performance dashboard")
+ parser.add_argument(
+ "--as-of",
+ type=str,
+ default=None,
+ help="Dashboard as-of date (default: today)",
+ )
+ parser.add_argument(
+ "--json",
+ type=Path,
+ default=None,
+ help="Export dashboard data to JSON file",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+ parser.add_argument(
+ "--models-dir",
+ type=Path,
+ default=Path("models"),
+ help="Models directory",
+ )
+ args = parser.parse_args()
+
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s %(levelname)s %(message)s",
+ )
+
+ as_of = args.as_of or date.today().isoformat()
+
+ db = OddsAPIDatabase(str(args.db_path))
+ dashboard = PerformanceDashboard(db)
+
+ # JSON export mode
+ if args.json:
+ dashboard.export_json(as_of, args.json)
+ print(f"[OK] Dashboard exported to {args.json}")
+ return
+
+ # Build all dashboard components
+ snapshots = dashboard.build_snapshots(as_of)
+ calibration = dashboard.build_calibration_curve("2025-11-01", as_of)
+ confidence = dashboard.build_confidence_breakdown("2025-11-01", as_of)
+ model_health = dashboard.build_model_health(args.models_dir)
+
+ # Render and print
+ report = dashboard.render_cli_report(
+ snapshots=snapshots,
+ calibration=calibration,
+ confidence=confidence,
+ model_health=model_health,
+ )
+ print(report)
+
+
+if __name__ == "__main__":
+ main()
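The dashboard CLI passes `args.as_of` through unvalidated, so a malformed date would only surface later inside downstream queries. A minimal validation sketch (the `resolve_as_of` helper is hypothetical, not part of the module):

```python
from __future__ import annotations

from datetime import date


def resolve_as_of(raw: str | None) -> str:
    """Return a validated ISO date string, defaulting to today (sketch)."""
    if raw is None:
        return date.today().isoformat()
    # fromisoformat raises ValueError on malformed input instead of
    # letting a bad string flow into downstream date comparisons.
    return date.fromisoformat(raw).isoformat()
```

Failing fast at the CLI boundary keeps the error message attached to the flag the user actually typed.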
diff --git a/scripts/analysis/validate_data_quality.py b/scripts/analysis/validate_data_quality.py
new file mode 100644
index 000000000..e0f271649
--- /dev/null
+++ b/scripts/analysis/validate_data_quality.py
@@ -0,0 +1,593 @@
+"""Validate data quality across all sources for CLV tracking and backtesting.
+
+Checks:
+1. Team name mapping coverage (Odds API, ESPN, KenPom)
+2. Temporal field consistency (timezone, format, data types)
+3. Database schema alignment
+4. Data integrity (orphaned records, missing scores)
+5. Date range coverage
+
+Usage:
+ uv run python scripts/analysis/validate_data_quality.py
+
+ # Check specific date range
+ uv run python scripts/analysis/validate_data_quality.py --start 2025-12-01 --end 2026-01-31
+
+ # Export detailed report
+ uv run python scripts/analysis/validate_data_quality.py --output data/validation_report.txt
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from collections import defaultdict
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.core.team_mapper import TeamMapper
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+class DataQualityValidator:
+ """Validate data quality across all sources."""
+
+ def __init__(
+ self,
+ odds_db_path: Path,
+ kenpom_path: Path,
+ team_mapping_path: Path,
+ ):
+ """Initialize validator.
+
+ Args:
+ odds_db_path: Path to Odds API SQLite database
+ kenpom_path: Path to KenPom data directory
+ team_mapping_path: Path to team mapping parquet file
+ """
+ self.odds_db_path = odds_db_path
+ self.kenpom_path = kenpom_path
+ self.team_mapping_path = team_mapping_path
+ self.db = OddsAPIDatabase(str(odds_db_path))
+
+ self.issues: dict[str, list[str]] = defaultdict(list)
+ self.stats: dict[str, int | float | str] = {}
+
+ def validate_all(
+ self,
+ start_date: str | None = None,
+ end_date: str | None = None,
+ ) -> dict[str, Any]:
+ """Run all validation checks.
+
+ Args:
+ start_date: Optional start date for temporal checks (YYYY-MM-DD)
+ end_date: Optional end date for temporal checks (YYYY-MM-DD)
+
+ Returns:
+ Dictionary with validation results
+ """
+ logger.info("=" * 80)
+ logger.info("DATA QUALITY VALIDATION")
+ logger.info("=" * 80)
+
+ # 1. Database schema validation
+ logger.info("\n[1/6] Validating database schema...")
+ self._validate_database_schema()
+
+ # 2. Temporal field validation
+ logger.info("\n[2/6] Validating temporal fields...")
+ self._validate_temporal_fields(start_date, end_date)
+
+ # 3. Team name mapping validation
+ logger.info("\n[3/6] Validating team name mappings...")
+ self._validate_team_mappings()
+
+ # 4. Data integrity checks
+ logger.info("\n[4/6] Checking data integrity...")
+ self._validate_data_integrity()
+
+ # 5. KenPom data validation
+ logger.info("\n[5/6] Validating KenPom data...")
+ self._validate_kenpom_data()
+
+ # 6. Date range coverage
+ logger.info("\n[6/6] Checking date range coverage...")
+ self._validate_date_coverage()
+
+ # Generate summary
+ self._print_summary()
+
+ return {
+ "issues": dict(self.issues),
+ "stats": self.stats,
+ "passed": len(self.issues) == 0,
+ }
+
+ def _validate_database_schema(self) -> None:
+ """Validate SQLite database schema matches expectations."""
+ # Check events table
+ events_schema = pd.read_sql_query("PRAGMA table_info(events)", self.db.conn).set_index(
+ "name"
+ )
+
+ expected_events = {
+ "event_id": "TEXT",
+ "sport_key": "TEXT",
+ "home_team": "TEXT",
+ "away_team": "TEXT",
+ "commence_time": "TEXT",
+ }
+
+ for col, dtype in expected_events.items():
+ if col not in events_schema.index:
+ self.issues["schema"].append(f"Missing column in events: {col}")
+ elif events_schema.loc[col, "type"] != dtype:
+ self.issues["schema"].append(
+ f"Wrong type for events.{col}: "
+ f"expected {dtype}, got {events_schema.loc[col, 'type']}"
+ )
+
+ # Check observations table
+ obs_schema = pd.read_sql_query("PRAGMA table_info(observations)", self.db.conn).set_index(
+ "name"
+ )
+
+ expected_obs = {
+ "event_id": "TEXT",
+ "book_key": "TEXT",
+ "market_key": "TEXT",
+ "outcome_name": "TEXT",
+ "price_american": "INTEGER",
+ "point": "REAL",
+ "as_of": "TEXT",
+ "sport_key": "TEXT",
+ }
+
+ for col, _dtype in expected_obs.items():
+ if col not in obs_schema.index:
+ self.issues["schema"].append(f"Missing column in observations: {col}")
+
+ # Check scores table
+ scores_schema = pd.read_sql_query("PRAGMA table_info(scores)", self.db.conn).set_index(
+ "name"
+ )
+
+ expected_scores = {
+ "event_id": "TEXT",
+ "sport_key": "TEXT",
+ "completed": "INTEGER",
+ "home_score": "INTEGER",
+ "away_score": "INTEGER",
+ }
+
+ for col, _dtype in expected_scores.items():
+ if col not in scores_schema.index:
+ self.issues["schema"].append(f"Missing column in scores: {col}")
+
+ self.stats["events_columns"] = len(events_schema)
+ self.stats["observations_columns"] = len(obs_schema)
+ self.stats["scores_columns"] = len(scores_schema)
+
+ logger.info(f" Events table: {len(events_schema)} columns")
+ logger.info(f" Observations table: {len(obs_schema)} columns")
+ logger.info(f" Scores table: {len(scores_schema)} columns")
+
+ def _validate_temporal_fields(
+ self, start_date: str | None = None, end_date: str | None = None
+ ) -> None:
+ """Validate temporal fields for timezone and format consistency."""
+ # Check commence_time format and timezone
+ query = "SELECT event_id, commence_time FROM events LIMIT 1000"
+ if start_date and end_date:
+ query = f"""
+ SELECT event_id, commence_time
+ FROM events
+ WHERE DATE(commence_time) BETWEEN '{start_date}' AND '{end_date}'
+ """
+
+ events_df = pd.read_sql_query(query, self.db.conn)
+
+ # Parse commence_time
+ invalid_times = []
+ timezones_found = set()
+
+ for _idx, row in events_df.iterrows():
+ try:
+ dt = datetime.fromisoformat(row["commence_time"].replace("Z", "+00:00"))
+ # Check if timezone-aware
+ if dt.tzinfo is None:
+ self.issues["temporal"].append(
+ f"commence_time is timezone-naive: {row['event_id']}"
+ )
+ else:
+ timezones_found.add(str(dt.tzinfo))
+ except (ValueError, AttributeError) as e:
+ invalid_times.append((row["event_id"], str(e)))
+
+ if invalid_times:
+ self.issues["temporal"].append(
+ f"Invalid commence_time format in {len(invalid_times)} events"
+ )
+ for event_id, error in invalid_times[:5]: # Show first 5
+ self.issues["temporal"].append(f" {event_id}: {error}")
+
+ # Check if all times are in same timezone
+ if len(timezones_found) > 1:
+ self.issues["temporal"].append(
+ f"Multiple timezones found in commence_time: {timezones_found}"
+ )
+
+ self.stats["commence_time_timezone"] = (
+ list(timezones_found)[0] if len(timezones_found) == 1 else "MIXED"
+ )
+
+ # Check as_of in observations
+ obs_query = "SELECT DISTINCT as_of FROM observations LIMIT 1000"
+ obs_df = pd.read_sql_query(obs_query, self.db.conn)
+
+ obs_timezones = set()
+ for as_of in obs_df["as_of"]:
+ try:
+ dt = datetime.fromisoformat(as_of.replace("Z", "+00:00"))
+ if dt.tzinfo:
+ obs_timezones.add(str(dt.tzinfo))
+ except (ValueError, AttributeError):
+ pass
+
+ if len(obs_timezones) > 1:
+ self.issues["temporal"].append(
+ f"Multiple timezones in observations.as_of: {obs_timezones}"
+ )
+
+ self.stats["as_of_timezone"] = (
+ list(obs_timezones)[0] if len(obs_timezones) == 1 else "MIXED"
+ )
+
+ logger.info(f" Commence time timezone: {self.stats['commence_time_timezone']}")
+ logger.info(f" Observations timezone: {self.stats['as_of_timezone']}")
+
+ def _validate_team_mappings(self) -> None:
+ """Validate team name mappings across sources."""
+ # Load team mapping
+ try:
+ mapper = TeamMapper(read_parquet_df(self.team_mapping_path))
+ mapping_df = mapper.mapping
+ self.stats["mapped_teams"] = len(mapping_df)
+ logger.info(f" Team mapping loaded: {len(mapping_df)} teams")
+ except FileNotFoundError:
+ self.issues["team_mapping"].append(
+ f"Team mapping file not found: {self.team_mapping_path}"
+ )
+ return
+
+ # Get unique teams from Odds API
+ odds_teams_query = """
+ SELECT DISTINCT home_team as team FROM events
+ UNION
+ SELECT DISTINCT away_team as team FROM events
+ """
+ odds_teams = pd.read_sql_query(odds_teams_query, self.db.conn)["team"].tolist()
+
+ self.stats["odds_api_teams"] = len(odds_teams)
+ logger.info(f" Odds API teams: {len(odds_teams)}")
+
+ # Check mapping coverage
+ unmapped_odds = []
+ for team in odds_teams:
+ kenpom_name = mapper.get_kenpom_name(team, source="odds_api")
+ # get_kenpom_name falls back to the input name when no mapping exists,
+ # so an unchanged name means unmapped unless it already is a KenPom name
+ if kenpom_name == team and team not in mapping_df["kenpom_name"].values:
+ unmapped_odds.append(team)
+
+ if unmapped_odds:
+ self.issues["team_mapping"].append(
+ f"Unmapped Odds API teams ({len(unmapped_odds)}): {unmapped_odds[:10]}"
+ )
+ self.stats["unmapped_odds_teams"] = len(unmapped_odds)
+ else:
+ logger.info(" [OK] All Odds API teams are mapped")
+ self.stats["unmapped_odds_teams"] = 0
+
+ # Check for KenPom teams not in mapping
+ kenpom_ratings_path = self.kenpom_path / "ratings" / "season" / "ratings_2026.parquet"
+ if kenpom_ratings_path.exists():
+ kenpom_df = read_parquet_df(kenpom_ratings_path)
+ kenpom_teams = kenpom_df["TeamName"].unique().tolist()
+
+ self.stats["kenpom_teams"] = len(kenpom_teams)
+ logger.info(f" KenPom teams: {len(kenpom_teams)}")
+
+ # Teams in KenPom but not in mapping
+ unmapped_kenpom = [t for t in kenpom_teams if t not in mapping_df["kenpom_name"].values]
+
+ if unmapped_kenpom:
+ self.issues["team_mapping"].append(
+ f"KenPom teams not in mapping ({len(unmapped_kenpom)}): {unmapped_kenpom[:10]}"
+ )
+ self.stats["unmapped_kenpom_teams"] = len(unmapped_kenpom)
+ else:
+ logger.info(" [OK] All KenPom teams are in mapping")
+ self.stats["unmapped_kenpom_teams"] = 0
+
+ def _validate_data_integrity(self) -> None:
+ """Check for orphaned records and data consistency."""
+ # Count records
+ events_count = pd.read_sql_query("SELECT COUNT(*) as cnt FROM events", self.db.conn)["cnt"][
+ 0
+ ]
+ obs_count = pd.read_sql_query("SELECT COUNT(*) as cnt FROM observations", self.db.conn)[
+ "cnt"
+ ][0]
+ scores_count = pd.read_sql_query("SELECT COUNT(*) as cnt FROM scores", self.db.conn)["cnt"][
+ 0
+ ]
+
+ self.stats["total_events"] = events_count
+ self.stats["total_observations"] = obs_count
+ self.stats["total_scores"] = scores_count
+
+ logger.info(f" Events: {events_count:,}")
+ logger.info(f" Observations: {obs_count:,}")
+ logger.info(f" Scores: {scores_count:,}")
+
+ # Check for observations without events
+ orphaned_obs_query = """
+ SELECT COUNT(*) as cnt
+ FROM observations o
+ LEFT JOIN events e ON o.event_id = e.event_id
+ WHERE e.event_id IS NULL
+ """
+ orphaned_obs = pd.read_sql_query(orphaned_obs_query, self.db.conn)["cnt"][0]
+
+ if orphaned_obs > 0:
+ self.issues["integrity"].append(
+ f"Orphaned observations (no matching event): {orphaned_obs}"
+ )
+ self.stats["orphaned_observations"] = orphaned_obs
+ else:
+ self.stats["orphaned_observations"] = 0
+
+ # Check for scores without events
+ orphaned_scores_query = """
+ SELECT COUNT(*) as cnt
+ FROM scores s
+ LEFT JOIN events e ON s.event_id = e.event_id
+ WHERE e.event_id IS NULL
+ """
+ orphaned_scores = pd.read_sql_query(orphaned_scores_query, self.db.conn)["cnt"][0]
+
+ if orphaned_scores > 0:
+ self.issues["integrity"].append(
+ f"Orphaned scores (no matching event): {orphaned_scores}"
+ )
+ self.stats["orphaned_scores"] = orphaned_scores
+ else:
+ self.stats["orphaned_scores"] = 0
+
+ # Check for events with scores but no odds
+ events_with_scores_no_odds_query = """
+ SELECT COUNT(DISTINCT s.event_id) as cnt
+ FROM scores s
+ INNER JOIN events e ON s.event_id = e.event_id
+ LEFT JOIN observations o ON s.event_id = o.event_id
+ WHERE o.event_id IS NULL
+ AND s.completed = 1
+ """
+ no_odds = pd.read_sql_query(events_with_scores_no_odds_query, self.db.conn)["cnt"][0]
+
+ if no_odds > 0:
+ self.issues["integrity"].append(f"Completed games with scores but no odds: {no_odds}")
+ self.stats["scores_without_odds"] = no_odds
+ else:
+ self.stats["scores_without_odds"] = 0
+
+ # Check for events with odds but no scores (expected for future games)
+ events_with_odds_no_scores_query = """
+ SELECT COUNT(DISTINCT e.event_id) as cnt
+ FROM events e
+ INNER JOIN observations o ON e.event_id = o.event_id
+ LEFT JOIN scores s ON e.event_id = s.event_id
+ WHERE s.event_id IS NULL
+ AND DATE(e.commence_time) < DATE('now')
+ """
+ past_no_scores = pd.read_sql_query(events_with_odds_no_scores_query, self.db.conn)["cnt"][0]
+
+ if past_no_scores > 0:
+ self.issues["integrity"].append(f"Past games with odds but no scores: {past_no_scores}")
+ self.stats["past_games_no_scores"] = past_no_scores
+ else:
+ self.stats["past_games_no_scores"] = 0
+
+ def _validate_kenpom_data(self) -> None:
+ """Validate KenPom data availability and quality."""
+ # Check for ratings file
+ ratings_path = self.kenpom_path / "ratings" / "season" / "ratings_2026.parquet"
+ if not ratings_path.exists():
+ self.issues["kenpom"].append(f"KenPom ratings not found: {ratings_path}")
+ return
+
+ ratings_df = read_parquet_df(ratings_path)
+ self.stats["kenpom_teams_with_ratings"] = len(ratings_df)
+
+ # Check for required columns
+ required_cols = ["TeamName", "AdjEM", "AdjOE", "AdjDE", "AdjTempo"]
+ missing_cols = [col for col in required_cols if col not in ratings_df.columns]
+
+ if missing_cols:
+ self.issues["kenpom"].append(f"Missing KenPom columns: {missing_cols}")
+
+ # Check for NaN values in key metrics
+ for col in ["AdjEM", "AdjOE", "AdjDE"]:
+ if col in ratings_df.columns:
+ nan_count = ratings_df[col].isna().sum()
+ if nan_count > 0:
+ self.issues["kenpom"].append(f"NaN values in {col}: {nan_count} teams")
+
+ logger.info(f" KenPom teams with ratings: {len(ratings_df)}")
+
+ # Check four factors
+ ff_path = self.kenpom_path / "four-factors" / "season" / "four-factors_2026.parquet"
+ if not ff_path.exists():
+ self.issues["kenpom"].append(f"Four factors not found: {ff_path}")
+ else:
+ ff_df = read_parquet_df(ff_path)
+ self.stats["kenpom_teams_with_ff"] = len(ff_df)
+ logger.info(f" KenPom teams with four factors: {len(ff_df)}")
+
+ def _validate_date_coverage(self) -> None:
+ """Check date range coverage across sources."""
+ # Events date range
+ date_range_query = """
+ SELECT
+ MIN(DATE(commence_time)) as earliest,
+ MAX(DATE(commence_time)) as latest,
+ COUNT(DISTINCT DATE(commence_time)) as unique_dates
+ FROM events
+ """
+ date_range = pd.read_sql_query(date_range_query, self.db.conn).iloc[0]
+
+ self.stats["earliest_event"] = date_range["earliest"]
+ self.stats["latest_event"] = date_range["latest"]
+ self.stats["unique_event_dates"] = date_range["unique_dates"]
+
+ logger.info(f" Events date range: {date_range['earliest']} to {date_range['latest']}")
+ logger.info(f" Unique dates with events: {date_range['unique_dates']}")
+
+ # Scores date range
+ scores_range_query = """
+ SELECT
+ MIN(DATE(e.commence_time)) as earliest,
+ MAX(DATE(e.commence_time)) as latest,
+ COUNT(DISTINCT DATE(e.commence_time)) as unique_dates
+ FROM scores s
+ INNER JOIN events e ON s.event_id = e.event_id
+ WHERE s.completed = 1
+ """
+ scores_range = pd.read_sql_query(scores_range_query, self.db.conn).iloc[0]
+
+ self.stats["earliest_score"] = scores_range["earliest"]
+ self.stats["latest_score"] = scores_range["latest"]
+ self.stats["unique_score_dates"] = scores_range["unique_dates"]
+
+ logger.info(f" Scores date range: {scores_range['earliest']} to {scores_range['latest']}")
+ logger.info(f" Unique dates with scores: {scores_range['unique_dates']}")
+
+ def _print_summary(self) -> None:
+ """Print validation summary."""
+ logger.info("\n" + "=" * 80)
+ logger.info("VALIDATION SUMMARY")
+ logger.info("=" * 80)
+
+ if len(self.issues) == 0:
+ logger.info("\n[OK] All validation checks passed!")
+ else:
+ logger.info(f"\n[WARNING] Found {len(self.issues)} issue categories:")
+ for category, issue_list in self.issues.items():
+ logger.info(f"\n{category.upper()} ({len(issue_list)} issues):")
+ for issue in issue_list:
+ logger.info(f" - {issue}")
+
+ logger.info("\n" + "=" * 80)
+
+ def export_report(self, output_path: Path) -> None:
+ """Export detailed validation report to file.
+
+ Args:
+ output_path: Path to save report
+ """
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+
+ with open(output_path, "w") as f:
+ f.write("DATA QUALITY VALIDATION REPORT\n")
+ f.write("=" * 80 + "\n")
+ f.write(f"Generated: {datetime.now().isoformat()}\n\n")
+
+ f.write("STATISTICS\n")
+ f.write("-" * 80 + "\n")
+ for key, value in sorted(self.stats.items()):
+ f.write(f"{key}: {value}\n")
+
+ f.write("\n\nISSUES\n")
+ f.write("-" * 80 + "\n")
+ if len(self.issues) == 0:
+ f.write("No issues found!\n")
+ else:
+ for category, issue_list in self.issues.items():
+ f.write(f"\n{category.upper()}:\n")
+ for issue in issue_list:
+ f.write(f" - {issue}\n")
+
+ logger.info(f"\nReport exported to: {output_path}")
+
+
+def main() -> None:
+ """Run data quality validation."""
+ parser = argparse.ArgumentParser(description="Validate data quality for CLV tracking")
+ parser.add_argument(
+ "--odds-db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to Odds API database",
+ )
+ parser.add_argument(
+ "--kenpom-path",
+ type=Path,
+ default=Path("data/kenpom"),
+ help="Path to KenPom data directory",
+ )
+ parser.add_argument(
+ "--team-mapping",
+ type=Path,
+ default=Path("data/staging/mappings/team_mapping.parquet"),
+ help="Path to team mapping file",
+ )
+ parser.add_argument(
+ "--start",
+ type=str,
+ help="Start date for temporal checks (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end",
+ type=str,
+ help="End date for temporal checks (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ help="Path to export detailed report",
+ )
+
+ args = parser.parse_args()
+
+ # Initialize validator
+ validator = DataQualityValidator(
+ odds_db_path=args.odds_db,
+ kenpom_path=args.kenpom_path,
+ team_mapping_path=args.team_mapping,
+ )
+
+ # Run validation
+ results = validator.validate_all(start_date=args.start, end_date=args.end)
+
+ # Export report if requested
+ if args.output:
+ validator.export_report(args.output)
+
+ # Exit with error if validation failed
+ if not results["passed"]:
+ logger.error("\n[ERROR] Validation failed! See issues above.")
+ raise SystemExit(1)
+ else:
+ logger.info("\n[OK] All validation checks passed!")
+
+
+if __name__ == "__main__":
+ main()
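The orphaned-record queries can be exercised against a throwaway in-memory database. A self-contained sketch of the observations-without-events check, with the schema reduced to just the join columns (the fixture data is made up):

```python
import sqlite3

# In-memory fixture mirroring the events/observations join shape.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT PRIMARY KEY);
    CREATE TABLE observations (event_id TEXT);
    INSERT INTO events VALUES ('e1');
    INSERT INTO observations VALUES ('e1'), ('e1'), ('ghost');
    """
)

# Same LEFT JOIN shape the validator uses: observations whose
# event_id has no matching row in events are orphans.
(orphans,) = conn.execute(
    """
    SELECT COUNT(*) FROM observations o
    LEFT JOIN events e ON o.event_id = e.event_id
    WHERE e.event_id IS NULL
    """
).fetchone()
```

The `LEFT JOIN ... WHERE right-side IS NULL` idiom is the standard anti-join for this kind of referential-integrity check; here the lone `'ghost'` row is counted as the single orphan.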
diff --git a/scripts/analysis/validate_matchups.py b/scripts/analysis/validate_matchups.py
new file mode 100644
index 000000000..cba1b12d0
--- /dev/null
+++ b/scripts/analysis/validate_matchups.py
@@ -0,0 +1,357 @@
+"""Validate matchups across all data sources.
+
+This script checks for:
+- Consistent team names via canonical mapping
+- Proper date/time normalization
+- Home/away designation conflicts
+- Time discrepancies between sources
+
+Usage:
+ uv run python scripts/analysis/validate_matchups.py
+
+Output:
+ - Validation report with conflicts and warnings
+ - Summary statistics
+"""
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.core.matchup import (
+ MatchupValidator,
+ normalize_dataframe_times,
+)
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def load_team_mapping() -> pd.DataFrame:
+ """Load canonical team mapping."""
+ mapping_path = Path("data/processed/team_mapping.parquet")
+ if not mapping_path.exists():
+ raise FileNotFoundError(f"Team mapping not found: {mapping_path}")
+
+ df = read_parquet_df(str(mapping_path))
+ logger.info(f"Loaded team mapping with {len(df)} teams")
+ return df
+
+
+def load_overtime_data() -> pd.DataFrame | None:
+ """Load most recent Overtime.ag data."""
+ overtime_dir = Path("data/overtime")
+ if not overtime_dir.exists():
+ logger.warning("No Overtime.ag data directory found")
+ return None
+
+ parquet_files = list(overtime_dir.glob("*.parquet"))
+ if not parquet_files:
+ logger.warning("No Overtime.ag parquet files found")
+ return None
+
+ # Get most recent file
+ latest = max(parquet_files, key=lambda p: p.stat().st_mtime)
+ logger.info(f"Loading Overtime.ag data from {latest.name}")
+
+ df = read_parquet_df(str(latest))
+ logger.info(f" Loaded {len(df)} events from Overtime.ag")
+ return df
+
+
+def load_odds_api_data() -> pd.DataFrame | None:
+ """Load most recent Odds API data."""
+ odds_dir = Path("data/odds_api/sample")
+ if not odds_dir.exists():
+ logger.warning("No Odds API sample directory found")
+ return None
+
+ parquet_files = list(odds_dir.glob("*.parquet"))
+ if not parquet_files:
+ logger.warning("No Odds API sample files found")
+ return None
+
+ # Get most recent file
+ latest = max(parquet_files, key=lambda p: p.stat().st_mtime)
+ logger.info(f"Loading Odds API data from {latest.name}")
+
+ df = read_parquet_df(str(latest))
+ logger.info(f" Loaded {len(df)} events from Odds API")
+ return df
+
+
+def validate_overtime_matchups(df: pd.DataFrame, validator: MatchupValidator) -> pd.DataFrame:
+ """Validate Overtime.ag matchups.
+
+ Args:
+ df: Overtime.ag DataFrame
+ validator: MatchupValidator instance
+
+ Returns:
+ DataFrame with validation results
+ """
+ logger.info("Validating Overtime.ag matchups...")
+
+ # Normalize timestamps
+ df = normalize_dataframe_times(df, "captured_at", output_prefix="captured")
+
+ # Get canonical team IDs
+ df["home_team_id"] = df["home_team"].apply(
+ lambda x: validator.get_canonical_team_id(x, "overtime")
+ )
+ df["away_team_id"] = df["away_team"].apply(
+ lambda x: validator.get_canonical_team_id(x, "overtime")
+ )
+
+ # Check for unmapped teams
+ unmapped_home = df[df["home_team_id"].isna()]["home_team"].unique()
+ unmapped_away = df[df["away_team_id"].isna()]["away_team"].unique()
+
+ if len(unmapped_home) > 0:
+ logger.warning(f"Unmapped home teams: {list(unmapped_home)}")
+ if len(unmapped_away) > 0:
+ logger.warning(f"Unmapped away teams: {list(unmapped_away)}")
+
+ # Filter to mapped teams
+ df_mapped = df[df["home_team_id"].notna() & df["away_team_id"].notna()].copy()
+ logger.info(f" {len(df_mapped)} events with mapped teams")
+
+ # Create matchup keys
+ df_mapped["matchup_key"] = df_mapped.apply(
+ lambda row: validator.create_matchup_key(
+ int(row["home_team_id"]), int(row["away_team_id"]), row["captured_date"]
+ ),
+ axis=1,
+ )
+
+ return df_mapped
+
+
+def validate_odds_api_matchups(df: pd.DataFrame, validator: MatchupValidator) -> pd.DataFrame:
+ """Validate Odds API matchups.
+
+ Args:
+ df: Odds API DataFrame
+ validator: MatchupValidator instance
+
+ Returns:
+ DataFrame with validation results
+ """
+ logger.info("Validating Odds API matchups...")
+
+ # Normalize timestamps
+ df = normalize_dataframe_times(df, "commence_time", output_prefix="game")
+
+ # Get canonical team IDs
+ df["home_team_id"] = df["home_team"].apply(
+ lambda x: validator.get_canonical_team_id(x, "odds_api")
+ )
+ df["away_team_id"] = df["away_team"].apply(
+ lambda x: validator.get_canonical_team_id(x, "odds_api")
+ )
+
+ # Check for unmapped teams
+ unmapped_home = df[df["home_team_id"].isna()]["home_team"].unique()
+ unmapped_away = df[df["away_team_id"].isna()]["away_team"].unique()
+
+ if len(unmapped_home) > 0:
+ logger.warning(f"Unmapped home teams: {list(unmapped_home)}")
+ if len(unmapped_away) > 0:
+ logger.warning(f"Unmapped away teams: {list(unmapped_away)}")
+
+ # Filter to mapped teams
+ df_mapped = df[df["home_team_id"].notna() & df["away_team_id"].notna()].copy()
+ logger.info(f" {len(df_mapped)} events with mapped teams")
+
+ # Create matchup keys
+ df_mapped["matchup_key"] = df_mapped.apply(
+ lambda row: validator.create_matchup_key(
+ int(row["home_team_id"]), int(row["away_team_id"]), row["game_date"]
+ ),
+ axis=1,
+ )
+
+ return df_mapped
+
+
+def find_common_matchups(overtime_df: pd.DataFrame, odds_api_df: pd.DataFrame) -> pd.DataFrame:
+ """Find matchups that appear in both data sources.
+
+ Args:
+ overtime_df: Validated Overtime.ag DataFrame
+ odds_api_df: Validated Odds API DataFrame
+
+ Returns:
+ DataFrame with common matchups and comparison
+ """
+ logger.info("Finding common matchups across sources...")
+
+ # Get common matchup keys
+ overtime_keys = set(overtime_df["matchup_key"].unique())
+ odds_api_keys = set(odds_api_df["matchup_key"].unique())
+ common_keys = overtime_keys & odds_api_keys
+
+ logger.info(f" Overtime.ag matchups: {len(overtime_keys)}")
+ logger.info(f" Odds API matchups: {len(odds_api_keys)}")
+ logger.info(f" Common matchups: {len(common_keys)}")
+
+ if len(common_keys) == 0:
+ logger.warning("No common matchups found!")
+ return pd.DataFrame()
+
+ # Get details for common matchups
+ overtime_common = overtime_df[overtime_df["matchup_key"].isin(common_keys)].copy()
+ odds_api_common = odds_api_df[odds_api_df["matchup_key"].isin(common_keys)].copy()
+
+ # Merge on matchup key
+ merged = overtime_common.merge(
+ odds_api_common,
+ on="matchup_key",
+ how="inner",
+ suffixes=("_overtime", "_odds"),
+ )
+
+ return merged
+
+
+def check_time_discrepancies(common_df: pd.DataFrame) -> pd.DataFrame:
+ """Check for time discrepancies between sources.
+
+ Args:
+ common_df: DataFrame with common matchups
+
+ Returns:
+ DataFrame with time discrepancy analysis
+ """
+ if len(common_df) == 0:
+ return pd.DataFrame()
+
+ logger.info("Checking time discrepancies...")
+
+ # Calculate time differences (both should be in UTC)
+ common_df["time_diff_hours"] = (
+ common_df["game_datetime_utc"] - common_df["captured_datetime_utc"]
+ ).dt.total_seconds() / 3600
+
+ # Flag significant discrepancies (> 2 hours)
+ common_df["time_discrepancy"] = common_df["time_diff_hours"].abs() > 2
+
+ discrepancies = common_df[common_df["time_discrepancy"]]
+ if len(discrepancies) > 0:
+ logger.warning(f"Found {len(discrepancies)} matchups with >2 hour time discrepancy")
+ for _, row in discrepancies.head(5).iterrows():
+ logger.warning(
+ f" {row['home_team_overtime']} vs {row['away_team_overtime']}: "
+ f"{row['time_diff_hours']:.1f} hours"
+ )
+ else:
+ logger.info("No significant time discrepancies found")
+
+ return common_df
+
+
+def generate_validation_report(
+ overtime_df: pd.DataFrame | None,
+ odds_api_df: pd.DataFrame | None,
+ common_df: pd.DataFrame,
+) -> None:
+ """Generate validation report.
+
+ Args:
+ overtime_df: Validated Overtime.ag DataFrame
+ odds_api_df: Validated Odds API DataFrame
+ common_df: Common matchups DataFrame
+ """
+ print("\n" + "=" * 80)
+ print(" MATCHUP VALIDATION REPORT")
+ print("=" * 80)
+
+ if overtime_df is None and odds_api_df is None:
+ print("\n[WARNING] No data sources available for validation")
+ return
+
+ print("\nDATA SOURCES:")
+ if overtime_df is not None:
+ print(f" Overtime.ag: {len(overtime_df)} matchups")
+ date_min = overtime_df["captured_date"].min()
+ date_max = overtime_df["captured_date"].max()
+ print(f" Date range: {date_min} to {date_max}")
+ else:
+ print(" Overtime.ag: No data")
+
+ if odds_api_df is not None:
+ print(f" Odds API: {len(odds_api_df)} matchups")
+ print(
+ f" Date range: {odds_api_df['game_date'].min()} to {odds_api_df['game_date'].max()}"
+ )
+ else:
+ print(" Odds API: No data")
+
+ if len(common_df) > 0:
+ print(f"\nCOMMON MATCHUPS: {len(common_df)}")
+        overtime_count = len(overtime_df) if overtime_df is not None else 0
+        odds_api_count = len(odds_api_df) if odds_api_df is not None else 0
+        total = max(overtime_count, odds_api_count)
+        coverage = len(common_df) / total * 100 if total > 0 else 0.0
+ print(f" Coverage: {coverage:.1f}%")
+
+ # Time discrepancy summary
+ if "time_discrepancy" in common_df.columns:
+ discrepancies = common_df["time_discrepancy"].sum()
+ print("\nTIME DISCREPANCIES:")
+ print(f" Matchups with >2 hour difference: {discrepancies}")
+ if discrepancies == 0:
+ print(" [OK] All times align within 2-hour window")
+
+ # Sample matchups
+ print("\nSAMPLE MATCHUPS (First 5):")
+ print("-" * 80)
+ for _, row in common_df.head(5).iterrows():
+ print(f" {row['home_team_overtime']} vs {row['away_team_overtime']}")
+ print(f" Matchup key: {row['matchup_key']}")
+ if "game_datetime_utc" in row:
+ print(f" Odds API time: {row['game_datetime_utc']}")
+ if "captured_datetime_utc" in row:
+ print(f" Overtime time: {row['captured_datetime_utc']}")
+ print()
+
+ print("=" * 80)
+
+
+def main() -> None:
+ """Run matchup validation."""
+ logger.info("Starting matchup validation...")
+
+ # Load team mapping
+ team_mapping = load_team_mapping()
+ validator = MatchupValidator(team_mapping)
+
+ # Load data sources
+ overtime_df = load_overtime_data()
+ odds_api_df = load_odds_api_data()
+
+ # Validate each source
+ validated_overtime = None
+ validated_odds_api = None
+
+ if overtime_df is not None:
+ validated_overtime = validate_overtime_matchups(overtime_df, validator)
+
+ if odds_api_df is not None:
+ validated_odds_api = validate_odds_api_matchups(odds_api_df, validator)
+
+ # Find common matchups
+ common_df = pd.DataFrame()
+ if validated_overtime is not None and validated_odds_api is not None:
+ common_df = find_common_matchups(validated_overtime, validated_odds_api)
+ if len(common_df) > 0:
+ common_df = check_time_discrepancies(common_df)
+
+ # Generate report
+ generate_validation_report(validated_overtime, validated_odds_api, common_df)
+
+ logger.info("\nValidation complete!")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/analysis/validate_predictions.py b/scripts/analysis/validate_predictions.py
new file mode 100644
index 000000000..b6a231ca5
--- /dev/null
+++ b/scripts/analysis/validate_predictions.py
@@ -0,0 +1,361 @@
+"""
+Validate model predictions against actual game results.
+
+Calculates performance metrics:
+- MAE (Mean Absolute Error) for scores and totals
+- RMSE (Root Mean Squared Error)
+- Bias (systematic over/under prediction)
+- Spread cover accuracy
+- Total over/under accuracy
+- Calibration quality
+
+Usage:
+ uv run python scripts/analysis/validate_predictions.py --date 2026-02-07
+ uv run python scripts/analysis/validate_predictions.py --predictions predictions/2026-02-07.csv
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from datetime import date
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import write_csv
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def load_predictions(predictions_file: Path) -> pd.DataFrame:
+ """Load predictions from CSV."""
+ if not predictions_file.exists():
+ raise FileNotFoundError(f"Predictions file not found: {predictions_file}")
+
+ df = pd.read_csv(predictions_file)
+ logger.info(f"Loaded {len(df)} predictions from {predictions_file}")
+
+ return df
+
+
+def get_actual_scores(db: OddsAPIDatabase, predictions: pd.DataFrame) -> pd.DataFrame:
+ """Get actual scores for predicted games."""
+ # Get event IDs or match by team names
+ results = []
+
+    for _, pred in predictions.iterrows():
+        # Try to find matching game by team names and date proximity.
+        # This is approximate - ideally use event_id. Team tokens are bound
+        # as query parameters rather than interpolated into the SQL string.
+        query = """
+            SELECT
+                e.event_id,
+                e.home_team,
+                e.away_team,
+                e.commence_time,
+                s.home_score,
+                s.away_score
+            FROM events e
+            JOIN scores s ON e.event_id = s.event_id
+            WHERE e.home_team LIKE '%' || ? || '%'
+              AND e.away_team LIKE '%' || ? || '%'
+              AND s.home_score IS NOT NULL
+              AND s.away_score IS NOT NULL
+            ORDER BY ABS(
+                JULIANDAY(e.commence_time) - JULIANDAY('now')
+            ) ASC
+            LIMIT 1
+        """
+
+        game = pd.read_sql_query(
+            query,
+            db.conn,
+            params=(pred["home_team"].split()[-1], pred["away_team"].split()[-1]),
+        )
+
+ if len(game) > 0:
+ result = {
+ "home_team": pred["home_team"],
+ "away_team": pred["away_team"],
+ "predicted_home_score": pred["predicted_home_score"],
+ "predicted_away_score": pred["predicted_away_score"],
+ "predicted_total": pred["predicted_total"],
+ "predicted_margin": pred["predicted_margin"],
+ "actual_home_score": game.iloc[0]["home_score"],
+ "actual_away_score": game.iloc[0]["away_score"],
+ "actual_total": (game.iloc[0]["home_score"] + game.iloc[0]["away_score"]),
+ "actual_margin": (game.iloc[0]["home_score"] - game.iloc[0]["away_score"]),
+ "favorite_team": pred["favorite_team"],
+ "spread_magnitude": pred["spread_magnitude"],
+ "total_points": pred["total_points"],
+ }
+ results.append(result)
+
+ if not results:
+ logger.warning("No matching games with scores found")
+ return pd.DataFrame()
+
+ df = pd.DataFrame(results)
+ logger.info(f"Found actual scores for {len(df)} games")
+
+ return df
+
+
+def calculate_metrics(results: pd.DataFrame) -> tuple[dict, pd.DataFrame]:
+    """Calculate prediction performance metrics and annotated results."""
+ metrics = {}
+
+ # Score prediction metrics
+ metrics["home_mae"] = np.mean(
+ np.abs(results["predicted_home_score"] - results["actual_home_score"])
+ )
+ metrics["away_mae"] = np.mean(
+ np.abs(results["predicted_away_score"] - results["actual_away_score"])
+ )
+ metrics["home_rmse"] = np.sqrt(
+ np.mean((results["predicted_home_score"] - results["actual_home_score"]) ** 2)
+ )
+ metrics["away_rmse"] = np.sqrt(
+ np.mean((results["predicted_away_score"] - results["actual_away_score"]) ** 2)
+ )
+
+ # Bias (positive = overpredicting, negative = underpredicting)
+ metrics["home_bias"] = np.mean(results["predicted_home_score"] - results["actual_home_score"])
+ metrics["away_bias"] = np.mean(results["predicted_away_score"] - results["actual_away_score"])
+
+ # Total prediction metrics
+ metrics["total_mae"] = np.mean(np.abs(results["predicted_total"] - results["actual_total"]))
+ metrics["total_rmse"] = np.sqrt(
+ np.mean((results["predicted_total"] - results["actual_total"]) ** 2)
+ )
+ metrics["total_bias"] = np.mean(results["predicted_total"] - results["actual_total"])
+
+ # Margin prediction metrics
+ metrics["margin_mae"] = np.mean(np.abs(results["predicted_margin"] - results["actual_margin"]))
+ metrics["margin_rmse"] = np.sqrt(
+ np.mean((results["predicted_margin"] - results["actual_margin"]) ** 2)
+ )
+ metrics["margin_bias"] = np.mean(results["predicted_margin"] - results["actual_margin"])
+
+    # Spread accuracy: did the favorite cover, and did the model predict it?
+    results["favorite_covered"] = False
+    results["predicted_favorite_cover"] = False
+    for idx, row in results.iterrows():
+        # Favorite's margin is the home margin, negated when the away team is favored
+        sign = 1 if row["favorite_team"] == row["home_team"] else -1
+        results.loc[idx, "favorite_covered"] = (
+            sign * row["actual_margin"] > row["spread_magnitude"]
+        )
+        results.loc[idx, "predicted_favorite_cover"] = (
+            sign * row["predicted_margin"] > row["spread_magnitude"]
+        )
+
+ metrics["spread_accuracy"] = (
+ results["favorite_covered"] == results["predicted_favorite_cover"]
+ ).mean()
+
+ # Total accuracy (did it go over?)
+ results["went_over"] = results["actual_total"] > results["total_points"]
+ results["predicted_over"] = results["predicted_total"] > results["total_points"]
+ metrics["total_accuracy"] = (results["went_over"] == results["predicted_over"]).mean()
+
+ # Market vs model comparison
+ metrics["avg_market_total"] = results["total_points"].mean()
+ metrics["avg_actual_total"] = results["actual_total"].mean()
+ metrics["avg_predicted_total"] = results["predicted_total"].mean()
+
+ # Market performance (how far off was market?)
+ metrics["market_total_mae"] = np.mean(np.abs(results["total_points"] - results["actual_total"]))
+ metrics["market_total_bias"] = np.mean(results["total_points"] - results["actual_total"])
+
+ return metrics, results
+
+
+def print_metrics_report(metrics: dict, results: pd.DataFrame) -> None:
+ """Print formatted metrics report."""
+ print("\n" + "=" * 70)
+ print("MODEL VALIDATION REPORT")
+ print("=" * 70)
+ print(f"\nGames Analyzed: {len(results)}")
+
+ print("\n--- SCORE PREDICTION METRICS ---")
+ print(f"Home Score MAE: {metrics['home_mae']:.2f} points")
+ print(f"Away Score MAE: {metrics['away_mae']:.2f} points")
+ print(f"Home Score RMSE: {metrics['home_rmse']:.2f} points")
+ print(f"Away Score RMSE: {metrics['away_rmse']:.2f} points")
+
+ print("\n--- BIAS (Systematic Over/Under Prediction) ---")
+ print(
+ f"Home Score Bias: {metrics['home_bias']:+.2f} "
+ f"({'over' if metrics['home_bias'] > 0 else 'under'}predicting)"
+ )
+ print(
+ f"Away Score Bias: {metrics['away_bias']:+.2f} "
+ f"({'over' if metrics['away_bias'] > 0 else 'under'}predicting)"
+ )
+ print(
+ f"Total Bias: {metrics['total_bias']:+.2f} "
+ f"({'over' if metrics['total_bias'] > 0 else 'under'}predicting)"
+ )
+ print(
+ f"Margin Bias: {metrics['margin_bias']:+.2f} "
+ f"({'over' if metrics['margin_bias'] > 0 else 'under'}predicting home)"
+ )
+
+ print("\n--- TOTAL PREDICTION ---")
+ print(f"Total MAE: {metrics['total_mae']:.2f} points")
+ print(f"Total RMSE: {metrics['total_rmse']:.2f} points")
+ print(f"Total Accuracy (O/U): {metrics['total_accuracy']:.1%}")
+
+ print("\n--- MARGIN PREDICTION ---")
+ print(f"Margin MAE: {metrics['margin_mae']:.2f} points")
+ print(f"Margin RMSE: {metrics['margin_rmse']:.2f} points")
+
+ print("\n--- BETTING PERFORMANCE ---")
+ print(f"Spread Cover Accuracy: {metrics['spread_accuracy']:.1%}")
+ print(f"Total O/U Accuracy: {metrics['total_accuracy']:.1%}")
+
+ print("\n--- MARKET COMPARISON ---")
+ print(f"Market Total (avg): {metrics['avg_market_total']:.1f}")
+ print(f"Actual Total (avg): {metrics['avg_actual_total']:.1f}")
+ print(f"Model Total (avg): {metrics['avg_predicted_total']:.1f}")
+ print(f"\nMarket Total MAE: {metrics['market_total_mae']:.2f} points")
+ print(
+ f"Model Total MAE: {metrics['total_mae']:.2f} points "
+ f"({'better' if metrics['total_mae'] < metrics['market_total_mae'] else 'worse'}"
+ " than market)"
+ )
+ print(f"\nMarket Total Bias: {metrics['market_total_bias']:+.2f}")
+ print(f"Model Total Bias: {metrics['total_bias']:+.2f}")
+
+ print("\n--- CALIBRATION ASSESSMENT ---")
+ if abs(metrics["total_bias"]) < 1.0:
+ print("[OK] Model is well-calibrated (bias < 1 point)")
+ elif abs(metrics["total_bias"]) < 3.0:
+ print("[WARNING] Model has slight bias (1-3 points)")
+ else:
+ direction = "over" if metrics["total_bias"] > 0 else "under"
+ print(
+ f"[ERROR] Model significantly {direction}predicting "
+ f"(bias: {abs(metrics['total_bias']):.1f} points)"
+ )
+
+ print("\n" + "=" * 70)
+
+
+def print_game_by_game(results: pd.DataFrame) -> None:
+ """Print game-by-game results."""
+ print("\n--- GAME BY GAME RESULTS ---\n")
+
+ # Sort by largest total error
+ results["total_error"] = abs(results["predicted_total"] - results["actual_total"])
+ sorted_results = results.sort_values("total_error", ascending=False)
+
+ for _idx, row in sorted_results.iterrows():
+ matchup = f"{row['away_team']} @ {row['home_team']}"
+ print(f"{matchup[:50]:50s}")
+
+ # Scores
+ pred_score = f"{row['predicted_home_score']:.1f}-{row['predicted_away_score']:.1f}"
+ actual_score = f"{row['actual_home_score']:.0f}-{row['actual_away_score']:.0f}"
+ print(f" Predicted: {pred_score:10s} | Actual: {actual_score:10s}")
+
+ # Total
+ total_error = row["predicted_total"] - row["actual_total"]
+ total_status = "OVER" if row["predicted_total"] > row["total_points"] else "UNDER"
+ actual_status = "OVER" if row["actual_total"] > row["total_points"] else "UNDER"
+ correct = "OK" if total_status == actual_status else "MISS"
+
+ print(
+ f" Total: {row['predicted_total']:.1f} pred vs {row['actual_total']:.0f} actual "
+ f"(market: {row['total_points']:.1f}) | Error: {total_error:+.1f} | {correct}"
+ )
+
+ # Margin
+ margin_error = row["predicted_margin"] - row["actual_margin"]
+ print(f" Margin: {margin_error:+.1f} error")
+ print()
+
+
+def main() -> None:
+ """Main execution."""
+ parser = argparse.ArgumentParser(description="Validate model predictions")
+ parser.add_argument(
+ "--date", type=str, help="Date to validate (YYYY-MM-DD, will look for predictions/DATE.csv)"
+ )
+ parser.add_argument("--predictions", type=Path, help="Path to predictions CSV file")
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Odds database path",
+ )
+ parser.add_argument("--output", type=Path, help="Output path for detailed results CSV")
+ parser.add_argument("--verbose", action="store_true", help="Show game-by-game results")
+
+ args = parser.parse_args()
+
+ # Determine predictions file
+ if args.predictions:
+ predictions_file = args.predictions
+ elif args.date:
+ predictions_file = Path(f"predictions/{args.date}_fresh_calibrated.csv")
+ if not predictions_file.exists():
+ predictions_file = Path(f"predictions/{args.date}_calibrated.csv")
+ if not predictions_file.exists():
+ predictions_file = Path(f"predictions/{args.date}.csv")
+ else:
+ # Default to today
+ today = date.today().isoformat()
+ predictions_file = Path(f"predictions/{today}.csv")
+
+ # Load predictions
+ try:
+ predictions = load_predictions(predictions_file)
+ except FileNotFoundError as e:
+ logger.error(str(e))
+ logger.info("Available prediction files:")
+ pred_dir = Path("predictions")
+ if pred_dir.exists():
+ for f in sorted(pred_dir.glob("*.csv")):
+ logger.info(f" {f}")
+ return
+
+ # Get actual scores
+ db = OddsAPIDatabase(str(args.db_path))
+ results = get_actual_scores(db, predictions)
+
+ if len(results) == 0:
+ logger.error("No completed games found to validate")
+ return
+
+ # Calculate metrics
+ metrics, results_detailed = calculate_metrics(results)
+
+ # Print report
+ print_metrics_report(metrics, results_detailed)
+
+ if args.verbose:
+ print_game_by_game(results_detailed)
+
+ # Save detailed results
+ if args.output:
+ write_csv(results_detailed, str(args.output), index=False)
+ logger.info(f"Saved detailed results to {args.output}")
+ else:
+ # Auto-save
+ output_file = predictions_file.parent / f"{predictions_file.stem}_validation.csv"
+ write_csv(results_detailed, str(output_file), index=False)
+ logger.info(f"Saved detailed results to {output_file}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/analysis/validate_team_matching.py b/scripts/analysis/validate_team_matching.py
new file mode 100644
index 000000000..172e9eafb
--- /dev/null
+++ b/scripts/analysis/validate_team_matching.py
@@ -0,0 +1,290 @@
+#!/usr/bin/env python3
+"""Validate team name matching across all data sources.
+
+Checks every team in today's games against KenPom database and reports:
+- Successful matches
+- Failed matches (no KenPom data)
+- Suspicious matches (fuzzy matched to wrong team)
+- Confidence scores for all matches
+
+Usage:
+    uv run python scripts/analysis/validate_team_matching.py
+    uv run python scripts/analysis/validate_team_matching.py --fix-mappings
+"""
+
+from __future__ import annotations
+
+import argparse
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from rich.console import Console
+from rich.table import Table
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.utils.team_matching import (
+ MANUAL_MAPPINGS,
+ find_best_match,
+)
+
+DB_PATH = Path("data/odds_api/odds_api.sqlite3")
+console = Console()
+
+
+def load_kenpom_teams() -> list[str]:
+ """Load all team names from KenPom database."""
+ db = OddsAPIDatabase(str(DB_PATH))
+ df = pd.read_sql_query("SELECT DISTINCT team FROM kp_pomeroy_ratings ORDER BY team", db.conn)
+ return df["team"].tolist()
+
+
+def load_todays_games() -> pd.DataFrame:
+ """Load today's games from analysis file."""
+ today = datetime.now().date().isoformat()
+ analysis_path = Path(f"data/analysis/complete_analysis_{today}_main_lines.csv")
+
+ if not analysis_path.exists():
+ console.print(f"[red]No analysis file found for {today}[/red]")
+ return pd.DataFrame()
+
+ # Note: Using pd.read_csv directly for CSV files (not Parquet)
+ # FilesystemAdapter doesn't have read_csv yet
+ return pd.read_csv(analysis_path)
+
+
+def validate_matches() -> dict[str, list[Any]]:
+ """Validate all team matches and return results."""
+ console.print("\n[bold cyan]TEAM MATCHING VALIDATION[/bold cyan]\n")
+
+ # Load data
+ kenpom_teams = load_kenpom_teams()
+ games = load_todays_games()
+
+ if len(games) == 0:
+ return {}
+
+ console.print(f"[OK] Loaded {len(kenpom_teams)} KenPom teams")
+ console.print(f"[OK] Loaded {len(games)} games\n")
+
+ # Get all unique teams
+ all_teams: set[str] = set()
+ if "away_team" in games.columns:
+ all_teams.update(games["away_team"].dropna().unique())
+ if "home_team" in games.columns:
+ all_teams.update(games["home_team"].dropna().unique())
+
+ console.print(f"[OK] Found {len(all_teams)} unique teams\n")
+
+ # Validate each team
+ results: dict[str, list[Any]] = {
+ "exact_manual": [],
+ "fuzzy_high": [],
+ "fuzzy_medium": [],
+ "fuzzy_low": [],
+ "failed": [],
+ "suspicious": [],
+ }
+
+ for team in sorted(all_teams):
+ # Check manual mapping
+ if team in MANUAL_MAPPINGS:
+ kp_name = MANUAL_MAPPINGS[team]
+ if kp_name in kenpom_teams:
+ results["exact_manual"].append((team, kp_name, 1.0))
+ continue
+ else:
+ results["failed"].append((team, f"Manual mapping broken: {kp_name}"))
+ continue
+
+ # Try fuzzy matching
+ match, score = find_best_match(team, kenpom_teams, threshold=0.0)
+
+ if match is None or score < 0.85:
+ results["failed"].append((team, f"No match (score: {score:.2f})"))
+ elif score >= 0.95:
+ results["fuzzy_high"].append((team, match, score))
+ elif score >= 0.90:
+ results["fuzzy_medium"].append((team, match, score))
+ # Check for suspicious matches
+ if _is_suspicious_match(team, match):
+ results["suspicious"].append((team, match, score))
+ else:
+ results["fuzzy_low"].append((team, match, score))
+ results["suspicious"].append((team, match, score))
+
+ return results
+
+
+def _is_suspicious_match(source: str, kenpom: str) -> bool:
+ """Check if match is suspicious (common false positives)."""
+ suspicious_pairs = [
+ ("Ohio", "Ohio St."),
+ ("Miami", "Miami FL"),
+ ("Miami", "Miami OH"),
+ ("Western", "Northwestern"),
+ ("Eastern", "Northeastern"),
+ ("Central", "North Central"),
+ ("Southern", "Northwestern"),
+ ("Illinois", "Illinois Chicago"),
+ ("Indiana", "Indiana St."),
+ ]
+
+ source_lower = source.lower()
+ kenpom_lower = kenpom.lower()
+
+ for s, k in suspicious_pairs:
+ if s.lower() in source_lower and k.lower() in kenpom_lower and source_lower != k.lower():
+ return True
+
+ return False
+
+
+def display_results(results: dict[str, list[Any]]) -> None:
+ """Display validation results in rich tables."""
+ console.print("\n" + "=" * 80)
+ console.print("[bold]VALIDATION RESULTS[/bold]")
+ console.print("=" * 80 + "\n")
+
+ # Exact manual mappings
+ if results["exact_manual"]:
+ table = Table(title="[green]Exact Manual Mappings[/green]", show_header=True)
+ table.add_column("Source Name", style="cyan")
+ table.add_column("KenPom Name", style="green")
+ table.add_column("Score", justify="right")
+
+ for source, kenpom, score in results["exact_manual"]:
+ table.add_row(source, kenpom, f"{score:.2f}")
+
+ console.print(table)
+ console.print(f"\n[green][OK] {len(results['exact_manual'])} exact matches[/green]\n")
+
+ # High confidence fuzzy
+ if results["fuzzy_high"]:
+ table = Table(title="[green]High Confidence Fuzzy Matches (0.95+)[/green]")
+ table.add_column("Source Name", style="cyan")
+ table.add_column("KenPom Name", style="green")
+ table.add_column("Score", justify="right")
+
+ for source, kenpom, score in results["fuzzy_high"]:
+ table.add_row(source, kenpom, f"{score:.2f}")
+
+ console.print(table)
+ high_count = len(results["fuzzy_high"])
+ console.print(f"\n[green][OK] {high_count} high confidence matches[/green]\n")
+
+ # Medium confidence fuzzy
+ if results["fuzzy_medium"]:
+ table = Table(title="[yellow]Medium Confidence Fuzzy Matches (0.90-0.95)[/yellow]")
+ table.add_column("Source Name", style="cyan")
+ table.add_column("KenPom Name", style="yellow")
+ table.add_column("Score", justify="right")
+
+ for source, kenpom, score in results["fuzzy_medium"]:
+ table.add_row(source, kenpom, f"{score:.2f}")
+
+ console.print(table)
+ med_count = len(results["fuzzy_medium"])
+ console.print(f"\n[yellow][WARN] {med_count} medium confidence matches[/yellow]\n")
+
+ # Low confidence fuzzy
+ if results["fuzzy_low"]:
+ table = Table(title="[red]Low Confidence Fuzzy Matches (0.85-0.90)[/red]")
+ table.add_column("Source Name", style="cyan")
+ table.add_column("KenPom Name", style="red")
+ table.add_column("Score", justify="right")
+
+ for source, kenpom, score in results["fuzzy_low"]:
+ table.add_row(source, kenpom, f"{score:.2f}")
+
+ console.print(table)
+ console.print(f"\n[red][WARN] {len(results['fuzzy_low'])} low confidence matches[/red]\n")
+
+ # Suspicious matches
+ if results["suspicious"]:
+ table = Table(title="[bold red]SUSPICIOUS MATCHES - LIKELY ERRORS[/bold red]")
+ table.add_column("Source Name", style="cyan")
+ table.add_column("Matched To", style="red")
+ table.add_column("Score", justify="right")
+
+ for source, kenpom, score in results["suspicious"]:
+ table.add_row(source, kenpom, f"{score:.2f}")
+
+ console.print(table)
+ suspicious_count = len(results["suspicious"])
+ console.print(
+ f"\n[bold red][ERROR] {suspicious_count} SUSPICIOUS matches found![/bold red]\n"
+ )
+
+ # Failed matches
+ if results["failed"]:
+ table = Table(title="[bold red]FAILED MATCHES[/bold red]")
+ table.add_column("Source Name", style="cyan")
+ table.add_column("Reason", style="red")
+
+ for source, reason in results["failed"]:
+ table.add_row(source, reason)
+
+ console.print(table)
+ console.print(f"\n[bold red][ERROR] {len(results['failed'])} failed matches![/bold red]\n")
+
+ # Summary
+ console.print("=" * 80)
+ console.print("[bold]SUMMARY[/bold]")
+ console.print("=" * 80)
+
+ total_teams = (
+ len(results["exact_manual"])
+ + len(results["fuzzy_high"])
+ + len(results["fuzzy_medium"])
+ + len(results["fuzzy_low"])
+ + len(results["failed"])
+ )
+
+ console.print(f"\nTotal teams validated: {total_teams}")
+ exact_high = len(results["exact_manual"]) + len(results["fuzzy_high"])
+ console.print(f"[green]Exact/High confidence: {exact_high}[/green]")
+ console.print(f"[yellow]Medium confidence: {len(results['fuzzy_medium'])}[/yellow]")
+ console.print(f"[red]Low confidence: {len(results['fuzzy_low'])}[/red]")
+ console.print(f"[bold red]Suspicious: {len(results['suspicious'])}[/bold red]")
+ console.print(f"[bold red]Failed: {len(results['failed'])}[/bold red]")
+
+ if results["suspicious"] or results["failed"] or results["fuzzy_low"]:
+ console.print(
+ "\n[bold red][ERROR] Team matching has issues - DO NOT use for betting![/bold red]"
+ )
+ console.print("Add manual mappings to fix suspicious/failed matches.\n")
+ else:
+ console.print("\n[bold green][OK] All teams matched successfully![/bold green]\n")
+
+
+def main() -> int:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(description="Validate team name matching")
+ parser.add_argument(
+ "--fix-mappings",
+ action="store_true",
+ help="Generate manual mapping entries for failed/suspicious matches",
+ )
+ args = parser.parse_args()
+
+ results = validate_matches()
+
+ if not results:
+ return 1
+
+ display_results(results)
+
+ if args.fix_mappings:
+ console.print("\n[bold cyan]SUGGESTED MANUAL MAPPINGS:[/bold cyan]\n")
+ for source, kenpom, score in results["suspicious"]:
+ console.print(f' "{source}": "{kenpom}", # VERIFY THIS - score: {score:.2f}')
+ for source, reason in results["failed"]:
+ console.print(f' "{source}": "???", # {reason}')
+
+ return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/analysis/wvu_ats_analysis.py b/scripts/analysis/wvu_ats_analysis.py
new file mode 100644
index 000000000..4f46e5ae7
--- /dev/null
+++ b/scripts/analysis/wvu_ats_analysis.py
@@ -0,0 +1,315 @@
+"""WVU ATS analysis for Texas Tech matchup."""
+
+from __future__ import annotations
+
+import sqlite3
+
+import pandas as pd
+
+
+def main() -> None:
+ conn = sqlite3.connect("data/odds_api/odds_api.sqlite3")
+
+ # Get all WVU games this season with scores
+ wvu_games = pd.read_sql_query(
+ """
+ SELECT e.event_id, e.commence_time, e.home_team, e.away_team,
+ s.home_score, s.away_score, s.completed
+ FROM events e
+ LEFT JOIN scores s ON e.event_id = s.event_id
+ WHERE e.sport_key = 'basketball_ncaab'
+ AND (e.home_team LIKE '%West Virginia%'
+ OR e.away_team LIKE '%West Virginia%')
+          AND e.commence_time >= '2025-11-01'
+ ORDER BY e.commence_time
+ """,
+ conn,
+ )
+
+ results = []
+ for _, g in wvu_games.iterrows():
+ eid = g["event_id"]
+
+        # Get closing spread from FanDuel (event_id bound as a query parameter)
+        spread_data = pd.read_sql_query(
+            """
+            SELECT outcome_name, point, price_american
+            FROM observations
+            WHERE event_id = ?
+              AND market_key = 'spreads'
+              AND book_key = 'fanduel'
+            ORDER BY fetched_at DESC
+            LIMIT 2
+            """,
+            conn,
+            params=(eid,),
+        )
+
+        # Get closing total from FanDuel
+        total_data = pd.read_sql_query(
+            """
+            SELECT point
+            FROM observations
+            WHERE event_id = ?
+              AND market_key = 'totals'
+              AND book_key = 'fanduel'
+              AND outcome_name = 'Over'
+            ORDER BY fetched_at DESC
+            LIMIT 1
+            """,
+            conn,
+            params=(eid,),
+        )
+
+ is_home = "West Virginia" in g["home_team"]
+ opp = g["away_team"] if is_home else g["home_team"]
+ wvu_score = g["home_score"] if is_home else g["away_score"]
+ opp_score = g["away_score"] if is_home else g["home_score"]
+
+ # Parse WVU spread
+ wvu_spread = None
+ if len(spread_data) > 0:
+ for _, sd in spread_data.iterrows():
+ if "West Virginia" in sd["outcome_name"]:
+ wvu_spread = sd["point"]
+ break
+ if wvu_spread is None:
+ for _, sd in spread_data.iterrows():
+ if "West Virginia" not in sd["outcome_name"]:
+ wvu_spread = -sd["point"]
+ break
+
+ close_total = total_data.iloc[0]["point"] if len(total_data) > 0 else None
+
+        # Clean opponent name: strip the school prefix and mascot suffixes
+        opp_clean = opp.replace("West Virginia ", "")
+        for mascot in (
+            " Mountaineers",
+            " Red Raiders",
+            " Wildcats",
+            " Jayhawks",
+            " Buffaloes",
+            " Bears",
+            " Cyclones",
+            " Sooners",
+            " Cougars",
+            " Longhorns",
+            " Horned Frogs",
+            " Cowboys",
+            " Sun Devils",
+            " Bearcats",
+            " Aggies",
+            " Spartans",
+            " Rams",
+            " Fighting Illini",
+            " Golden Eagles",
+        ):
+            opp_clean = opp_clean.replace(mascot, "")
+
+ results.append(
+ {
+ "date": g["commence_time"][:10],
+ "opp": opp_clean[:20],
+ "ha": "H" if is_home else "A",
+ "wvu": wvu_score,
+ "opp_sc": opp_score,
+ "spread": wvu_spread,
+ "total": close_total,
+ "completed": g["completed"],
+ }
+ )
+
+ conn.close()
+
+ df = pd.DataFrame(results)
+ scored = df[df["wvu"].notna()].copy()
+ scored["wvu"] = scored["wvu"].astype(float)
+ scored["opp_sc"] = scored["opp_sc"].astype(float)
+ scored["margin"] = scored["wvu"] - scored["opp_sc"]
+ scored["won"] = scored["margin"] > 0
+ scored["act_total"] = scored["wvu"] + scored["opp_sc"]
+
+ # ATS calc
+ with_spread = scored[scored["spread"].notna()].copy()
+ with_spread["ats_margin"] = with_spread["margin"] + with_spread["spread"]
+ with_spread["covered"] = with_spread["ats_margin"] > 0
+ with_spread["push"] = with_spread["ats_margin"] == 0
+
+ # O/U calc
+ with_total = with_spread[with_spread["total"].notna()].copy()
+ with_total["went_over"] = with_total["act_total"] > with_total["total"]
+
+ # Print game log
+ hdr = "=" * 105
+ print(hdr)
+ print(" WEST VIRGINIA MOUNTAINEERS - 2025-26 GAME LOG")
+ print(hdr)
+ print()
+ header = (
+ f"{'Date':12s} {'':4s} {'Opponent':20s} {'WVU':>5s} {'Opp':>5s} "
+ f"{'Margin':>7s} {'W/L':>4s} {'Spread':>7s} {'ATS':>4s} "
+ f"{'Total':>6s} {'O/U':>4s} {'ActTot':>7s}"
+ )
+ print(header)
+ print("-" * 105)
+
+ for _, r in scored.iterrows():
+ wl = "W" if r["won"] else "L"
+ sp_str = f"{r['spread']:+.1f}" if pd.notna(r["spread"]) else " N/A"
+
+ # ATS
+ if pd.notna(r["spread"]):
+ am = r["margin"] + r["spread"]
+ ats = "W" if am > 0 else ("P" if am == 0 else "L")
+ else:
+ ats = "-"
+
+ # O/U
+ if pd.notna(r.get("total")):
+ t_str = f"{r['total']:.0f}"
+ if r["act_total"] > r["total"]:
+ ou = "O"
+ elif r["act_total"] < r["total"]:
+ ou = "U"
+ else:
+ ou = "P"
+ else:
+ t_str = "N/A"
+ ou = "-"
+
+ print(
+ f"{r['date']:12s} {r['ha']:4s} {r['opp']:20s} "
+ f"{r['wvu']:5.0f} {r['opp_sc']:5.0f} {r['margin']:+7.0f} "
+ f"{wl:>4s} {sp_str:>7s} {ats:>4s} {t_str:>6s} {ou:>4s} "
+ f"{r['act_total']:7.0f}"
+ )
+
+ # Summaries
+ print()
+ print(hdr)
+ print(" SUMMARY STATS")
+ print(hdr)
+
+ wins = int(scored["won"].sum())
+ losses = len(scored) - wins
+ print(f" Overall: {wins}-{losses}")
+
+ if len(with_spread) > 0:
+ ats_w = int(with_spread["covered"].sum())
+ ats_l = int((~with_spread["covered"] & ~with_spread["push"]).sum())
+ ats_p = int(with_spread["push"].sum())
+ pct = ats_w / (ats_w + ats_l) * 100 if (ats_w + ats_l) > 0 else 0
+ print(f" ATS: {ats_w}-{ats_l}-{ats_p} ({pct:.1f}%)")
+
+ if len(with_total) > 0:
+ overs = int(with_total["went_over"].sum())
+ unders = len(with_total) - overs
+ print(f" O/U: {overs} Over, {unders} Under")
+ print(f" Avg Actual Total: {with_total['act_total'].mean():.1f}")
+
+ # Home splits
+ home = scored[scored["ha"] == "H"]
+ away = scored[scored["ha"] == "A"]
+ home_ws = with_spread[with_spread.index.isin(home.index)]
+ away_ws = with_spread[with_spread.index.isin(away.index)]
+
+ print()
+ print(" --- HOME ---")
+ hw = int(home["won"].sum())
+ hl = len(home) - hw
+ print(f" SU: {hw}-{hl}")
+ if len(home_ws) > 0:
+ haw = int(home_ws["covered"].sum())
+ hal = int((~home_ws["covered"] & ~home_ws["push"]).sum())
+ print(f" ATS: {haw}-{hal}")
+ print(f" Avg Margin: {home['margin'].mean():+.1f}")
+
+ print()
+ print(" --- AWAY ---")
+ aw = int(away["won"].sum())
+ al_ = len(away) - aw
+ print(f" SU: {aw}-{al_}")
+ if len(away_ws) > 0:
+ aaw = int(away_ws["covered"].sum())
+ aal = int((~away_ws["covered"] & ~away_ws["push"]).sum())
+ print(f" ATS: {aaw}-{aal}")
+ print(f" Avg Margin: {away['margin'].mean():+.1f}")
+
+ # As underdog
+ dog = with_spread[with_spread["spread"] > 0]
+ fav = with_spread[with_spread["spread"] < 0]
+
+ print()
+ print(" --- AS UNDERDOG ---")
+ if len(dog) > 0:
+ dw = int(dog["covered"].sum())
+ dl = int((~dog["covered"] & ~dog["push"]).sum())
+ dsu_w = int((dog["margin"] > 0).sum())
+ dsu_l = len(dog) - dsu_w
+ print(f" Games: {len(dog)}")
+ print(f" SU: {dsu_w}-{dsu_l} (outright wins)")
+ print(f" ATS: {dw}-{dl}")
+ print(f" Avg Spread: +{dog['spread'].mean():.1f}")
+ print(f" Avg Margin: {dog['margin'].mean():+.1f}")
+ else:
+ print(" No games as underdog")
+
+ print()
+ print(" --- AS FAVORITE ---")
+ if len(fav) > 0:
+ fw = int(fav["covered"].sum())
+ fl = int((~fav["covered"] & ~fav["push"]).sum())
+ print(f" Games: {len(fav)}")
+ print(f" ATS: {fw}-{fl}")
+ print(f" Avg Spread: {fav['spread'].mean():.1f}")
+ else:
+ print(" No games as favorite")
+
+ # Last 5
+ print()
+ print(" --- LAST 5 GAMES ---")
+ last5 = scored.tail(5)
+ l5w = int(last5["won"].sum())
+ print(f" SU: {l5w}-{len(last5) - l5w}")
+ print(f" Avg Margin: {last5['margin'].mean():+.1f}")
+ print(f" Avg Total: {last5['act_total'].mean():.1f}")
+ l5_ws = with_spread[with_spread.index.isin(last5.index)]
+ if len(l5_ws) > 0:
+ l5aw = int(l5_ws["covered"].sum())
+ l5al = int((~l5_ws["covered"] & ~l5_ws["push"]).sum())
+ print(f" ATS: {l5aw}-{l5al}")
+
+ # Conference play only
+ print()
+ print(" --- BIG 12 PLAY ONLY ---")
+ b12_teams = [
+ "Arizona",
+ "Arizona St",
+ "Baylor",
+ "BYU",
+ "Cincinnati",
+ "Colorado",
+ "Houston",
+ "Iowa St",
+ "Kansas",
+ "Kansas St",
+ "Oklahoma St",
+ "TCU",
+ "Texas Tech",
+ "UCF",
+ "Utah",
+ "West Virginia",
+ ]
+ # NOTE: substring matching can over-match similar names (e.g. "Utah" also hits "Utah St").
+ conf = scored[scored["opp"].apply(lambda x: any(t in x for t in b12_teams))]
+ if len(conf) > 0:
+ cw = int(conf["won"].sum())
+ cl_ = len(conf) - cw
+ print(f" SU: {cw}-{cl_}")
+ print(f" Avg Margin: {conf['margin'].mean():+.1f}")
+ print(f" Avg Total: {conf['act_total'].mean():.1f}")
+ conf_ws = with_spread[with_spread.index.isin(conf.index)]
+ if len(conf_ws) > 0:
+ caw = int(conf_ws["covered"].sum())
+ cal_ = int((~conf_ws["covered"] & ~conf_ws["push"]).sum())
+ print(f" ATS: {caw}-{cal_}")
+
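The conference filter above relies on substring containment, which can over-match similar team names. A tiny sketch (with made-up opponent strings) of exact-match filtering against a normalized set, assuming the opponent column holds clean canonical names:

```python
# Exact membership avoids substring false positives such as "Utah" vs "Utah St".
b12 = {"Arizona", "Arizona St", "Utah", "Kansas"}
opponents = ["Utah St", "Utah", "Kansas", "Houston Christian"]

in_conf = [opp in b12 for opp in opponents]
print(in_conf)  # [False, True, True, False]
```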
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/README.md b/scripts/archive/2026-02-deprecated/README.md
new file mode 100644
index 000000000..d89ccead9
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/README.md
@@ -0,0 +1,96 @@
+# Deprecated Scripts Archive (February 2026)
+
+Scripts moved here during the February 2026 cleanup to reduce redundancy and improve architectural consistency.
+
+## Why These Scripts Were Archived
+
+This archive contains 30+ scripts that were deprecated due to:
+- **Redundancy**: Multiple scripts doing the same thing
+- **Architecture violations**: Direct I/O instead of using adapters/services
+- **One-off utilities**: Scripts used once for data fixes
+- **Demo/test scripts**: Example code that should be in examples/
+
+## What Was Archived
+
+### Redundant Overtime Collection (5 scripts)
+Consolidated to `overtime_collector_service.py` in main scripts directory.
+
+- `collect_overtime_realtime.py` - SignalR real-time (functionality in service)
+- `collect_overtime_scheduled.py` - Scheduled snapshots (functionality in service)
+- `collect_overtime_odds_csv.py` - CSV output variant
+- `track_overtime_daily.py` - Daily tracking variant
+- `overtime_ag_scraper.py` - Demo script for REST + WebSocket
+
+### Redundant Live Monitors (4 scripts)
+To be consolidated into a single `monitor_live_odds.py` with configurable data sources.
+
+- `live_line_monitor_overtime.py` - Overtime variant
+- `monitor_live_lines.py` - Generic monitor
+- `monitor_live_rich.py` - Rich console variant
+- `monitor_live_with_logging.py` - Logging variant
+
+### Redundant Team Mapping (2 scripts)
+Consolidated to `build_team_mapping.py` in main scripts directory.
+
+- `create_team_mapping.py` - Alternative creation approach
+- `rebuild_team_mapping.py` - Rebuilds mapping (same as build)
+
+### Redundant Dataset Builders (3 scripts)
+Consolidated to `build_training_datasets.py` which uses services layer.
+
+- `build_datasets_espn_odds.py` - Direct approach without services
+- `build_dataset_comprehensive.py` - Comprehensive variant
+- `export_complete_dataset.py` - Export variant
+
+### Redundant Training Scripts (2 scripts)
+Consolidated to `train_spreads_model.py` and `train_totals_model.py`.
+
+- `train_spreads_basic_model.py` - Basic approach (full model supports all cases)
+- `walk_forward_training_fast.py` - Fast variant (use train_walkforward.py)
+
+### Demo/Test Scripts (5 scripts)
+Example code that doesn't belong in production scripts directory.
+
+- `demo_bookmaker_accuracy.py` - Demo of bookmaker analysis
+- `demo_market_features.py` - Demo of market features
+- `demo_tracker.py` - Demo of tracker
+- `test_simple.py` - Simple test script
+- `backfill_example.py` - Example backfill script
+
+### One-off/Fix Scripts (4 scripts)
+Scripts used once for data fixes or migrations.
+
+- `fix_team_mapping.py` - One-time mapping fix
+- `fix_complete_analysis.py` - One-time analysis fix
+- `add_dates_to_training_data.py` - One-time data migration
+- `force_update_views.py` - Database view update utility
+
+### Additional Cleanup (5 scripts)
+- `comprehensive_odds_with_overtime.py` - Redundant collection script
+- `save_overtime_snapshot.py` - Functionality should be in service
+- `manual_overtime_entry.py` - One-off manual entry utility
+- `verify_odds_streaming.py` - Test/verification script
+- `view_collected_odds.py` - Utility for viewing data
+
+## Migration Notes
+
+### If you need functionality from archived scripts:
+
+1. **Overtime Collection**: Use `overtime_collector_service.py`
+2. **Live Monitoring**: Use `live_line_monitor.py` (TODO: refactor to use adapters)
+3. **Team Mapping**: Use `build_team_mapping.py`
+4. **Dataset Building**: Use `build_training_datasets.py`
+5. **Model Training**: Use `train_spreads_model.py` or `train_totals_model.py`
+
+### Restoration
+
+If you need to restore a script:
+```bash
+cp scripts/archive/2026-02-deprecated/<script_name>.py scripts/
+```
+
+## Cleanup Date
+
+**Archived**: 2026-02-04
+**Archived By**: Claude Code + Andy
+**Total Scripts Archived**: 30+
diff --git a/scripts/archive/2026-02-deprecated/add_dates_to_training_data.py b/scripts/archive/2026-02-deprecated/add_dates_to_training_data.py
new file mode 100644
index 000000000..efd9726c6
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/add_dates_to_training_data.py
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+"""Fast approach: add synthetic dates to existing training data for walk-forward validation.
+
+Reuses the existing high-quality training datasets and assigns approximate, evenly
+distributed game dates across the collection window, enabling temporal validation
+without rebuilding everything. Real dates from the database are reported but cannot
+be joined directly, since the training frames carry no game identifier.
+"""
+
+import sqlite3
+from pathlib import Path
+
+import pandas as pd
+
+DB_PATH = Path("data/odds_api/odds_api.sqlite3")
+SPREADS_PATH = Path("data/ml/spreads_2025-12-28_2026-02-01.parquet")
+TOTALS_PATH = Path("data/ml/totals_2025-12-28_2026-02-01.parquet")
+
+print("=" * 80)
+print("FAST APPROACH - ADD DATES TO EXISTING TRAINING DATA")
+print("=" * 80)
+
+# Load existing training data
+print("\n[1/3] Loading existing training datasets...")
+spreads_df = pd.read_parquet(SPREADS_PATH)
+totals_df = pd.read_parquet(TOTALS_PATH)
+
+print(f" Spreads: {len(spreads_df)} games, {len(spreads_df.columns)} features")
+print(f" Totals: {len(totals_df)} games, {len(totals_df.columns)} features")
+
+# Connect to database
+print("\n[2/3] Extracting game dates from database...")
+conn = sqlite3.connect(DB_PATH)
+
+# Get all games with dates
+games_query = """
+SELECT
+ espn_event_id,
+ game_date,
+ away_team,
+ home_team
+FROM espn_scores
+WHERE game_date >= '2025-12-28'
+ AND game_date <= '2026-02-01'
+ AND completed = 1
+ORDER BY game_date
+"""
+
+games_with_dates = pd.read_sql_query(games_query, conn)
+conn.close()
+
+print(f" [OK] Found {len(games_with_dates)} games with dates")
+
+# Create synthetic, evenly distributed dates; the training frames have no key to join the real dates above
+print("\n[3/3] Adding dates to training data...")
+
+# For spreads dataset - distribute evenly across date range
+min_date = pd.to_datetime("2025-12-28")
+max_date = pd.to_datetime("2026-02-01")
+date_range = (max_date - min_date).days
+
+spreads_df["game_date"] = [
+ (min_date + pd.Timedelta(days=int(i * date_range / len(spreads_df)))).strftime("%Y-%m-%d")
+ for i in range(len(spreads_df))
+]
+
+# For totals dataset
+totals_df["game_date"] = [
+ (min_date + pd.Timedelta(days=int(i * date_range / len(totals_df)))).strftime("%Y-%m-%d")
+ for i in range(len(totals_df))
+]
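If the training frames did carry team columns, the real dates queried above could be joined instead of fabricated. A minimal sketch with hypothetical frames (the `home_team`/`away_team`/`game_date` column names are assumptions for illustration, not the actual parquet schema):

```python
import pandas as pd

# Hypothetical stand-ins: in the real script, the features frame comes from parquet
# and the dates frame from the espn_scores query.
features = pd.DataFrame(
    {
        "home_team": ["Kansas", "Baylor"],
        "away_team": ["BYU", "TCU"],
        "em_diff": [4.2, -1.3],
    }
)
dates = pd.DataFrame(
    {
        "home_team": ["Kansas", "Baylor"],
        "away_team": ["BYU", "TCU"],
        "game_date": ["2026-01-10", "2026-01-17"],
    }
)

# Left-join real dates onto the feature rows; unmatched games keep NaN
# rather than receiving a fabricated date.
with_dates = features.merge(dates, on=["home_team", "away_team"], how="left")
print(with_dates["game_date"].tolist())  # ['2026-01-10', '2026-01-17']
```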
+
+print(" [OK] Added dates to spreads dataset")
+print(" [OK] Added dates to totals dataset")
+
+# Save enhanced datasets
+spreads_output = Path("data/ml/spreads_with_dates_2026.parquet")
+totals_output = Path("data/ml/totals_with_dates_2026.parquet")
+
+spreads_df.to_parquet(spreads_output, index=False)
+totals_df.to_parquet(totals_output, index=False)
+
+print(f"\n[SAVED] {spreads_output}")
+print(f"[SAVED] {totals_output}")
+
+# Summary
+print("\n" + "=" * 80)
+print("READY FOR WALK-FORWARD VALIDATION")
+print("=" * 80)
+print(f"\nSpreads: {len(spreads_df)} games")
+print(f" Date range: {spreads_df['game_date'].min()} to {spreads_df['game_date'].max()}")
+print(f" Features: {len(spreads_df.columns) - 1} (excluding target)")
+
+print(f"\nTotals: {len(totals_df)} games")
+print(f" Date range: {totals_df['game_date'].min()} to {totals_df['game_date'].max()}")
+print(f" Features: {len(totals_df.columns) - 1} (excluding target)")
+
+print("\n[NEXT STEP] Run walk-forward training:")
+print(" uv run python scripts/walk_forward_training.py")
diff --git a/scripts/archive/2026-02-deprecated/backfill_example.py b/scripts/archive/2026-02-deprecated/backfill_example.py
new file mode 100644
index 000000000..d6e6e041d
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/backfill_example.py
@@ -0,0 +1,7 @@
+def main() -> int:
+ print("Backfill placeholder")
+ return 0
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
diff --git a/scripts/archive/2026-02-deprecated/build_dataset_comprehensive.py b/scripts/archive/2026-02-deprecated/build_dataset_comprehensive.py
new file mode 100644
index 000000000..61436b032
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/build_dataset_comprehensive.py
@@ -0,0 +1,319 @@
+#!/usr/bin/env python3
+"""Build comprehensive ML dataset with proper dates and KenPom features."""
+
+import sqlite3
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+
+DB_PATH = Path("data/odds_api/odds_api.sqlite3")
+OUTPUT_DIR = Path("data/ml")
+OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+print("=" * 80)
+print("COMPREHENSIVE DATASET BUILDER")
+print("=" * 80)
+
+conn = sqlite3.connect(DB_PATH)
+
+# Step 1: Get completed games
+print("\n[1/7] Loading completed games with scores...")
+games_df = pd.read_sql_query(
+ """
+ SELECT
+ espn_event_id, game_date, commence_time,
+ away_team, home_team, away_score, home_score
+ FROM espn_scores
+ WHERE completed = 1
+ AND away_score IS NOT NULL
+ AND home_score IS NOT NULL
+ ORDER BY game_date
+ """,
+ conn,
+)
+print(
+ f" [OK] {len(games_df):,} games | "
+ f"{games_df['game_date'].min()} to {games_df['game_date'].max()}"
+)
+
+# Step 2: Match to odds events
+print("\n[2/7] Matching games to odds events...")
+events_df = pd.read_sql_query(
+ """
+ SELECT DISTINCT
+ event_id,
+ away_team,
+ home_team,
+ DATE(commence_time) as event_date
+ FROM events
+ WHERE has_odds = 1
+ """,
+ conn,
+)
+
+merged = games_df.merge(
+ events_df,
+ left_on=["away_team", "home_team", "game_date"],
+ right_on=["away_team", "home_team", "event_date"],
+ how="inner",
+)
+print(f" [OK] Matched {len(merged):,} games to odds events")
+
+# Step 3: Get closing lines
+print("\n[3/7] Loading closing lines (FanDuel)...")
+closing_df = pd.read_sql_query(
+ """
+ SELECT
+ o.event_id,
+ o.market_key,
+ o.outcome_name,
+ o.point as closing_line,
+ o.price_american as closing_juice
+ FROM observations o
+ INNER JOIN (
+ SELECT
+ event_id,
+ market_key,
+ outcome_name,
+ MAX(as_of) as last_seen
+ FROM observations
+ WHERE book_key = 'fanduel'
+ AND market_key IN ('spreads', 'totals')
+ GROUP BY event_id, market_key, outcome_name
+ ) last
+ ON o.event_id = last.event_id
+ AND o.market_key = last.market_key
+ AND o.outcome_name = last.outcome_name
+ AND o.as_of = last.last_seen
+ WHERE o.book_key = 'fanduel'
+ """,
+ conn,
+)
+print(f" [OK] {len(closing_df):,} closing line observations")
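The self-join in the SQL above selects the most recent observation per (event, market, outcome). The same "last row per group" idea in pandas, on toy data:

```python
import pandas as pd

obs = pd.DataFrame(
    {
        "event_id": ["e1", "e1", "e1"],
        "market_key": ["spreads"] * 3,
        "outcome_name": ["Kansas"] * 3,
        "as_of": ["2026-01-01", "2026-01-02", "2026-01-03"],
        "point": [-6.5, -7.0, -7.5],
    }
)

# Sort by timestamp, then keep the last row within each group.
closing = (
    obs.sort_values("as_of")
    .groupby(["event_id", "market_key", "outcome_name"])
    .tail(1)
)
print(closing["point"].tolist())  # [-7.5]
```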
+
+# Step 4: Load KenPom ratings
+print("\n[4/7] Loading KenPom ratings...")
+kenpom_df = pd.read_sql_query(
+ """
+ SELECT
+ team,
+ adj_em,
+ adj_o,
+ adj_d,
+ adj_t,
+ luck,
+ rank
+ FROM kp_pomeroy_ratings
+ WHERE season = 2026
+ """,
+ conn,
+)
+print(f" [OK] {len(kenpom_df)} teams")
+
+# Step 5: Load Four Factors
+print("\n[5/7] Loading Four Factors...")
+try:
+ ff_df = pd.read_sql_query(
+ """
+ SELECT
+ team,
+ efg_pct,
+ to_pct,
+ or_pct,
+ ftrate,
+ efg_pct_d,
+ to_pct_d,
+ or_pct_d,
+ ftrate_d
+ FROM kp_four_factors
+ WHERE season = 2026
+ """,
+ conn,
+ )
+ print(f" [OK] {len(ff_df)} teams")
+except Exception as e:
+ print(f" [WARN] Four Factors error: {e}")
+ ff_df = None
+
+conn.close()
+
+# Build dataset
+print("\n[6/7] Building comprehensive dataset...")
+
+
+def normalize_name(name):
+ """Extract core team name for matching."""
+ return str(name).split()[-1] if name else ""
+
+
+def match_kenpom(team_name, kp_df):
+ """Match team to KenPom data."""
+ core = normalize_name(team_name)
+ match = kp_df[kp_df["team"].str.contains(core, case=False, na=False)]
+ return match.iloc[0].to_dict() if not match.empty else None
+
+
+dataset = []
+for _, game in merged.iterrows():
+ event_id = game["event_id"]
+
+ # Get spreads
+ spreads = closing_df[
+ (closing_df["event_id"] == event_id) & (closing_df["market_key"] == "spreads")
+ ]
+
+ home_spread = spreads[
+ spreads["outcome_name"].str.contains(
+ normalize_name(game["home_team"]), case=False, na=False
+ )
+ ]
+
+ if home_spread.empty:
+ continue
+
+ closing_spread = home_spread.iloc[0]["closing_line"]
+
+ # Get totals
+ totals = closing_df[
+ (closing_df["event_id"] == event_id) & (closing_df["market_key"] == "totals")
+ ]
+ over_row = totals[totals["outcome_name"] == "Over"]
+ closing_total = over_row.iloc[0]["closing_line"] if not over_row.empty else None
+
+ # Match KenPom
+ away_kp = match_kenpom(game["away_team"], kenpom_df)
+ home_kp = match_kenpom(game["home_team"], kenpom_df)
+
+ if not away_kp or not home_kp:
+ continue
+
+ # Build feature dict
+ features = {
+ "game_date": game["game_date"],
+ "away_team": game["away_team"],
+ "home_team": game["home_team"],
+ "away_score": game["away_score"],
+ "home_score": game["home_score"],
+ "closing_spread": closing_spread,
+ "closing_total": closing_total,
+ "away_adj_em": away_kp["adj_em"],
+ "away_adj_o": away_kp["adj_o"],
+ "away_adj_d": away_kp["adj_d"],
+ "away_adj_t": away_kp["adj_t"],
+ "away_luck": away_kp["luck"],
+ "home_adj_em": home_kp["adj_em"],
+ "home_adj_o": home_kp["adj_o"],
+ "home_adj_d": home_kp["adj_d"],
+ "home_adj_t": home_kp["adj_t"],
+ "home_luck": home_kp["luck"],
+ "kenpom_margin": home_kp["adj_em"] - away_kp["adj_em"],
+ "tempo_avg": (away_kp["adj_t"] + home_kp["adj_t"]) / 2,
+ }
+
+ # Add Four Factors if available
+ if ff_df is not None:
+ away_ff = match_kenpom(game["away_team"], ff_df)
+ home_ff = match_kenpom(game["home_team"], ff_df)
+ if away_ff and home_ff:
+ # Copy all eight Four Factors columns selected above (offense and defense).
+ ff_cols = ["efg_pct", "to_pct", "or_pct", "ftrate", "efg_pct_d", "to_pct_d", "or_pct_d", "ftrate_d"]
+ for col in ff_cols:
+ features[f"away_{col}"] = away_ff.get(col)
+ features[f"home_{col}"] = home_ff.get(col)
+
+ dataset.append(features)
+
+df = pd.DataFrame(dataset)
+print(f" [OK] Built dataset with {len(df):,} games")
+
+# Calculate outcomes
+print("\n[7/7] Calculating outcomes and creating targets...")
+df["actual_margin"] = df["home_score"] - df["away_score"]
+df["favorite_covered"] = np.where(
+ df["closing_spread"] < 0, # Home is favorite
+ df["actual_margin"] > abs(df["closing_spread"]),
+ df["actual_margin"] < -abs(df["closing_spread"]),
+)
+
+df["actual_total"] = df["home_score"] + df["away_score"]
+df["over_hit"] = df["actual_total"] > df["closing_total"]
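The `np.where` grading above counts a push (favorite wins by exactly the spread) as a non-cover. A self-contained sketch that flags pushes explicitly, using toy rows and the same home-perspective sign convention (negative = home favored):

```python
import numpy as np
import pandas as pd

# Toy rows: (home -7, wins by 7) = push; (home -7, wins by 10) = cover;
# (away favored by 3.5, away wins by 4) = cover.
g = pd.DataFrame({"closing_spread": [-7.0, -7.0, 3.5], "actual_margin": [7.0, 10.0, -4.0]})

home_is_fav = g["closing_spread"] < 0
# Margin from the favorite's perspective.
fav_margin = np.where(home_is_fav, g["actual_margin"], -g["actual_margin"])
magnitude = g["closing_spread"].abs()

g["push"] = fav_margin == magnitude
g["favorite_covered"] = (fav_margin > magnitude) & ~g["push"]
print(g[["push", "favorite_covered"]])
```

Grading pushes separately keeps them out of both the win and loss columns, which matters once the targets feed a classifier.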
+
+# Create spreads dataset (favorite/underdog perspective)
+spreads_df = df.copy()
+spreads_df["is_home_fav"] = spreads_df["closing_spread"] < 0
+
+for idx, row in spreads_df.iterrows():
+ if row["is_home_fav"]:
+ spreads_df.at[idx, "fav_adj_em"] = row["home_adj_em"]
+ spreads_df.at[idx, "fav_adj_o"] = row["home_adj_o"]
+ spreads_df.at[idx, "fav_adj_d"] = row["home_adj_d"]
+ spreads_df.at[idx, "fav_adj_t"] = row["home_adj_t"]
+ spreads_df.at[idx, "fav_luck"] = row["home_luck"]
+ spreads_df.at[idx, "dog_adj_em"] = row["away_adj_em"]
+ spreads_df.at[idx, "dog_adj_o"] = row["away_adj_o"]
+ spreads_df.at[idx, "dog_adj_d"] = row["away_adj_d"]
+ spreads_df.at[idx, "dog_adj_t"] = row["away_adj_t"]
+ spreads_df.at[idx, "dog_luck"] = row["away_luck"]
+ else:
+ spreads_df.at[idx, "fav_adj_em"] = row["away_adj_em"]
+ spreads_df.at[idx, "fav_adj_o"] = row["away_adj_o"]
+ spreads_df.at[idx, "fav_adj_d"] = row["away_adj_d"]
+ spreads_df.at[idx, "fav_adj_t"] = row["away_adj_t"]
+ spreads_df.at[idx, "fav_luck"] = row["away_luck"]
+ spreads_df.at[idx, "dog_adj_em"] = row["home_adj_em"]
+ spreads_df.at[idx, "dog_adj_o"] = row["home_adj_o"]
+ spreads_df.at[idx, "dog_adj_d"] = row["home_adj_d"]
+ spreads_df.at[idx, "dog_adj_t"] = row["home_adj_t"]
+ spreads_df.at[idx, "dog_luck"] = row["home_luck"]
+
+spreads_df["em_diff"] = spreads_df["fav_adj_em"] - spreads_df["dog_adj_em"]
+spreads_df["target"] = spreads_df["favorite_covered"].astype(int)
+
+# Create totals dataset (home/away perspective)
+totals_df = df[df["closing_total"].notna()].copy()
+totals_df["target"] = totals_df["over_hit"].astype(int)
+
+# Save datasets
+spreads_output = OUTPUT_DIR / "spreads_comprehensive_2026.parquet"
+totals_output = OUTPUT_DIR / "totals_comprehensive_2026.parquet"
+
+spreads_df.to_parquet(spreads_output, index=False)
+totals_df.to_parquet(totals_output, index=False)
+
+print(f"\n[SAVED] Spreads -> {spreads_output}")
+print(f"[SAVED] Totals -> {totals_output}")
+
+# Summary
+print("\n" + "=" * 80)
+print("DATASET SUMMARY")
+print("=" * 80)
+
+print("\n[SPREADS]")
+print(f" Games: {len(spreads_df):,}")
+print(f" Date range: {spreads_df['game_date'].min()} to {spreads_df['game_date'].max()}")
+print(" Target distribution:")
+spreads_target_sum = spreads_df["target"].sum()
+spreads_target_pct = spreads_df["target"].mean() * 100
+spreads_target_fail = (~spreads_df["target"].astype(bool)).sum()
+spreads_target_fail_pct = (1 - spreads_df["target"].mean()) * 100
+print(f" Favorite covered: {spreads_target_sum} ({spreads_target_pct:.1f}%)")
+print(f" Favorite failed: {spreads_target_fail} ({spreads_target_fail_pct:.1f}%)")
+
+print("\n[TOTALS]")
+print(f" Games: {len(totals_df):,}")
+print(f" Date range: {totals_df['game_date'].min()} to {totals_df['game_date'].max()}")
+print(" Target distribution:")
+totals_target_sum = totals_df["target"].sum()
+totals_target_pct = totals_df["target"].mean() * 100
+totals_target_fail = (~totals_df["target"].astype(bool)).sum()
+totals_target_fail_pct = (1 - totals_df["target"].mean()) * 100
+print(f" Over hit: {totals_target_sum} ({totals_target_pct:.1f}%)")
+print(f" Under hit: {totals_target_fail} ({totals_target_fail_pct:.1f}%)")
+
+print("\n[FEATURES]")
+print(" - KenPom: adj_em, adj_o, adj_d, adj_t, luck")
+print(" - Four Factors: efg_pct, to_pct, or_pct, ftrate (both teams, if available)")
+print(" - Derived: kenpom_margin, tempo_avg, em_diff")
+print(" - Market: closing_spread, closing_total")
+print(" - Dates: game_date (for walk-forward validation)")
+
+print("\n[NEXT STEP] Run walk-forward training with temporal validation")
diff --git a/scripts/archive/2026-02-deprecated/build_datasets_espn_odds.py b/scripts/archive/2026-02-deprecated/build_datasets_espn_odds.py
new file mode 100644
index 000000000..fb8c5334a
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/build_datasets_espn_odds.py
@@ -0,0 +1,364 @@
+"""Build training datasets by matching ESPN results with Odds API historical lines.
+
+Combines:
+- ESPN game results (actual scores)
+- Odds API historical lines (spreads, totals, movement)
+- KenPom efficiency metrics
+- Team name mapping
+
+Usage:
+ # First collect ESPN results:
+ uv run python scripts/collect_espn_season_results.py \\
+ --start 2025-11-01 --end 2026-03-31
+
+ # Then build datasets:
+ uv run python scripts/build_datasets_espn_odds.py \\
+ --espn-data data/espn/season_results_2026.parquet \\
+ --output-dir data/ml
+"""
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import write_parquet
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.core.team_mapper import TeamMapper
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def match_espn_to_odds_api(
+ espn_games: pd.DataFrame,
+ odds_db: OddsAPIDatabase,
+ team_mapper: TeamMapper | None,
+) -> pd.DataFrame:
+ """Match ESPN games to Odds API events.
+
+ Args:
+ espn_games: DataFrame with ESPN game results
+ odds_db: Odds API database adapter
+ team_mapper: Team name mapper (or None for direct matching)
+
+ Returns:
+ DataFrame with matched games including event_id from Odds API
+ """
+ logger.info("Matching ESPN games to Odds API events...")
+
+ # Get all Odds API events
+ odds_events = odds_db.get_events_with_scores()
+
+ # Convert game_date strings to dates
+ espn_games["game_date_only"] = pd.to_datetime(espn_games["game_date"]).dt.date
+ odds_events["game_date_only"] = pd.to_datetime(odds_events["commence_time"]).dt.date
+
+ matches = []
+ unmatched_count = 0
+
+ for _, espn_game in espn_games.iterrows():
+ espn_date = espn_game["game_date_only"]
+ espn_home = espn_game["home_team"]
+ espn_away = espn_game["away_team"]
+
+ # Map ESPN names to Odds API names
+ if team_mapper:
+ # Get KenPom names first (canonical)
+ kenpom_home = team_mapper.get_kenpom_name(espn_home, source="espn")
+ kenpom_away = team_mapper.get_kenpom_name(espn_away, source="espn")
+
+ # Then map to Odds API names
+ odds_home = team_mapper.get_odds_api_name(kenpom_home)
+ odds_away = team_mapper.get_odds_api_name(kenpom_away)
+ else:
+ # Direct name matching (fallback)
+ odds_home = espn_home
+ odds_away = espn_away
+
+ # Find matching Odds API event (same date, same teams)
+ candidates = odds_events[odds_events["game_date_only"] == espn_date]
+
+ match = None
+ for _, odds_event in candidates.iterrows():
+ odds_home_team = odds_event["home_team"]
+ odds_away_team = odds_event["away_team"]
+
+ # Check if teams match (either order due to neutral sites)
+ if (odds_home_team == odds_home and odds_away_team == odds_away) or (
+ odds_home_team == odds_away and odds_away_team == odds_home
+ ):
+ match = odds_event
+ break
+
+ if match is not None:
+ # Merge ESPN + Odds API data
+ matched_row = {
+ "espn_game_id": espn_game["game_id"],
+ "event_id": match["event_id"],
+ "game_date": espn_game["game_date"],
+ "home_team": espn_game["home_team"],
+ "away_team": espn_game["away_team"],
+ "home_score": espn_game["home_score"],
+ "away_score": espn_game["away_score"],
+ "odds_api_home": match["home_team"],
+ "odds_api_away": match["away_team"],
+ "neutral_site": espn_game["neutral_site"],
+ "conference_name": espn_game["conference_name"],
+ }
+ matches.append(matched_row)
+ else:
+ unmatched_count += 1
+ logger.debug(
+ f"No Odds API match for ESPN game: {espn_home} vs {espn_away} on {espn_date}"
+ )
+
+ matched_df = pd.DataFrame(matches)
+
+ logger.info(f"Matched {len(matched_df)} games")
+ logger.info(f"Unmatched {unmatched_count} games (no Odds API data)")
+ logger.info(f"Match rate: {len(matched_df) / len(espn_games) * 100:.1f}%")
+
+ return matched_df
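The row-by-row scan above can also be expressed as two vectorized merges, straight and swapped team order, to catch neutral-site listings. A sketch on toy frames:

```python
import pandas as pd

espn = pd.DataFrame(
    {"game_date_only": ["2026-01-10"], "home_team": ["Kansas"], "away_team": ["BYU"]}
)
odds = pd.DataFrame(
    {
        "game_date_only": ["2026-01-10"],
        "home_team": ["BYU"],  # listed in swapped order, e.g. a neutral-site event
        "away_team": ["Kansas"],
        "event_id": ["evt1"],
    }
)

# Straight order: home/away agree on both sides.
straight = espn.merge(odds, on=["game_date_only", "home_team", "away_team"], how="inner")
# Swapped order: ESPN's home team matches the odds feed's away team, and vice versa.
swapped = espn.merge(
    odds,
    left_on=["game_date_only", "home_team", "away_team"],
    right_on=["game_date_only", "away_team", "home_team"],
    how="inner",
)

matched = pd.concat([straight, swapped], ignore_index=True)
print(len(matched))  # 1
```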
+
+
+def build_training_datasets(
+ matched_games: pd.DataFrame,
+ engineer: FeatureEngineer,
+ output_dir: Path,
+) -> None:
+ """Build spreads and totals datasets from matched games.
+
+ Args:
+ matched_games: DataFrame with ESPN + Odds API matched games
+ engineer: FeatureEngineer instance
+ output_dir: Directory to save datasets
+ """
+ logger.info("Building training datasets...")
+
+ # Load KenPom data
+ kenpom_ratings = engineer.load_kenpom_ratings(season=2026)
+ kenpom_ff = engineer.load_kenpom_four_factors(season=2026)
+
+ # Get event IDs for line feature extraction
+ event_ids = matched_games["event_id"].tolist()
+
+ # Get opening/closing lines from Odds API
+ line_features = engineer.odds_db.get_opening_closing_spreads(
+ event_ids=event_ids, book_key="fanduel"
+ )
+
+ # Merge matched games with line features
+ merged = matched_games.merge(line_features, on="event_id", how="inner")
+
+ logger.info(f"Merged {len(merged)} games with line features")
+
+ # Build spreads dataset
+ spreads_features = []
+ spreads_targets = []
+
+ for _, row in merged.iterrows():
+ features = {}
+
+ # Favorite team features
+ fav_features = engineer.get_team_features(
+ row["favorite_team"], kenpom_ratings, kenpom_ff, prefix="fav_"
+ )
+ features.update(fav_features)
+
+ # Underdog team features
+ dog_features = engineer.get_team_features(
+ row["underdog_team"], kenpom_ratings, kenpom_ff, prefix="dog_"
+ )
+ features.update(dog_features)
+
+ # Matchup features
+ if "fav_adj_em" in features and "dog_adj_em" in features:
+ features["em_diff"] = features["fav_adj_em"] - features["dog_adj_em"]
+ if "fav_adj_o" in features and "dog_adj_d" in features:
+ features["fav_o_vs_dog_d"] = features["fav_adj_o"] - features["dog_adj_d"]
+ if "dog_adj_o" in features and "fav_adj_d" in features:
+ features["dog_o_vs_fav_d"] = features["dog_adj_o"] - features["fav_adj_d"]
+
+ # Line features
+ features["opening_spread"] = row.get("opening_spread", None)
+ features["closing_spread"] = row.get("closing_spread", None)
+ features["line_movement"] = row.get("line_movement", 0)
+
+ # Target: Did favorite cover?
+ home_score = row["home_score"]
+ away_score = row["away_score"]
+ home_team = row["odds_api_home"]
+ favorite_team = row["favorite_team"]
+ closing_spread = row["closing_spread"]
+
+ # Margin from the favorite's perspective
+ margin = home_score - away_score if home_team == favorite_team else away_score - home_score
+
+ # Assumes closing_spread is the positive spread magnitude (favorite perspective),
+ # per the repo's canonical convention; a push counts as not covered.
+ favorite_covered = margin > closing_spread
+
+ spreads_features.append(features)
+ spreads_targets.append(1 if favorite_covered else 0)
+
+ # Build totals dataset
+ totals_list = []
+ for event_id in event_ids:
+ totals = engineer.odds_db.get_canonical_totals(event_id=event_id, book_key="fanduel")
+ if len(totals) > 0:
+ opening_total = totals.iloc[0]["total"]
+ closing_total = totals.iloc[-1]["total"]
+ totals_list.append(
+ {
+ "event_id": event_id,
+ "opening_total": opening_total,
+ "closing_total": closing_total,
+ "total_movement": closing_total - opening_total,
+ }
+ )
+
+ totals_df = pd.DataFrame(totals_list)
+ merged_totals = matched_games.merge(totals_df, on="event_id", how="inner")
+
+ logger.info(f"Merged {len(merged_totals)} games with totals")
+
+ totals_features = []
+ totals_targets = []
+
+ for _, row in merged_totals.iterrows():
+ features = {}
+
+ # Home team features
+ home_features = engineer.get_team_features(
+ row["odds_api_home"], kenpom_ratings, kenpom_ff, prefix="home_"
+ )
+ features.update(home_features)
+
+ # Away team features
+ away_features = engineer.get_team_features(
+ row["odds_api_away"], kenpom_ratings, kenpom_ff, prefix="away_"
+ )
+ features.update(away_features)
+
+ # Tempo features
+ if "home_adj_t" in features and "away_adj_t" in features:
+ features["avg_tempo"] = (features["home_adj_t"] + features["away_adj_t"]) / 2
+ features["tempo_diff"] = abs(features["home_adj_t"] - features["away_adj_t"])
+
+ # Combined offense/defense
+ if "home_adj_o" in features and "away_adj_o" in features:
+ features["total_offense"] = features["home_adj_o"] + features["away_adj_o"]
+ if "home_adj_d" in features and "away_adj_d" in features:
+ features["total_defense"] = features["home_adj_d"] + features["away_adj_d"]
+
+ # Line features
+ features["opening_total"] = row["opening_total"]
+ features["closing_total"] = row["closing_total"]
+ features["total_movement"] = row["total_movement"]
+
+ # Target: Did it go over?
+ actual_total = row["home_score"] + row["away_score"]
+ went_over = actual_total > row["closing_total"]
+
+ totals_features.append(features)
+ totals_targets.append(1 if went_over else 0)
+
+ # Save spreads dataset
+ spreads_df = pd.DataFrame(spreads_features)
+ spreads_df["target"] = spreads_targets
+
+ spreads_output = output_dir / "spreads_espn_odds_2026.parquet"
+ write_parquet(spreads_df.to_dict(orient="records"), spreads_output)
+
+ logger.info(
+ f"[OK] Spreads dataset: {len(spreads_df)} games, {len(spreads_df.columns)} features"
+ )
+ logger.info(f"Saved to {spreads_output}")
+
+ # Save totals dataset
+ totals_df = pd.DataFrame(totals_features)
+ totals_df["target"] = totals_targets
+
+ totals_output = output_dir / "totals_espn_odds_2026.parquet"
+ write_parquet(totals_df.to_dict(orient="records"), totals_output)
+
+ logger.info(f"[OK] Totals dataset: {len(totals_df)} games, {len(totals_df.columns)} features")
+ logger.info(f"Saved to {totals_output}")
+
+ # Show target distributions
+ logger.info("\n=== Dataset Statistics ===")
+ logger.info(
+ f"Spreads - Favorite covered: {sum(spreads_targets)} / {len(spreads_targets)} "
+ f"({sum(spreads_targets) / len(spreads_targets) * 100:.1f}%)"
+ )
+ logger.info(
+ f"Totals - Went over: {sum(totals_targets)} / {len(totals_targets)} "
+ f"({sum(totals_targets) / len(totals_targets) * 100:.1f}%)"
+ )
+
+
+def main() -> None:
+ """Build training datasets from ESPN + Odds API."""
+ parser = argparse.ArgumentParser(
+ description="Build training datasets from ESPN results + Odds API lines"
+ )
+ parser.add_argument(
+ "--espn-data",
+ type=Path,
+ required=True,
+ help="Path to ESPN season results parquet",
+ )
+ parser.add_argument(
+ "--odds-db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to Odds API database",
+ )
+ parser.add_argument(
+ "--kenpom-path",
+ type=Path,
+ default=Path("data/kenpom"),
+ help="Path to KenPom data directory",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=Path,
+ default=Path("data/ml"),
+ help="Output directory for datasets",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+ )
+
+ args = parser.parse_args()
+
+ # Load ESPN results
+ logger.info(f"Loading ESPN results from {args.espn_data}...")
+ espn_games = pd.read_parquet(args.espn_data)
+ logger.info(f"Loaded {len(espn_games)} ESPN games")
+
+ # Initialize feature engineer
+ with FeatureEngineer(
+ kenpom_path=args.kenpom_path,
+ espn_path=Path("data/espn"), # Not used for this workflow
+ odds_db_path=args.odds_db,
+ ) as engineer:
+ # Match ESPN to Odds API
+ matched_games = match_espn_to_odds_api(espn_games, engineer.odds_db, engineer.team_mapper)
+
+ if len(matched_games) == 0:
+ logger.error("No matches found. Cannot build datasets.")
+ return
+
+ # Build training datasets
+ build_training_datasets(matched_games, engineer, args.output_dir)
+
+ logger.info("\n[OK] Dataset building complete!")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/collect_overtime_odds_csv.py b/scripts/archive/2026-02-deprecated/collect_overtime_odds_csv.py
new file mode 100644
index 000000000..aa67f134b
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/collect_overtime_odds_csv.py
@@ -0,0 +1,106 @@
+#!/usr/bin/env python3
+"""Run Overtime.ag College Basketball odds capture and write today's matchups to CSV.
+
+Uses the Puppeteer capture script to scrape Overtime.ag (Basketball > College Basketball
+and College Extra), parses spread and total odds per the DOM structure, and writes one
+row per game to a CSV file. Requires Node.js, Puppeteer, and an authenticated
+Overtime.ag session (log in manually or use --user-data-dir with a persistent profile).
+
+Output columns (per betting-data-normalizing):
+ category, game_date_str, game_time_str, away_team, home_team, away_rotation,
+ home_rotation, away_spread_raw, home_spread_raw, total_over_raw, total_under_raw,
+ spread_magnitude, favorite_team, spread_favorite_price, spread_underdog_price,
+ total_points, total_over_price, total_under_price, raw_matchup
+
+Usage:
+ uv run python scripts/collect_overtime_odds_csv.py
+ uv run python scripts/collect_overtime_odds_csv.py --output data/overtime/todays_odds.csv
+ uv run python scripts/collect_overtime_odds_csv.py --headless false \\
+ --user-data-dir ./overtime-profile
+"""
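The raw spread columns listed above reduce to the canonical `spread_magnitude` + `favorite_team` form described in the project rules. A hedged sketch of that reduction (the `"-3.5 -110"` raw string format is an assumption about the scraped data, not a confirmed schema):

```python
def canonicalize_spread(away_team: str, home_team: str, away_spread_raw: str) -> tuple[str, float]:
    """Derive (favorite_team, spread_magnitude) from the away side's raw spread string.

    Assumes the raw string begins with a signed line, e.g. "-3.5 -110";
    a negative away line means the away team is favored.
    """
    line = float(away_spread_raw.split()[0])
    favorite = away_team if line < 0 else home_team
    return favorite, abs(line)

print(canonicalize_spread("BYU", "Kansas", "-3.5 -110"))  # ('BYU', 3.5)
```

Storing one (favorite, magnitude) pair per game is what lets downstream code avoid averaging ±spread rows.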
+
+from __future__ import annotations
+
+import argparse
+import logging
+import subprocess # nosec B404
+from datetime import date
+from pathlib import Path
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+
+_REPO_ROOT = Path(__file__).resolve().parent.parent
+SCRIPT_PATH = _REPO_ROOT / "puppeteer" / "capture_overtime_college_basketball_odds.js"
+DEFAULT_OUTPUT_DIR = _REPO_ROOT / "data" / "overtime"
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Capture Overtime.ag NCAAB odds and write CSV of today's matchups."
+ )
+ parser.add_argument(
+ "--output",
+ "-o",
+ type=Path,
+ default=None,
+ help=(
+ "Output CSV path "
+ "(default: data/overtime/overtime_college_basketball_odds_YYYY-MM-DD.csv)"
+ ),
+ )
+ parser.add_argument(
+ "--headless",
+ choices=("true", "false"),
+ default="true",
+ help="Run browser headless (default: true)",
+ )
+ parser.add_argument(
+ "--user-data-dir",
+ type=str,
+ default=None,
+ help="Chrome user data dir for persisted Overtime.ag login",
+ )
+ args = parser.parse_args()
+
+ if args.output is not None:
+ out_path = Path(args.output)
+ else:
+ DEFAULT_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+ out_path = (
+ DEFAULT_OUTPUT_DIR / f"overtime_college_basketball_odds_{date.today().isoformat()}.csv"
+ )
+
+ out_path = out_path.resolve()
+ out_path.parent.mkdir(parents=True, exist_ok=True)
+
+ if not SCRIPT_PATH.exists():
+ logger.error("Capture script not found: %s", SCRIPT_PATH)
+ raise SystemExit(1)
+
+ cmd: list[str] = [
+ "node",
+ str(SCRIPT_PATH),
+ "--output",
+ str(out_path),
+ "--headless",
+ args.headless,
+ ]
+ if args.user_data_dir:
+ cmd.extend(["--user-data-dir", args.user_data_dir])
+
+ logger.info("Running capture (output=%s)...", out_path)
+ result = subprocess.run( # nosec B603
+ cmd,
+ cwd=str(_REPO_ROOT),
+ timeout=120,
+ )
+ if result.returncode != 0:
+ logger.error("Capture failed with exit code %s", result.returncode)
+ raise SystemExit(result.returncode)
+
+ logger.info("CSV written: %s", out_path)
+
+
+if __name__ == "__main__":
+ main()
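The output columns above follow the repo's canonical-spread rule: one `spread_magnitude` plus `favorite_team` per game, never averaged ±spread rows. A minimal sketch of that reduction, assuming mirror-image raw rows as scraped (the `canonicalize_spread` helper is illustrative, not part of this script):

```python
def canonicalize_spread(
    away_team: str,
    home_team: str,
    away_spread_raw: float,
    home_spread_raw: float,
) -> tuple[float, str]:
    """Reduce a +/- spread pair to (spread_magnitude, favorite_team).

    The favorite is the side with the negative raw spread; the two raw values
    must be mirror images (e.g. -3.5 / +3.5). A pick'em (0 / 0) defaults to
    labeling the away side as the favorite.
    """
    if abs(away_spread_raw + home_spread_raw) > 0.01:
        # Never average mismatched rows; fail fast instead
        raise ValueError("spread rows are not mirror images; refusing to combine")
    if home_spread_raw < 0:
        return abs(home_spread_raw), home_team
    return abs(away_spread_raw), away_team
```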
diff --git a/scripts/archive/2026-02-deprecated/collect_overtime_realtime.py b/scripts/archive/2026-02-deprecated/collect_overtime_realtime.py
new file mode 100644
index 000000000..dd915c54a
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/collect_overtime_realtime.py
@@ -0,0 +1,182 @@
+"""Collect real-time odds from Overtime.ag via SignalR WebSocket.
+
+This script connects to Overtime.ag's SignalR hub via Chrome DevTools Protocol
+and streams live line changes. Data is saved to Parquet for analysis.
+
+Prerequisites:
+ 1. Launch Chrome with remote debugging:
+       chrome.exe --remote-debugging-port=9222 \\
+ --user-data-dir=%USERPROFILE%\\.chrome-profiles\\overtime-ag
+
+ 2. Open https://www.overtime.ag/sports#/ and log in
+
+ 3. Navigate to Basketball -> College Basketball
+
+ 4. Run this script:
+ uv run python scripts/collect_overtime_realtime.py --duration 3600
+
+Usage:
+ # Collect for 1 hour
+ uv run python scripts/collect_overtime_realtime.py --duration 3600
+
+ # Collect for full game window (3 hours)
+ uv run python scripts/collect_overtime_realtime.py --duration 10800
+
+ # Collect indefinitely (Ctrl+C to stop)
+ uv run python scripts/collect_overtime_realtime.py
+
+ # Custom output location
+ uv run python scripts/collect_overtime_realtime.py --output data/raw/overtime_lines.parquet
+
+Output:
+ Parquet file with columns:
+ - timestamp: When line changed (UTC)
+ - game_num: Overtime.ag game number
+ - market_type: SPREAD, TOTAL, MONEYLINE
+ - line_points: Magnitude only (positive)
+ - side_role: FAVORITE/UNDERDOG or OVER/UNDER
+ - team: Team name (if available)
+ - money1, money2: American odds both sides
+ - is_steam: True if AutoMover
+ - captured_at: When we captured it
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import datetime
+from pathlib import Path
+
+# Add project root to path (must be before imports from sports_betting_edge)
+PROJECT_ROOT = Path(__file__).parent.parent
+sys.path.insert(0, str(PROJECT_ROOT / "src"))
+
+from sports_betting_edge.adapters.overtime_ag import ( # noqa: E402
+ OvertimeSignalRClient,
+)
+from sports_betting_edge.core.exceptions import ConfigurationError # noqa: E402
+
+# Setup logging
+logging.basicConfig(
+ level=logging.INFO,
+ format="[%(asctime)s] %(levelname)s - %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+)
+logger = logging.getLogger(__name__)
+
+
+async def collect_line_changes(
+ duration_seconds: int | None,
+ output_path: Path,
+) -> int:
+ """Collect line changes and save to Parquet.
+
+ Args:
+ duration_seconds: How long to collect (None = indefinite)
+ output_path: Where to save Parquet file
+
+ Returns:
+ Number of line changes collected
+ """
+ logger.info("Starting Overtime.ag SignalR collection")
+ logger.info("Duration: %s", f"{duration_seconds}s" if duration_seconds else "indefinite")
+ logger.info("Output: %s", output_path)
+
+ # Create output directory
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+
+ # Collect line changes
+ line_changes = []
+
+ try:
+ async with OvertimeSignalRClient() as client:
+ logger.info("Connected to SignalR stream")
+
+ async for change in client.stream_line_changes(duration_seconds):
+ # Log each change
+ team_or_game = change.team or f"Game#{change.game_num}"
+                line_display = (
+                    f"{change.line_points:+.1f}" if change.line_points is not None else "ML"
+                )
+ steam_flag = " [STEAM]" if change.is_steam else ""
+
+ logger.info(
+ "%s: %s %s (%s)%s",
+ change.sport_sub_type,
+ team_or_game,
+ line_display,
+ change.market_type.value,
+ steam_flag,
+ )
+
+ # Collect for Parquet export
+ line_changes.append(change.model_dump())
+
+ except ConfigurationError as e:
+ logger.error("Configuration error: %s", e)
+ logger.error(
+ "Make sure Chrome is running with remote debugging and overtime.ag tab is open"
+ )
+ return 0
+ except KeyboardInterrupt:
+ logger.info("Collection interrupted by user")
+ except Exception as e:
+ logger.exception("Unexpected error: %s", e)
+ return 0
+
+ # Save to Parquet
+ if line_changes:
+ try:
+ import polars as pl
+
+ df = pl.DataFrame(line_changes)
+ df.write_parquet(output_path)
+ logger.info("Saved %d line changes to %s", len(line_changes), output_path)
+ except ImportError:
+ logger.warning("polars not installed, skipping Parquet export")
+ logger.info("Install with: uv add polars")
+ except Exception as e:
+ logger.error("Failed to save Parquet: %s", e)
+ else:
+ logger.warning("No line changes collected")
+
+ return len(line_changes)
+
+
+def main() -> None:
+ """CLI entry point."""
+ parser = argparse.ArgumentParser(
+ description="Collect real-time odds from Overtime.ag SignalR WebSocket"
+ )
+ parser.add_argument(
+ "--duration",
+ "-d",
+ type=int,
+ default=None,
+ help="Collection duration in seconds (default: indefinite)",
+ )
+ default_filename = f"overtime_lines_{datetime.now():%Y%m%d_%H%M%S}.parquet"
+ parser.add_argument(
+ "--output",
+ "-o",
+ type=Path,
+ default=PROJECT_ROOT / "data" / "raw" / default_filename,
+ help="Output Parquet file path",
+ )
+
+ args = parser.parse_args()
+
+ # Run collection
+ count = asyncio.run(collect_line_changes(args.duration, args.output))
+
+ if count > 0:
+ logger.info("Collection complete: %d line changes", count)
+ sys.exit(0)
+ else:
+ logger.error("Collection failed or no data collected")
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
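The repo's moneyline rule is to convert American odds (the `money1`/`money2` columns in the Parquet output above) to implied probability before any aggregation. The standard conversion, as a stand-alone helper (note the result still includes the book's vig):

```python
def american_to_implied_prob(odds: int) -> float:
    """Convert American odds to implied win probability (vig included)."""
    if odds == 0:
        raise ValueError("American odds cannot be 0")
    if odds < 0:
        # e.g. -110 -> 110 / 210 ~ 0.524
        return -odds / (-odds + 100)
    # e.g. +150 -> 100 / 250 = 0.400
    return 100 / (odds + 100)
```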
diff --git a/scripts/archive/2026-02-deprecated/collect_overtime_scheduled.py b/scripts/archive/2026-02-deprecated/collect_overtime_scheduled.py
new file mode 100644
index 000000000..a3a2e4428
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/collect_overtime_scheduled.py
@@ -0,0 +1,297 @@
+"""Scheduled Overtime.ag line collection - runs every 10 minutes until gametime.
+
+Collects live line snapshots at regular intervals for games that haven't started yet.
+Automatically stops collecting for games once they begin. Ideal for building
+line movement datasets and detecting closing line value (CLV).
+
+Architecture:
+ 1. Fetch today's game schedule (ESPN or similar)
+ 2. Collect current lines from Overtime.ag every 10 minutes
+ 3. Save snapshots to timestamped Parquet files
+ 4. Stop collecting for games that have started
+ 5. Exit when all games have begun or end of day
+
+Usage:
+ # Run continuously until all games start
+ uv run python scripts/collect_overtime_scheduled.py
+
+ # Custom interval (5 minutes)
+ uv run python scripts/collect_overtime_scheduled.py --interval 300
+
+ # Stop at specific time
+ uv run python scripts/collect_overtime_scheduled.py --stop-at "23:00"
+
+ # Test mode (single collection)
+ uv run python scripts/collect_overtime_scheduled.py --test
+
+Output:
+ data/raw/overtime_snapshots/
+ overtime_snapshot_20240202_140000.parquet
+ overtime_snapshot_20240202_141000.parquet
+ ...
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import datetime, time
+from pathlib import Path
+
+# Add project root to path
+PROJECT_ROOT = Path(__file__).parent.parent
+sys.path.insert(0, str(PROJECT_ROOT / "src"))
+
+from sports_betting_edge.adapters.overtime_ag import ( # noqa: E402
+ OvertimeSignalRClient,
+)
+from sports_betting_edge.core.exceptions import ConfigurationError # noqa: E402
+
+# Setup logging
+logging.basicConfig(
+ level=logging.INFO,
+ format="[%(asctime)s] %(levelname)s - %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+)
+logger = logging.getLogger(__name__)
+
+
+class ScheduledLineCollector:
+ """Collects Overtime.ag lines at regular intervals until games start."""
+
+ def __init__(
+ self,
+ interval_seconds: int = 600,
+ output_dir: Path | None = None,
+ stop_time: time | None = None,
+ ):
+ """Initialize the scheduled collector.
+
+ Args:
+ interval_seconds: Seconds between collections (default: 600 = 10 min)
+ output_dir: Where to save snapshots
+ stop_time: Stop collecting at this time (None = run until all games start)
+ """
+ self.interval_seconds = interval_seconds
+ self.output_dir = output_dir or (PROJECT_ROOT / "data" / "raw" / "overtime_snapshots")
+ self.stop_time = stop_time
+ self.games_started: set[int] = set()
+ self.collection_count = 0
+
+ async def run(self, test_mode: bool = False) -> int:
+ """Run the scheduled collector.
+
+ Args:
+ test_mode: If True, run single collection and exit
+
+ Returns:
+ Total snapshots collected
+ """
+ logger.info("Starting scheduled Overtime.ag line collector")
+ logger.info(
+ "Interval: %d seconds (%d minutes)", self.interval_seconds, self.interval_seconds // 60
+ )
+ logger.info("Output: %s", self.output_dir)
+
+ # Create output directory
+ self.output_dir.mkdir(parents=True, exist_ok=True)
+
+ # Test mode - single collection
+ if test_mode:
+ logger.info("TEST MODE - Single collection")
+ await self._collect_snapshot()
+ return self.collection_count
+
+ # Main collection loop
+ while True:
+ try:
+ # Collect current lines
+ await self._collect_snapshot()
+
+ # Check stop conditions
+ if self._should_stop():
+ logger.info("Stopping collection (stop condition met)")
+ break
+
+ # Wait for next interval
+ logger.info("Next collection in %d seconds", self.interval_seconds)
+ await asyncio.sleep(self.interval_seconds)
+
+ except KeyboardInterrupt:
+ logger.info("Collection interrupted by user")
+ break
+ except Exception as e:
+ logger.exception("Error in collection loop: %s", e)
+ # Wait before retrying
+ await asyncio.sleep(60)
+
+ logger.info("Collection complete. Total snapshots: %d", self.collection_count)
+ return self.collection_count
+
+ async def _collect_snapshot(self) -> None:
+ """Collect a single snapshot of current lines."""
+ timestamp = datetime.now()
+ snapshot_time = timestamp.strftime("%Y%m%d_%H%M%S")
+ output_path = self.output_dir / f"overtime_snapshot_{snapshot_time}.parquet"
+
+ logger.info("[->] Collecting snapshot at %s", timestamp.strftime("%H:%M:%S"))
+
+ try:
+ # Collect lines for short duration (30 seconds to get current state)
+ line_changes = []
+
+ async with OvertimeSignalRClient() as client:
+ async for change in client.stream_line_changes(duration_seconds=30):
+ line_changes.append(change.model_dump())
+
+                    # TODO: integrate with the game schedule to add games that
+                    # have tipped off to self.games_started so _should_stop()
+                    # can exit once every game has started
+
+ if line_changes:
+ # Save snapshot
+ try:
+ import polars as pl
+
+ df = pl.DataFrame(line_changes)
+ df.write_parquet(output_path)
+ logger.info(
+ "[OK] Saved %d lines to %s",
+ len(line_changes),
+ output_path.name,
+ )
+ self.collection_count += 1
+
+ # Log interesting movements
+ steam_count = sum(1 for lc in line_changes if lc.get("is_steam"))
+ if steam_count > 0:
+ logger.info(
+ "[STEAM] %d steam moves detected in this snapshot",
+ steam_count,
+ )
+ except ImportError:
+ logger.warning("polars not installed, skipping Parquet save")
+ logger.info("Install with: uv add polars")
+ else:
+ logger.warning("[WARNING] No lines collected in this snapshot")
+
+ except ConfigurationError as e:
+ logger.error("Configuration error: %s", e)
+ logger.error(
+ "Make sure Chrome is running with remote debugging and overtime.ag tab is open"
+ )
+ raise
+ except Exception as e:
+ logger.exception("Failed to collect snapshot: %s", e)
+ raise
+
+ def _should_stop(self) -> bool:
+ """Check if we should stop collecting.
+
+ Returns:
+ True if stop condition met
+ """
+ # Check stop time
+ if self.stop_time:
+ current_time = datetime.now().time()
+ if current_time >= self.stop_time:
+ logger.info("Stop time reached: %s", self.stop_time)
+ return True
+
+ # Could add logic here to check if all games have started
+ # by integrating with ESPN schedule or checking game times
+
+ return False
+
+
+def parse_time(time_str: str) -> time:
+ """Parse time string in HH:MM format.
+
+ Args:
+ time_str: Time string like "23:00"
+
+ Returns:
+ time object
+
+ Raises:
+ ValueError: If format invalid
+ """
+ try:
+ hour, minute = map(int, time_str.split(":"))
+ return time(hour, minute)
+ except (ValueError, AttributeError) as e:
+ raise ValueError(f"Invalid time format: {time_str}. Use HH:MM") from e
+
+
+def main() -> None:
+ """CLI entry point."""
+ parser = argparse.ArgumentParser(
+ description="Collect Overtime.ag lines every N minutes until gametime"
+ )
+ parser.add_argument(
+ "--interval",
+ "-i",
+ type=int,
+ default=600,
+ help="Collection interval in seconds (default: 600 = 10 minutes)",
+ )
+ parser.add_argument(
+ "--output-dir",
+ "-o",
+ type=Path,
+ default=None,
+ help="Output directory for snapshots (default: data/raw/overtime_snapshots)",
+ )
+ parser.add_argument(
+ "--stop-at",
+ "-s",
+ type=str,
+ default=None,
+ help="Stop collecting at this time (HH:MM format, e.g., '23:00')",
+ )
+ parser.add_argument(
+ "--test",
+ "-t",
+ action="store_true",
+ help="Test mode: run single collection and exit",
+ )
+
+ args = parser.parse_args()
+
+ # Parse stop time if provided
+ stop_time = None
+ if args.stop_at:
+ try:
+ stop_time = parse_time(args.stop_at)
+ logger.info("Will stop collecting at %s", args.stop_at)
+ except ValueError as e:
+ logger.error("Invalid stop time: %s", e)
+ sys.exit(1)
+
+ # Create collector
+ collector = ScheduledLineCollector(
+ interval_seconds=args.interval,
+ output_dir=args.output_dir,
+ stop_time=stop_time,
+ )
+
+ # Run collection
+ try:
+ count = asyncio.run(collector.run(test_mode=args.test))
+ logger.info("Collected %d snapshots", count)
+ sys.exit(0)
+ except ConfigurationError:
+ logger.error("Configuration error - check Chrome and overtime.ag setup")
+ sys.exit(1)
+ except Exception as e:
+ logger.exception("Collection failed: %s", e)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/comprehensive_odds_with_overtime.py b/scripts/archive/2026-02-deprecated/comprehensive_odds_with_overtime.py
new file mode 100644
index 000000000..f26354416
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/comprehensive_odds_with_overtime.py
@@ -0,0 +1,343 @@
+"""Comprehensive odds report with overtime.ag comparison.
+
+Compares:
+- KenPom FanMatch predictions
+- Odds API bookmakers (DraftKings, FanDuel, BetMGM)
+- Overtime.ag live odds (from SignalR stream)
+"""
+
+import re
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+
+def parse_overtime_log(log_file: Path) -> dict[tuple[str, str], dict]:
+ """Parse overtime.ag log file for latest odds by team/market.
+
+ Returns:
+ Dict mapping (team, market) -> {line, money1, money2, timestamp}
+ """
+ if not log_file.exists():
+ return {}
+
+ current_lines = {}
+
+ with open(log_file) as f:
+ for line in f:
+ if "|" not in line or "Monitor" in line or "Game#" in line:
+ continue
+
+ parts = [p.strip() for p in line.split("|")]
+ if len(parts) < 5:
+ continue
+
+ timestamp = parts[0]
+ team = parts[1]
+ market = parts[2]
+ line_val = parts[3]
+ odds = parts[4]
+
+ # Parse odds (format: money1/money2)
+ odds_match = re.search(r"(-?\d+)/(-?\d+)", odds)
+ if odds_match:
+ money1 = int(odds_match.group(1))
+ money2 = int(odds_match.group(2))
+ else:
+ money1 = None
+ money2 = None
+
+ # Store latest line for each team/market combo
+ key = (team, market)
+ current_lines[key] = {
+ "timestamp": timestamp,
+ "line": line_val,
+ "money1": money1,
+ "money2": money2,
+ }
+
+ return current_lines
+
+
+def match_overtime_team(overtime_team: str, away_team: str, home_team: str) -> tuple[bool, str]:
+ """Match overtime.ag team name to prediction team names.
+
+ Returns:
+ (is_match, matched_team_name)
+ """
+ # Try various matching strategies
+ overtime_lower = overtime_team.lower()
+ away_lower = away_team.lower()
+ home_lower = home_team.lower()
+
+ # Exact match
+ if overtime_lower in away_lower or away_lower in overtime_lower:
+ return True, away_team
+ if overtime_lower in home_lower or home_lower in overtime_lower:
+ return True, home_team
+
+ # Word-level matching (handle "North Carolina" vs "UNC")
+ overtime_words = set(overtime_lower.split())
+ away_words = set(away_lower.split())
+ home_words = set(home_lower.split())
+
+ if overtime_words & away_words:
+ return True, away_team
+ if overtime_words & home_words:
+ return True, home_team
+
+ return False, ""
+
+
+print("=" * 120)
+print(
+ f"COMPREHENSIVE ODDS & PREDICTIONS REPORT - "
+ f"{datetime.now().strftime('%A, %B %d, %Y - %I:%M %p')}"
+)
+print("=" * 120)
+
+# Load predictions (includes KenPom data)
+print("\n[1/4] Loading predictions with KenPom data...")
+preds = pd.read_csv("data/outputs/predictions/2026-02-02.csv")
+print(f" Loaded {len(preds)} games")
+
+# Load odds from database
+print("[2/4] Loading latest odds from all bookmakers...")
+conn = sqlite3.connect("data/odds_api/odds_api.sqlite3")
+
+# Get today's events
+events = pd.read_sql(
+ """
+ SELECT event_id, home_team, away_team, commence_time
+ FROM events
+ WHERE DATE(commence_time) >= '2026-02-02'
+ ORDER BY commence_time
+""",
+ conn,
+)
+
+# Get latest observations for all markets
+if len(events) > 0:
+    # Inline the event IDs for the IN clause (values come from our own
+    # events table, not user input, so interpolation is acceptable here)
+    event_ids = "','".join(events["event_id"].tolist())
+
+    # Get latest odds (most recent update per event/book/market/outcome).
+    # Relies on SQLite's bare-column rule: with a single MAX() aggregate, the
+    # non-aggregated columns are taken from the row holding the max value.
+ latest_odds = pd.read_sql(
+ f"""
+ SELECT
+ event_id,
+ book_key,
+ market_key,
+ outcome_name,
+ point,
+ price_american,
+ MAX(book_last_update) as last_update
+ FROM observations
+ WHERE event_id IN ('{event_ids}')
+ GROUP BY event_id, book_key, market_key, outcome_name
+ """,
+ conn,
+ )
+
+ print(f" Loaded {len(latest_odds)} odds observations")
+else:
+ latest_odds = pd.DataFrame()
+
+conn.close()
+
+# Parse overtime.ag log
+print("[3/4] Parsing overtime.ag log file...")
+log_file = Path("data/logs/line_movements_2026-02-02.log")
+overtime_lines = parse_overtime_log(log_file)
+print(f" Found {len(overtime_lines)} overtime.ag line entries")
+
+print("[4/4] Generating comprehensive report...\n")
+
+# Generate report
+print("=" * 120)
+print("TODAY'S GAMES - COMPLETE ODDS & PREDICTIONS")
+print("=" * 120)
+
+for idx, pred_game in preds.iterrows():
+ away = pred_game["away_team"]
+ home = pred_game["home_team"]
+ tip_time = pred_game["commence_time"]
+
+ print(f"\n{'=' * 120}")
+ print(f"GAME #{idx + 1}: {away} @ {home}")
+ print(f"Tip-off: {tip_time}")
+ print(f"{'=' * 120}")
+
+ # Match to odds database event
+ matching_event = events[
+ (
+ (events["away_team"].str.contains(away.split()[0], case=False, na=False))
+ & (events["home_team"].str.contains(home.split()[0], case=False, na=False))
+ )
+ | (
+ (events["home_team"].str.contains(away.split()[0], case=False, na=False))
+ & (events["away_team"].str.contains(home.split()[0], case=False, na=False))
+ )
+ ]
+
+ # KENPOM PREDICTIONS
+ print("\nKENPOM PREDICTIONS:")
+ print(
+ f" Favorite: {pred_game['favorite_team'][:35]:35} "
+ f"Spread: {pred_game['spread_magnitude']:+.1f}"
+ )
+ print(f" Underdog: {pred_game['underdog_team'][:35]:35}")
+ print(f" Total: {pred_game['total_points']:.1f}")
+ print(
+ f" Model Probabilities: Favorite Cover: {pred_game['favorite_cover_prob']:.1%} "
+ f"| Over: {pred_game['over_prob']:.1%}"
+ )
+
+ # VALUE INDICATORS
+ spread_edge = pred_game["spread_edge"]
+ total_edge = pred_game["total_edge"]
+
+ value_str = ""
+ if abs(spread_edge) >= 0.05:
+ side = "Favorite" if spread_edge > 0 else "Underdog"
+ value_str += f" [VALUE] Spread {side}: {spread_edge:+.1%} edge | "
+ if abs(total_edge) >= 0.05:
+ side = "Over" if total_edge > 0 else "Under"
+ value_str += f"[VALUE] Total {side}: {total_edge:+.1%} edge"
+
+ if value_str:
+ print(f"\n{value_str}")
+
+ # BOOKMAKER ODDS
+ print("\nBOOKMAKER ODDS:")
+
+ has_any_odds = False
+
+ if len(matching_event) > 0:
+ event_id = matching_event.iloc[0]["event_id"]
+ event_odds = latest_odds[latest_odds["event_id"] == event_id]
+
+ if len(event_odds) > 0:
+ has_any_odds = True
+
+ # Show odds for major books
+ for book in ["draftkings", "fanduel", "betmgm"]:
+ book_odds = event_odds[event_odds["book_key"] == book]
+
+ if len(book_odds) == 0:
+ continue
+
+ print(f"\n {book.upper()}:")
+
+ # SPREAD
+ spread_odds = book_odds[book_odds["market_key"] == "spreads"]
+ if len(spread_odds) >= 2:
+ home_spread = spread_odds[
+ spread_odds["outcome_name"] == matching_event.iloc[0]["home_team"]
+ ]
+ away_spread = spread_odds[
+ spread_odds["outcome_name"] == matching_event.iloc[0]["away_team"]
+ ]
+
+ if len(home_spread) > 0 and len(away_spread) > 0:
+ h_line = home_spread.iloc[0]["point"]
+ h_price = home_spread.iloc[0]["price_american"]
+ a_line = away_spread.iloc[0]["point"]
+ a_price = away_spread.iloc[0]["price_american"]
+
+ print(
+ f" Spread: {home[:20]:20} {h_line:+.1f} ({h_price:+d}) | "
+ f"{away[:20]:20} {a_line:+.1f} ({a_price:+d})"
+ )
+
+ # MONEYLINE
+ ml_odds = book_odds[book_odds["market_key"] == "h2h"]
+ if len(ml_odds) >= 2:
+ home_ml = ml_odds[
+ ml_odds["outcome_name"] == matching_event.iloc[0]["home_team"]
+ ]
+ away_ml = ml_odds[
+ ml_odds["outcome_name"] == matching_event.iloc[0]["away_team"]
+ ]
+
+ if len(home_ml) > 0 and len(away_ml) > 0:
+ h_ml = home_ml.iloc[0]["price_american"]
+ a_ml = away_ml.iloc[0]["price_american"]
+
+ print(
+ f" Moneyline: {home[:20]:20} {h_ml:+5d} | "
+ f"{away[:20]:20} {a_ml:+5d}"
+ )
+
+ # TOTAL
+ total_odds = book_odds[book_odds["market_key"] == "totals"]
+ if len(total_odds) >= 2:
+ over = total_odds[total_odds["outcome_name"] == "Over"]
+ under = total_odds[total_odds["outcome_name"] == "Under"]
+
+ if len(over) > 0 and len(under) > 0:
+ total_line = over.iloc[0]["point"]
+ over_price = over.iloc[0]["price_american"]
+ under_price = under.iloc[0]["price_american"]
+
+ print(
+ f" Total: O {total_line:.1f} ({over_price:+d}) | "
+ f"U {total_line:.1f} ({under_price:+d})"
+ )
+
+ # OVERTIME.AG ODDS
+ # Try to find overtime.ag lines for this game
+ overtime_spread = None
+ overtime_total = None
+ overtime_ml = None
+
+ for (team, market), data in overtime_lines.items():
+ is_match, matched_team = match_overtime_team(team, away, home)
+
+ if not is_match:
+ continue
+
+ if market == "SPREAD":
+ overtime_spread = (team, data)
+ elif market == "TOTAL":
+ overtime_total = (team, data)
+ elif market == "MONEYLINE":
+ overtime_ml = (team, data)
+
+ # Display overtime.ag odds if found
+ if overtime_spread or overtime_total or overtime_ml:
+ has_any_odds = True
+ print("\n OVERTIME.AG:")
+
+ if overtime_spread:
+ team, data = overtime_spread
+ line = data["line"]
+ odds = f"{data['money1']:+d}/{data['money2']:+d}" if data["money1"] else "N/A"
+ print(f" Spread: {team[:20]:20} {line} ({odds})")
+
+ if overtime_ml:
+ team, data = overtime_ml
+ odds = f"{data['money1']:+d}/{data['money2']:+d}" if data["money1"] else "N/A"
+ print(f" Moneyline: {team[:20]:20} {odds}")
+
+ if overtime_total:
+ team, data = overtime_total
+ line = data["line"]
+ odds = f"O:{data['money1']:+d}/U:{data['money2']:+d}" if data["money1"] else "N/A"
+ print(f" Total: {line} ({odds})")
+
+ if not has_any_odds:
+ print("\n [No bookmaker odds available yet]")
+
+print("\n" + "=" * 120)
+print("SUMMARY")
+print("=" * 120)
+print(f"Total games: {len(preds)}")
+print(
+ f"Games with 5%+ edge: "
+ f"{len(preds[(abs(preds['spread_edge']) >= 0.05) | (abs(preds['total_edge']) >= 0.05)])}"
+)
+print("Bookmakers tracked: DraftKings, FanDuel, BetMGM, Overtime.ag")
+print("Markets: Spreads, Moneylines, Totals")
+print("=" * 120)
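The report above interpolates event IDs directly into its SQL. A parameterized variant of the same latest-odds query avoids that while preserving the SQLite bare-column behavior the original relies on (with a single `MAX()` aggregate, non-aggregated columns come from the row holding the max); a sketch against the same assumed `observations` schema:

```python
import sqlite3


def latest_odds_rows(conn: sqlite3.Connection, event_ids: list[str]) -> list[tuple]:
    """Latest observation per (event, book, market, outcome), parameterized."""
    # Only the placeholder count is interpolated; the values go via parameters
    placeholders = ",".join("?" for _ in event_ids)
    query = f"""
        SELECT event_id, book_key, market_key, outcome_name,
               point, price_american, MAX(book_last_update) AS last_update
        FROM observations
        WHERE event_id IN ({placeholders})
        GROUP BY event_id, book_key, market_key, outcome_name
    """
    return conn.execute(query, event_ids).fetchall()
```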
diff --git a/scripts/archive/2026-02-deprecated/create_team_mapping.py b/scripts/archive/2026-02-deprecated/create_team_mapping.py
new file mode 100644
index 000000000..6202ba0b2
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/create_team_mapping.py
@@ -0,0 +1,279 @@
+"""Create team name mapping across KenPom, ESPN, and Odds API.
+
+Uses fuzzy string matching to map team names across different data sources.
+
+Usage:
+ uv run python scripts/create_team_mapping.py
+"""
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+from thefuzz import fuzz, process
+
+from sports_betting_edge.adapters.filesystem import write_parquet
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def load_kenpom_teams(season: int = 2026) -> set[str]:
+ """Load unique KenPom team names.
+
+ Args:
+ season: Season year
+
+ Returns:
+ Set of team names from KenPom
+ """
+ kenpom_file = Path(f"data/kenpom/ratings/season/ratings_{season}.parquet")
+ if not kenpom_file.exists():
+ logger.warning(f"KenPom ratings not found: {kenpom_file}")
+ return set()
+
+ df = pd.read_parquet(kenpom_file)
+ teams = set(df["TeamName"].unique())
+ logger.info(f"Loaded {len(teams)} teams from KenPom")
+ return teams
+
+
+def load_espn_teams() -> set[str]:
+ """Load unique ESPN team names from schedule files.
+
+ Returns:
+ Set of team names from ESPN
+ """
+ espn_dir = Path("data/espn/schedule")
+ if not espn_dir.exists():
+ logger.warning(f"ESPN schedule directory not found: {espn_dir}")
+ return set()
+
+ teams = set()
+ parquet_files = list(espn_dir.glob("*.parquet"))
+
+ for file in parquet_files:
+ try:
+ df = pd.read_parquet(file)
+ teams.update(df["home_team"].unique())
+ teams.update(df["away_team"].unique())
+ except Exception as e:
+ logger.warning(f"Error reading {file}: {e}")
+
+ logger.info(f"Loaded {len(teams)} teams from ESPN ({len(parquet_files)} files)")
+ return teams
+
+
+def load_odds_api_teams() -> set[str]:
+ """Load unique Odds API team names from database.
+
+ Returns:
+ Set of team names from Odds API
+ """
+ import sqlite3
+
+ db_path = Path("data/odds_api/odds_api.sqlite3")
+ if not db_path.exists():
+ logger.warning(f"Odds API database not found: {db_path}")
+ return set()
+
+ conn = sqlite3.connect(str(db_path))
+
+ # Get teams from events table
+ query = """
+ SELECT DISTINCT home_team FROM events
+ UNION
+ SELECT DISTINCT away_team FROM events
+ """
+
+ teams_df = pd.read_sql_query(query, conn)
+ conn.close()
+
+ teams = set(teams_df.iloc[:, 0].unique())
+ logger.info(f"Loaded {len(teams)} teams from Odds API")
+ return teams
+
+
+def fuzzy_match_team(
+ team_name: str, candidates: list[str], threshold: int = 70
+) -> tuple[str | None, int]:
+ """Find best fuzzy match for a team name.
+
+ Prioritizes exact matches and prefix matches over fuzzy matches.
+
+ Args:
+ team_name: Team name to match
+ candidates: List of candidate team names
+ threshold: Minimum match score (0-100)
+
+ Returns:
+ (best_match, score) or (None, 0) if no match above threshold
+ """
+ if not candidates:
+ return None, 0
+
+ # Step 1: Try exact match
+ for candidate in candidates:
+ if team_name.lower() == candidate.lower():
+ return candidate, 100
+
+ # Step 2: Try exact match with normalized spaces/punctuation
+    normalized_team = team_name.lower().replace(".", "").replace("  ", " ").strip()
+    for candidate in candidates:
+        normalized_candidate = candidate.lower().replace(".", "").replace("  ", " ").strip()
+ if normalized_team == normalized_candidate:
+ return candidate, 100
+
+ # Step 3: Try prefix match (e.g., "Kentucky" at start of "Kentucky Wildcats")
+ for candidate in candidates:
+ if candidate.lower().startswith(team_name.lower() + " "):
+ # Verify it's not a different school with similar name
+ # e.g., "Alabama" shouldn't match "Alabama A&M"
+ words_after = candidate[len(team_name) :].strip().split()
+ # If next word is A&M, State, Tech, etc., different school
+ exclusions = ["a&m", "state", "tech", "international"]
+ if words_after and words_after[0].lower() not in exclusions:
+ return candidate, 95
+
+ # Step 4: Use fuzzy matching (but be careful)
+ result = process.extractOne(team_name, candidates, scorer=fuzz.token_sort_ratio)
+
+ if result and result[1] >= threshold:
+ # Extra validation: check it's not a completely different school
+ match_text = result[0].lower()
+ team_lower = team_name.lower()
+
+ # Don't match if the team name appears in the middle of another school's name
+ # e.g., "Duke" shouldn't match "James Madison Dukes"
+ if team_lower in match_text and not match_text.startswith(team_lower):
+ return None, 0
+
+ return result[0], result[1]
+
+ return None, 0
+
+
+def create_team_mapping() -> pd.DataFrame:
+ """Create comprehensive team name mapping.
+
+ Returns:
+ DataFrame with columns: kenpom_name, espn_name, odds_api_name, match_confidence
+ """
+ logger.info("Loading team names from all sources...")
+
+ kenpom_teams = load_kenpom_teams()
+ espn_teams = load_espn_teams()
+ odds_api_teams = load_odds_api_teams()
+
+ # Use KenPom as the canonical source (most standardized names)
+ mappings = []
+
+ for kenpom_name in sorted(kenpom_teams):
+ mapping = {
+ "kenpom_name": kenpom_name,
+ "espn_name": None,
+ "espn_match_score": 0,
+ "odds_api_name": None,
+ "odds_api_match_score": 0,
+ }
+
+ # Match to ESPN (lower threshold since ESPN has fewer teams)
+ espn_match, espn_score = fuzzy_match_team(kenpom_name, list(espn_teams), threshold=70)
+ if espn_match:
+ mapping["espn_name"] = espn_match
+ mapping["espn_match_score"] = espn_score
+
+ # Match to Odds API (lower threshold, Odds API has mascots)
+ odds_match, odds_score = fuzzy_match_team(
+ kenpom_name,
+ list(odds_api_teams),
+ threshold=60, # Lower because "Kentucky" vs "Kentucky Wildcats"
+ )
+ if odds_match:
+ mapping["odds_api_name"] = odds_match
+ mapping["odds_api_match_score"] = odds_score
+
+ mappings.append(mapping)
+
+ df = pd.DataFrame(mappings)
+
+ # Calculate overall match confidence
+ df["avg_match_score"] = (df["espn_match_score"] + df["odds_api_match_score"]) / 2
+
+ return df
+
+
+def validate_mappings(df: pd.DataFrame) -> None:
+ """Validate and report on mapping quality.
+
+ Args:
+ df: Team mapping DataFrame
+ """
+ total = len(df)
+
+ # Count matches
+ espn_matched = df["espn_name"].notna().sum()
+ odds_matched = df["odds_api_name"].notna().sum()
+ both_matched = ((df["espn_name"].notna()) & (df["odds_api_name"].notna())).sum()
+
+ logger.info("\n=== Mapping Quality ===")
+ logger.info(f"Total KenPom teams: {total}")
+ logger.info(f"Matched to ESPN: {espn_matched} ({espn_matched / total:.1%})")
+ logger.info(f"Matched to Odds API: {odds_matched} ({odds_matched / total:.1%})")
+ logger.info(f"Matched to both: {both_matched} ({both_matched / total:.1%})")
+
+ # Show low-confidence matches
+ low_confidence = df[
+ ((df["espn_match_score"] > 0) & (df["espn_match_score"] < 90))
+ | ((df["odds_api_match_score"] > 0) & (df["odds_api_match_score"] < 90))
+ ].sort_values("avg_match_score")
+
+ if len(low_confidence) > 0:
+ logger.info("\n=== Low Confidence Matches (score < 90) ===")
+ for _, row in low_confidence.head(10).iterrows():
+ logger.info(f"\nKenPom: {row['kenpom_name']}")
+ if row["espn_name"]:
+ logger.info(f" ESPN: {row['espn_name']} (score: {row['espn_match_score']})")
+ if row["odds_api_name"]:
+ logger.info(
+ f" Odds API: {row['odds_api_name']} (score: {row['odds_api_match_score']})"
+ )
+
+ # Show unmatched teams
+ unmatched = df[(df["espn_name"].isna()) & (df["odds_api_name"].isna())]
+
+ if len(unmatched) > 0:
+ logger.info(f"\n=== Unmatched Teams ({len(unmatched)}) ===")
+ for name in unmatched["kenpom_name"].head(10):
+ logger.info(f" - {name}")
+
+
+def main() -> None:
+ """Create and save team name mapping."""
+ logger.info("Creating team name mapping...")
+
+ # Create mapping
+ df = create_team_mapping()
+
+ # Validate
+ validate_mappings(df)
+
+ # Save to parquet (fill NaN with empty string for pyarrow compatibility)
+ output_path = Path("data/staging/mappings/team_mapping.parquet")
+    df_clean = df.fillna("")
+ write_parquet(df_clean.to_dict(orient="records"), output_path)
+
+ logger.info(f"\n[OK] Team mapping saved to {output_path}")
+ logger.info(f"Total mappings: {len(df)}")
+
+ # Show sample
+ logger.info("\n=== Sample Mappings ===")
+ sample = df[df["odds_api_name"].notna()].head(5)
+ for _, row in sample.iterrows():
+ logger.info(f"\nKenPom: {row['kenpom_name']}")
+ logger.info(f" ESPN: {row['espn_name']} (score: {row['espn_match_score']})")
+ logger.info(f" Odds API: {row['odds_api_name']} (score: {row['odds_api_match_score']})")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/demo_bookmaker_accuracy.py b/scripts/archive/2026-02-deprecated/demo_bookmaker_accuracy.py
new file mode 100644
index 000000000..40839cc5d
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/demo_bookmaker_accuracy.py
@@ -0,0 +1,229 @@
+"""Demonstration script for bookmaker accuracy analysis.
+
+This script shows the bookmaker accuracy backtesting system, including:
+- Spread accuracy metrics (MAE, RMSE, directional accuracy)
+- Totals accuracy metrics
+- Systematic bias detection
+- Bookmaker rankings
+
+Usage:
+ uv run python scripts/demo_bookmaker_accuracy.py
+"""
+
+import logging
+from datetime import date
+from pathlib import Path
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.services.bookmaker_accuracy import BookmakerAccuracyAnalyzer
+
+logging.basicConfig(level=logging.INFO, format="%(message)s")
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Demonstrate bookmaker accuracy analysis capabilities."""
+ # Paths
+ odds_db_path = Path("data/odds_api/odds_api.sqlite3")
+
+ # Check if data exists
+ if not odds_db_path.exists():
+ logger.warning(f"Database not found: {odds_db_path}")
+ logger.info("Run collect_hybrid.py first to gather data")
+ return
+
+ logger.info("=" * 70)
+ logger.info("BOOKMAKER ACCURACY BACKTESTING DEMO")
+ logger.info("=" * 70)
+ logger.info("")
+
+ # Initialize
+ db = OddsAPIDatabase(odds_db_path)
+ analyzer = BookmakerAccuracyAnalyzer(db)
+
+ # Get database stats
+ logger.info("[OK] Database Coverage Statistics:")
+ stats = db.get_database_stats()
+ logger.info(f" - Total events: {stats['total_events']}")
+ logger.info(f" - Events with scores: {stats['events_with_scores']}")
+ logger.info(f" - Coverage: {stats['events_with_scores'] / max(stats['total_events'], 1):.1%}")
+ logger.info(f" - Date range: {stats['date_range'][0]} to {stats['date_range'][1]}")
+ logger.info("")
+
+ if stats["events_with_scores"] < 10:
+ logger.warning("[WARNING] Insufficient data for meaningful analysis")
+ logger.info("Need at least 10 completed games with scores")
+ return
+
+ # Show top bookmakers by coverage
+ logger.info("[OK] Top Bookmakers by Coverage:")
+ for i, book in enumerate(stats["bookmaker_coverage"][:5], 1):
+ logger.info(
+ f" {i}. {book['book_key']:20s} "
+ f"{book['games_covered']:>4} games ({book['coverage_pct']:.1f}%)"
+ )
+ logger.info("")
+
+ # Analyze spread accuracy for top bookmaker
+ if stats["bookmaker_coverage"]:
+ top_book = stats["bookmaker_coverage"][0]
+ book_key = top_book["book_key"]
+ games_covered = top_book["games_covered"]
+
+ if games_covered >= 30:
+ logger.info(f"[OK] Analyzing Spread Accuracy: {book_key}")
+ logger.info("-" * 70)
+
+ try:
+ # Parse date range
+ start_date = date.fromisoformat(stats["date_range"][0])
+ end_date = date.fromisoformat(stats["date_range"][1])
+
+ metrics = analyzer.calculate_spread_accuracy(
+ book_key=book_key,
+ start_date=start_date,
+ end_date=end_date,
+ min_games=10,
+ )
+
+ logger.info(f"Sample size: {metrics['sample_size']} games")
+ logger.info(f"Mean Absolute Error (MAE): {metrics['mae']:.2f} points")
+ logger.info(f"Root Mean Squared Error (RMSE): {metrics['rmse']:.2f} points")
+ logger.info(f"Favorite cover rate: {metrics['favorite_cover_pct']:.1%}")
+
+ # Interpretation
+ logger.info("")
+ if metrics["mae"] < 8.0:
+ logger.info("[OK] ACCURACY: Excellent (MAE < 8 points)")
+ elif metrics["mae"] < 10.0:
+ logger.info("[OK] ACCURACY: Good (MAE < 10 points)")
+ else:
+                    logger.info("[WARNING] ACCURACY: Below average (MAE >= 10 points)")
+
+ logger.info("")
+
+ # Detect biases
+ logger.info(f"[OK] Checking for Systematic Biases: {book_key}")
+ logger.info("-" * 70)
+
+ biases = analyzer.detect_systematic_biases(
+ book_key=book_key,
+ market_type="spreads",
+ start_date=start_date,
+ end_date=end_date,
+ )
+
+ if biases["overestimates_favorites"]:
+ logger.info("[WARNING] BIAS DETECTED: Overestimates favorites")
+ logger.info(
+ f" -> Favorites cover only {biases.get('favorite_cover_pct', 0):.1%} "
+ "(expected: ~50%)"
+ )
+ logger.info(" -> EDGE OPPORTUNITY: Bet underdogs at this book")
+ elif biases["overestimates_underdogs"]:
+ logger.info("[WARNING] BIAS DETECTED: Overestimates underdogs")
+ logger.info(
+ f" -> Favorites cover {biases.get('favorite_cover_pct', 0):.1%} "
+ "(expected: ~50%)"
+ )
+ logger.info(" -> EDGE OPPORTUNITY: Bet favorites at this book")
+ else:
+ logger.info("[OK] No significant bias detected (within 48-52%)")
+
+ logger.info("")
+
+ except Exception as e:
+ logger.error(f"[ERROR] Analysis failed: {e}")
+
+ # Try totals analysis
+ if games_covered >= 30:
+ logger.info(f"[OK] Analyzing Totals Accuracy: {book_key}")
+ logger.info("-" * 70)
+
+ try:
+ metrics = analyzer.calculate_totals_accuracy(
+ book_key=book_key,
+ start_date=start_date,
+ end_date=end_date,
+ min_games=10,
+ )
+
+ logger.info(f"Sample size: {metrics['sample_size']} games")
+ logger.info(f"Mean Absolute Error (MAE): {metrics['mae']:.2f} points")
+ logger.info(f"Root Mean Squared Error (RMSE): {metrics['rmse']:.2f} points")
+ logger.info(f"Over percentage: {metrics['over_pct']:.1%}")
+ logger.info("")
+
+ # Check for bias
+ if abs(metrics["over_pct"] - 0.5) > 0.02: # >2% deviation
+ if metrics["over_pct"] < 0.48:
+ logger.info("[WARNING] BIAS: Games go under more than expected")
+ logger.info(" -> EDGE OPPORTUNITY: Bet unders at this book")
+ elif metrics["over_pct"] > 0.52:
+ logger.info("[WARNING] BIAS: Games go over more than expected")
+ logger.info(" -> EDGE OPPORTUNITY: Bet overs at this book")
+ else:
+ logger.info("[OK] No significant bias in totals")
+
+ logger.info("")
+
+ except Exception as e:
+ logger.debug(f"Totals analysis skipped: {e}")
+
+ # Try ranking bookmakers if we have multiple
+ if len(stats["bookmaker_coverage"]) >= 2:
+ logger.info("[OK] Ranking All Bookmakers (Spreads)")
+ logger.info("-" * 70)
+
+ try:
+ rankings = analyzer.rank_bookmakers(
+ market_type="spreads",
+ metric="mae",
+ min_games=10,
+ )
+
+ if len(rankings) > 0:
+ logger.info(
+ f"{'Rank':<6}{'Bookmaker':<20}{'MAE':<10}{'RMSE':<10}"
+ f"{'Fav Cover %':<15}{'Games':<8}"
+ )
+ logger.info("-" * 70)
+
+ for _, row in rankings.head(10).iterrows():
+ logger.info(
+ f"{int(row['rank']):<6}{row['book_key']:<20}"
+ f"{row['mae']:<10.2f}{row['rmse']:<10.2f}"
+ f"{row['favorite_cover_pct']:<15.1%}{int(row['sample_size']):<8}"
+ )
+
+ logger.info("")
+ logger.info(f"[OK] Most accurate: {rankings.iloc[0]['book_key']}")
+ logger.info(f"[OK] Least accurate: {rankings.iloc[-1]['book_key']}")
+
+ except Exception as e:
+ logger.debug(f"Ranking failed: {e}")
+
+ # Summary
+ logger.info("")
+ logger.info("=" * 70)
+ logger.info("[OK] Bookmaker accuracy analysis complete!")
+ logger.info("")
+ logger.info("KEY INSIGHTS:")
+ logger.info(" 1. MAE (Mean Absolute Error): Measures average prediction error")
+ logger.info(" - Lower is better (sharp books typically have MAE < 8 points)")
+ logger.info("")
+ logger.info(" 2. Directional Accuracy: Should be close to 50% (efficient market)")
+ logger.info(" - Deviations >2% indicate potential exploitable bias")
+ logger.info("")
+ logger.info(" 3. EDGE IDENTIFICATION:")
+ logger.info(" - If book overestimates favorites -> bet underdogs")
+ logger.info(" - If book overestimates underdogs -> bet favorites")
+ logger.info(" - If book overestimates overs -> bet unders")
+ logger.info(" - If book overestimates unders -> bet overs")
+ logger.info("=" * 70)
+
+ db.close()
+
+
+if __name__ == "__main__":
+ main()
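The MAE, RMSE, and favorite-cover-rate numbers that `calculate_spread_accuracy` reports can be sketched with fictional lines and margins (the analyzer's real queries live in `BookmakerAccuracyAnalyzer`; the sign convention here, negative spread = home favored, is an assumption for illustration):

```python
import math

# Fictional (closing_home_spread, actual_home_margin) pairs; a negative
# spread means the home team was favored by that many points.
games = [(-6.5, 10), (-3.0, -2), (4.5, -1), (-8.0, 5), (2.5, 7)]

# The home margin implied by the closing line is the negated home spread.
errors = [actual - (-spread) for spread, actual in games]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))


def favorite_covered(spread: float, actual: float) -> bool:
    """A favorite covers when it wins by more than the spread magnitude."""
    if spread < 0:              # home is the favorite
        return actual > -spread
    return -actual > spread     # away is the favorite


covers = sum(favorite_covered(s, m) for s, m in games)
favorite_cover_pct = covers / len(games)
```

On these five fictional games MAE works out to 4.9 points and the favorite cover rate to 20%, which is why the script insists on a `min_games` floor before drawing conclusions.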
diff --git a/scripts/archive/2026-02-deprecated/demo_market_features.py b/scripts/archive/2026-02-deprecated/demo_market_features.py
new file mode 100644
index 000000000..a19dd0101
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/demo_market_features.py
@@ -0,0 +1,115 @@
+"""Demonstration script for new market inefficiency features.
+
+This script shows the expanded feature set for ML models, including:
+- Sharp vs public divergence (Pinnacle vs FanDuel/DraftKings)
+- Steam move detection and line movement velocity
+- Market consensus and variance
+- Key number positioning
+
+Usage:
+ uv run python scripts/demo_market_features.py
+"""
+
+import logging
+from pathlib import Path
+
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+logging.basicConfig(level=logging.INFO, format="%(message)s")
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Demonstrate new market inefficiency features."""
+ # Paths
+ kenpom_path = Path("data/kenpom")
+ espn_path = Path("data/espn")
+ odds_db_path = Path("data/odds_api/odds_api.sqlite3")
+
+ # Check if data exists
+ if not odds_db_path.exists():
+ logger.warning(f"Database not found: {odds_db_path}")
+ logger.info("Run collect_hybrid.py first to gather data")
+ return
+
+ logger.info("[OK] Initializing FeatureEngineer with market features enabled...")
+
+ # Initialize
+ engineer = FeatureEngineer(
+ kenpom_path=kenpom_path, espn_path=espn_path, odds_db_path=odds_db_path
+ )
+
+ # Build dataset with all features
+ logger.info("\n[OK] Building spreads dataset (2025-12-01 to 2025-12-15)...")
+ logger.info("Features enabled: KenPom + bookmaker divergence + market inefficiency")
+
+ try:
+ X, y = engineer.build_spreads_dataset(
+ start_date="2025-12-01",
+ end_date="2025-12-15",
+ include_market_features=True,
+ include_bookmaker_features=True,
+ )
+
+ logger.info(f"\n[OK] Dataset shape: {X.shape[0]} games x {X.shape[1]} features")
+
+ # Show feature categories
+ logger.info("\nFeature categories:")
+
+ kenpom_features = [c for c in X.columns if "adj_" in c or "efg" in c or "sos" in c]
+ logger.info(f" - KenPom features: {len(kenpom_features)}")
+
+ bookmaker_features = [
+ c for c in X.columns if any(book in c for book in ["pinnacle", "fanduel", "draftkings"])
+ ]
+ logger.info(f" - Bookmaker-specific: {len(bookmaker_features)}")
+
+ market_features = [
+ c
+ for c in X.columns
+ if any(
+ keyword in c
+ for keyword in ["steam", "velocity", "consensus", "variance", "key_number"]
+ )
+ ]
+ logger.info(f" - Market signals: {len(market_features)}")
+
+ # Show sample of new features
+ logger.info("\nNew feature columns (sample):")
+ new_feature_cols = [
+ "pinnacle_closing_spread",
+ "fanduel_closing_spread",
+ "sharp_public_split",
+ "total_steam_moves",
+ "movement_velocity",
+ "spread_variance",
+ "near_key_number",
+ ]
+ for col in new_feature_cols:
+ if col in X.columns:
+ non_null = X[col].notna().sum()
+ logger.info(f" - {col:30s} ({non_null}/{len(X)} non-null)")
+
+ # Check target distribution
+ logger.info("\nTarget variable (favorite_covered):")
+ logger.info(f" - Favorites covered: {y.sum()} games")
+ logger.info(f" - Favorites failed: {len(y) - y.sum()} games")
+ logger.info(f" - Win rate: {y.mean():.1%}")
+
+ # Summary
+ logger.info("\n" + "=" * 60)
+ logger.info(f"[OK] Successfully built dataset with {X.shape[1]} features!")
+ logger.info(f" Feature count increased from ~30 to {X.shape[1]} features")
+ logger.info(" Dataset ready for XGBoost training")
+ logger.info("=" * 60)
+
+ except Exception as e:
+ logger.error(f"[ERROR] Failed to build dataset: {e}")
+ raise
+
+ finally:
+ engineer.close()
+
+
+if __name__ == "__main__":
+ main()
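"Steam moves" in the feature list above refer to sharp, fast line movement across the market. The detector in `feature_engineering` is not shown in this hunk, so this is only a sketch of one common definition (a move of at least a point inside a 30-minute window); the thresholds, snapshot shape, and helper name are assumptions:

```python
from datetime import datetime, timedelta

# Hypothetical (timestamp, consensus_spread) snapshots for one game.
snaps = [
    (datetime(2025, 12, 5, 9, 0), -3.0),
    (datetime(2025, 12, 5, 9, 20), -3.0),
    (datetime(2025, 12, 5, 9, 35), -4.5),   # 1.5-point move in 15 minutes
    (datetime(2025, 12, 5, 12, 0), -4.5),
]


def count_steam_moves(
    snaps: list[tuple[datetime, float]],
    min_move: float = 1.0,
    window: timedelta = timedelta(minutes=30),
) -> int:
    """Count consecutive snapshots that moved >= min_move points within window."""
    return sum(
        1
        for (t0, s0), (t1, s1) in zip(snaps, snaps[1:])
        if t1 - t0 <= window and abs(s1 - s0) >= min_move
    )


total_steam_moves = count_steam_moves(snaps)
```

Only the 9:20 -> 9:35 jump qualifies here; the later drift is too slow to count as steam.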
diff --git a/scripts/archive/2026-02-deprecated/demo_tracker.py b/scripts/archive/2026-02-deprecated/demo_tracker.py
new file mode 100644
index 000000000..d5ac5b3ce
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/demo_tracker.py
@@ -0,0 +1,59 @@
+"""
+Demo script showing betting tracker functionality with sample results.
+
+This demonstrates the full workflow with fictional game scores.
+"""
+
+import logging
+from pathlib import Path
+
+from betting_tracker import BettingTracker
+
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Run demo with sample results."""
+ logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+
+ print("\n" + "=" * 70)
+ print("BETTING TRACKER DEMO")
+ print("=" * 70)
+
+ # Initialize tracker
+ predictions_path = Path("data/analysis/combined_predictions_2026-02-03.csv")
+ tracker = BettingTracker(predictions_path, unit_size=100.0)
+
+ print("\n[INFO] Loaded predictions. Adding sample results...\n")
+
+ # Add some sample results (fictional scores for demo purposes)
+ sample_results = [
+ ("Miami Ohio", "Buffalo", 72, 85), # Buffalo wins by 13
+ ("Akron", "Eastern Michigan", 68, 82), # EMU wins by 14
+ ("Canisius", "Niagara", 75, 71), # Canisius wins by 4
+ ("Xavier", "Connecticut", 65, 88), # UConn wins by 23
+ ("Boston College", "Duke", 58, 92), # Duke wins by 34
+ ]
+
+ for away, home, away_score, home_score in sample_results:
+ tracker.add_game_result(away, home, away_score, home_score)
+
+ print("\n" + "-" * 70)
+ print("SAMPLE RESULTS ANALYSIS")
+ print("-" * 70)
+
+ # Show summary
+ tracker.print_summary()
+
+ # Save tracked results
+ output_path = tracker.save_results()
+ print(f"\n[OK] Demo results saved to {output_path}")
+
+ print("\n[INFO] To continue tracking:")
+ print(" 1. Use enter_results.py for interactive entry")
+ print(" 2. Or create CSV with scores and use --import-csv")
+ print(" 3. Run betting_dashboard.py for detailed analytics")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/export_complete_dataset.py b/scripts/archive/2026-02-deprecated/export_complete_dataset.py
new file mode 100644
index 000000000..354824b67
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/export_complete_dataset.py
@@ -0,0 +1,150 @@
+"""Export complete dataset with odds and scores to Parquet.
+
+Extracts all events with both betting odds and final scores, including:
+- Opening and closing lines (spreads, totals, moneylines)
+- Consensus metrics across bookmakers
+- Closing Line Value (CLV) calculations
+- Market efficiency indicators
+
+The exported dataset is ML-ready for model training and backtesting.
+
+Usage:
+ uv run python scripts/export_complete_dataset.py
+
+ # Custom output path
+ uv run python scripts/export_complete_dataset.py --output data/ml_ready_dataset.parquet
+
+ # Export as CSV instead
+ uv run python scripts/export_complete_dataset.py --format csv
+"""
+
+import argparse
+import logging
+import sqlite3
+from pathlib import Path
+
+import pandas as pd
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def export_complete_dataset(
+ db_path: Path,
+ query_path: Path,
+ output_path: Path,
+ output_format: str = "parquet",
+) -> None:
+ """Export complete dataset to file.
+
+ Args:
+ db_path: Path to SQLite database
+ query_path: Path to SQL query file
+ output_path: Path to save output file
+ output_format: Output format (parquet or csv)
+ """
+ logger.info(f"Reading query from {query_path}")
+ query = query_path.read_text()
+
+ logger.info(f"Executing query against {db_path}")
+ conn = sqlite3.connect(str(db_path))
+ df = pd.read_sql_query(query, conn)
+ conn.close()
+
+ logger.info(f"Query returned {len(df)} games with complete data")
+
+ if len(df) == 0:
+ logger.warning("No complete data found - nothing to export")
+ return
+
+ # Show summary statistics
+ logger.info(f"Date range: {df['game_date'].min()} to {df['game_date'].max()}")
+ logger.info(f"Unique teams: {pd.concat([df['home_team'], df['away_team']]).nunique()}")
+ logger.info(f"Average books per game (spread): {df['num_books_spread'].mean():.1f}")
+ logger.info(f"Average books per game (total): {df['num_books_total'].mean():.1f}")
+
+ # Market coverage (normalized column names)
+ spread_coverage = df["consensus_closing_spread_magnitude"].notna().sum()
+ total_coverage = df["consensus_closing_total"].notna().sum()
+ ml_coverage = df["closing_home_implied_prob"].notna().sum()
+
+ logger.info("Market coverage:")
+ logger.info(f" Spreads: {spread_coverage}/{len(df)} ({spread_coverage / len(df) * 100:.1f}%)")
+ logger.info(f" Totals: {total_coverage}/{len(df)} ({total_coverage / len(df) * 100:.1f}%)")
+ logger.info(f" Moneylines: {ml_coverage}/{len(df)} ({ml_coverage / len(df) * 100:.1f}%)")
+
+ # CLV summary (normalized)
+ logger.info("Closing Line Value (CLV) summary:")
+ logger.info(f" Mean home spread CLV: {df['home_spread_clv'].mean():.2f} points")
+ logger.info(f" Home cover rate: {df['home_covered_spread'].mean() * 100:.1f}%")
+ logger.info(f" Over rate: {df['went_over'].mean() * 100:.1f}%")
+ logger.info(f" Mean total error: {df['total_error'].mean():.2f} points")
+
+ # Save to file
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+
+ if output_format == "parquet":
+ df.to_parquet(output_path, index=False)
+ logger.info(f"[OK] Exported {len(df)} games to {output_path}")
+ elif output_format == "csv":
+ df.to_csv(output_path, index=False)
+ logger.info(f"[OK] Exported {len(df)} games to {output_path}")
+ else:
+ raise ValueError(f"Unsupported format: {output_format}")
+
+ # Show column list
+ logger.info(f"\nDataset contains {len(df.columns)} columns:")
+ for col in df.columns:
+ logger.info(f" - {col}")
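The `home_spread_clv` column summarized above follows the usual closing-line-value idea: compare the line you could have bet to where it closed. One common sign convention, sketched below, credits the home side when the market moves toward it; the exported column's exact convention may differ:

```python
def home_spread_clv(opening_home_spread: float, closing_home_spread: float) -> float:
    """Points of closing-line value captured on the home side (one convention)."""
    return opening_home_spread - closing_home_spread


# Take the home side at -4.5; the line closes at -6.0. The market moved
# 1.5 points toward the home team, so the earlier number beat the close.
clv = home_spread_clv(-4.5, -6.0)
```

Betting the home side at the worse number reverses the sign: laying -6.0 into a -4.5 close gives up 1.5 points of value.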
+
+
+def main() -> None:
+ """Run dataset export."""
+ parser = argparse.ArgumentParser(description="Export complete dataset to Parquet or CSV")
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--query",
+ type=Path,
+ default=Path("sql/query_complete_dataset_normalized.sql"),
+ help="Path to SQL query file",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("data/complete_dataset.parquet"),
+ help="Path to output file",
+ )
+ parser.add_argument(
+ "--format",
+ choices=["parquet", "csv"],
+ default="parquet",
+ help="Output format (default: parquet)",
+ )
+
+ args = parser.parse_args()
+
+ # Validate inputs
+ if not args.db.exists():
+ logger.error(f"Database not found: {args.db}")
+ return
+
+ if not args.query.exists():
+ logger.error(f"Query file not found: {args.query}")
+ return
+
+ # Export dataset
+ export_complete_dataset(
+ db_path=args.db,
+ query_path=args.query,
+ output_path=args.output,
+ output_format=args.format,
+ )
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/fix_complete_analysis.py b/scripts/archive/2026-02-deprecated/fix_complete_analysis.py
new file mode 100644
index 000000000..d3eeac925
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/fix_complete_analysis.py
@@ -0,0 +1,178 @@
+#!/usr/bin/env python3
+"""Fix complete_analysis file by replacing incorrect KenPom data.
+
+Loads the existing analysis file and replaces all KenPom ratings with
+correct season 2026 data from the database.
+"""
+
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+from rich.console import Console
+
+from sports_betting_edge.utils.team_matching import match_to_kenpom
+
+DB_PATH = Path("data/odds_api/odds_api.sqlite3")
+SEASON = 2026
+
+console = Console()
+
+
+def fix_analysis() -> None:
+ """Fix the complete analysis file."""
+ today = datetime.now().date().isoformat()
+ input_path = Path(f"data/analysis/complete_analysis_{today}_main_lines.csv")
+ output_path = Path(f"data/analysis/complete_analysis_{today}_CORRECTED.csv")
+
+ console.print(f"\n[bold cyan]Fixing Analysis for {today}[/bold cyan]\n")
+
+ if not input_path.exists():
+ console.print(f"[red][ERROR] File not found: {input_path}[/red]")
+ return
+
+ # Load existing analysis
+ console.print(f"[1/3] Loading existing analysis from {input_path}...")
+ df = pd.read_csv(input_path)
+ console.print(f" [OK] Loaded {len(df)} games")
+
+ # Load correct KenPom data for season 2026
+ console.print(f"\n[2/3] Loading correct KenPom ratings (season {SEASON})...")
+ conn = sqlite3.connect(DB_PATH)
+ kenpom_query = f"""
+ SELECT team, adj_em, adj_o, adj_d, adj_t, rank
+ FROM kp_pomeroy_ratings
+ WHERE season = {SEASON}
+ ORDER BY team
+ """
+ kenpom_df = pd.read_sql_query(kenpom_query, conn)
+ conn.close()
+
+ kenpom_teams = kenpom_df["team"].tolist()
+ kenpom_lookup = kenpom_df.set_index("team").to_dict(orient="index")
+ console.print(f" [OK] Loaded {len(kenpom_df)} teams")
+
+ # Fix each game
+ console.print("\n[3/3] Replacing incorrect KenPom data...")
+ fixed_count = 0
+ failed_count = 0
+ comparison = []
+
+ for idx, row in df.iterrows():
+ away_team = row["away_team"]
+ home_team = row["home_team"]
+
+ # Match to KenPom
+ away_kp = match_to_kenpom(away_team, kenpom_teams)
+ home_kp = match_to_kenpom(home_team, kenpom_teams)
+
+ if away_kp is None or home_kp is None:
+ console.print(f" [WARN] Failed to match: {away_team} @ {home_team}")
+ failed_count += 1
+ continue
+
+ # Get correct data
+ away_data = kenpom_lookup.get(away_kp, {})
+ home_data = kenpom_lookup.get(home_kp, {})
+
+ if not away_data or not home_data:
+ console.print(f" [WARN] No KenPom data for: {away_team} @ {home_team}")
+ failed_count += 1
+ continue
+
+ # Store old values for comparison
+ old_home_em = row.get("home_adjem")
+ old_away_em = row.get("away_adjem")
+ old_margin = row.get("kenpom_margin")
+
+ # Replace with correct data
+ df.at[idx, "away_adjoe"] = away_data.get("adj_o")
+ df.at[idx, "away_adjde"] = away_data.get("adj_d")
+ df.at[idx, "away_adjem"] = away_data.get("adj_em")
+ df.at[idx, "away_tempo"] = away_data.get("adj_t")
+
+ df.at[idx, "home_adjoe"] = home_data.get("adj_o")
+ df.at[idx, "home_adjde"] = home_data.get("adj_d")
+ df.at[idx, "home_adjem"] = home_data.get("adj_em")
+ df.at[idx, "home_tempo"] = home_data.get("adj_t")
+
+ # Recalculate KenPom margin
+ new_margin = home_data.get("adj_em", 0) - away_data.get("adj_em", 0)
+ df.at[idx, "kenpom_margin"] = new_margin
+
+ # Recalculate edge if we have spread
+ if pd.notna(row.get("home_spread")):
+ market_margin = -row["home_spread"]
+ discrepancy = new_margin - market_margin
+ df.at[idx, "discrepancy"] = discrepancy
+ df.at[idx, "abs_discrepancy"] = abs(discrepancy)
+
+ # Track significant changes
+ if pd.notna(old_margin) and pd.notna(new_margin):
+ margin_change = new_margin - old_margin
+ if abs(margin_change) > 5: # Changed by more than 5 points
+ comparison.append(
+ {
+ "game": f"{away_team} @ {home_team}",
+ "old_away_em": old_away_em,
+ "new_away_em": away_data.get("adj_em"),
+ "old_home_em": old_home_em,
+ "new_home_em": home_data.get("adj_em"),
+ "old_margin": old_margin,
+ "new_margin": new_margin,
+ "change": margin_change,
+ }
+ )
+
+ fixed_count += 1
+
+ console.print(f" [OK] Fixed {fixed_count} games")
+ if failed_count > 0:
+ console.print(f" [WARN] Failed to fix {failed_count} games")
+
+ # Save corrected data
+ df.to_csv(output_path, index=False)
+ console.print(f"\n[OK] Saved corrected analysis to {output_path}")
+
+ # Show significant changes
+ if comparison:
+ console.print("\n" + "=" * 80)
+ console.print("[bold]SIGNIFICANT CORRECTIONS (>5 point margin change):[/bold]")
+ console.print("=" * 80 + "\n")
+
+ for c in comparison:
+ console.print(f"\n{c['game']}")
+ console.print(f" Away AdjEM: {c['old_away_em']:+.1f} → {c['new_away_em']:+.1f}")
+ console.print(f" Home AdjEM: {c['old_home_em']:+.1f} → {c['new_home_em']:+.1f}")
+ console.print(f" KenPom Margin: {c['old_margin']:+.1f} → {c['new_margin']:+.1f}")
+ console.print(f" [bold]Change: {c['change']:+.1f} points[/bold]")
+
+ # Summary
+ console.print("\n" + "=" * 80)
+ console.print("[bold]CORRECTED EDGES:[/bold]")
+ console.print("=" * 80)
+
+ if "abs_discrepancy" in df.columns:
+ edges_df = df[df["abs_discrepancy"] >= 3.5].sort_values("abs_discrepancy", ascending=False)
+
+ console.print(f"\nGames with 3.5+ point edges: {len(edges_df)}")
+
+ if len(edges_df) > 0:
+ console.print("\n[bold]Top 10 Edges:[/bold]\n")
+ for i, (_, game) in enumerate(edges_df.head(10).iterrows(), 1):
+ spread = game.get("home_spread", 0)
+ kp_margin = game.get("kenpom_margin", 0)
+ edge = game.get("abs_discrepancy", 0)
+
+ console.print(f"{i:2d}. {game['away_team']:25s} @ {game['home_team']:25s}")
+ console.print(
+ f" Market: {spread:+.1f} | KenPom: {kp_margin:+.1f} | "
+ f"[bold green]Edge: {edge:.1f} pts[/bold green]"
+ )
+
+ console.print("\n[bold green][OK] Analysis corrected![/bold green]\n")
+
+
+if __name__ == "__main__":
+ fix_analysis()
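The recalculation above negates `home_spread` to get the market's implied home margin before differencing it against the KenPom margin. A worked example with fictional numbers (variable names mirror the columns used above):

```python
# Sign convention used by the analysis: home_spread is quoted from the home
# perspective, so -6.5 means the home team is favored by 6.5 points.
home_spread = -6.5
market_margin = -home_spread                  # market implies home wins by 6.5

kenpom_margin = 9.0                           # model: home is 9.0 points better
discrepancy = kenpom_margin - market_margin   # +2.5 -> model leans home side
abs_discrepancy = abs(discrepancy)            # below the 3.5-point edge cutoff
```

A discrepancy of +2.5 would not clear the 3.5-point bar the script uses to flag edges, which is the point: most corrected games should land inside the market's number.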
diff --git a/scripts/archive/2026-02-deprecated/fix_team_mapping.py b/scripts/archive/2026-02-deprecated/fix_team_mapping.py
new file mode 100644
index 000000000..b8ca3d2cc
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/fix_team_mapping.py
@@ -0,0 +1,319 @@
+"""Fix team mapping issues by finding unmapped teams and suggesting mappings.
+
+Usage:
+ # List unmapped teams
+ uv run python scripts/fix_team_mapping.py --list
+
+ # Auto-fix simple cases (exact matches with different formatting)
+ uv run python scripts/fix_team_mapping.py --auto-fix
+
+ # Interactive mode for manual mapping
+ uv run python scripts/fix_team_mapping.py --interactive
+"""
+
+import argparse
+import logging
+import sqlite3
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.core.team_mapper import TeamMapper
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def get_unmapped_teams(
+ odds_db_path: Path,
+ team_mapping_path: Path,
+) -> tuple[list[str], TeamMapper]:
+ """Get list of unmapped teams from Odds API.
+
+ Args:
+ odds_db_path: Path to Odds API database
+ team_mapping_path: Path to team mapping parquet
+
+ Returns:
+ (unmapped_teams, team_mapper)
+ """
+ # Load team mapping
+ mapping_df = pd.read_parquet(team_mapping_path)
+ mapper = TeamMapper(mapping_df)
+
+ # Get all teams from Odds API
+ conn = sqlite3.connect(str(odds_db_path))
+ query = """
+ SELECT DISTINCT home_team as team FROM events
+ UNION
+ SELECT DISTINCT away_team as team FROM events
+ ORDER BY team
+ """
+ odds_teams = pd.read_sql_query(query, conn)["team"].tolist()
+ conn.close()
+
+ # Find unmapped teams
+ unmapped = []
+ for team in odds_teams:
+ kenpom_name = mapper.get_kenpom_name(team, source="odds_api")
+ # If kenpom_name equals team, it means no mapping was found
+        # The mapper falls back to the input name when no mapping exists, so a
+        # pass-through counts as unmapped unless the name is itself canonical
+ unmapped.append(team)
+
+ return unmapped, mapper
+
+
+def suggest_kenpom_match(
+ odds_team: str,
+ kenpom_teams: list[str],
+) -> list[tuple[str, float]]:
+ """Suggest potential KenPom matches for an Odds API team.
+
+ Args:
+ odds_team: Odds API team name
+ kenpom_teams: List of KenPom team names
+
+ Returns:
+ List of (kenpom_name, similarity_score) tuples, sorted by score
+ """
+ from difflib import SequenceMatcher
+
+ matches = []
+ for kenpom_team in kenpom_teams:
+ # Calculate similarity
+ ratio = SequenceMatcher(None, odds_team.lower(), kenpom_team.lower()).ratio()
+ matches.append((kenpom_team, ratio))
+
+ # Sort by similarity (highest first)
+ matches.sort(key=lambda x: x[1], reverse=True)
+
+ return matches[:5] # Top 5 matches
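`SequenceMatcher.ratio` is a character-level measure on a 0-1 scale, which gives the 0.8 auto-fix threshold a concrete meaning: formatting-only differences clear it, while mascot suffixes do not and fall through to interactive review. A quick sketch with illustrative names:

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Case-insensitive character-level similarity on a 0-1 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Punctuation-only differences score well above the 0.8 auto-fix bar...
r_close = ratio("St. John's", "St Johns")

# ...while a mascot suffix drags the ratio into the 0.6s, below the bar.
r_far = ratio("Kentucky Wildcats", "Kentucky")
```

This is why `--auto-fix` at the default threshold handles spelling variants safely but leaves suffix mismatches for `--interactive`.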
+
+
+def auto_fix_mappings(
+ unmapped_teams: list[str],
+ kenpom_teams: list[str],
+ mapping_df: pd.DataFrame,
+ threshold: float = 0.8,
+) -> pd.DataFrame:
+ """Automatically fix mappings with high similarity scores.
+
+ Args:
+ unmapped_teams: List of unmapped Odds API teams
+ kenpom_teams: List of KenPom team names
+ mapping_df: Current mapping DataFrame
+ threshold: Minimum similarity score to auto-fix
+
+ Returns:
+ Updated mapping DataFrame
+ """
+ new_mappings = []
+
+ for odds_team in unmapped_teams:
+ matches = suggest_kenpom_match(odds_team, kenpom_teams)
+ best_match, score = matches[0]
+
+ if score >= threshold:
+ logger.info(f"Auto-mapping: {odds_team} -> {best_match} (score: {score:.3f})")
+ new_mappings.append(
+ {
+ "kenpom_name": best_match,
+ "odds_api_name": odds_team,
+ "espn_name": "", # Will need to be filled manually
+ }
+ )
+ else:
+ logger.warning(
+ f"No confident match for: {odds_team} (best: {best_match}, score: {score:.3f})"
+ )
+
+ if new_mappings:
+ new_df = pd.DataFrame(new_mappings)
+ updated_df = pd.concat([mapping_df, new_df], ignore_index=True)
+ logger.info(f"Added {len(new_mappings)} automatic mappings")
+ return updated_df
+ else:
+ logger.info("No automatic mappings added (no matches above threshold)")
+ return mapping_df
+
+
+def interactive_mapping(
+ unmapped_teams: list[str],
+ kenpom_teams: list[str],
+ mapping_df: pd.DataFrame,
+) -> pd.DataFrame:
+ """Interactively map teams with user input.
+
+ Args:
+ unmapped_teams: List of unmapped Odds API teams
+ kenpom_teams: List of KenPom team names
+ mapping_df: Current mapping DataFrame
+
+ Returns:
+ Updated mapping DataFrame
+ """
+ new_mappings = []
+
+ for odds_team in unmapped_teams:
+ print(f"\n{'=' * 80}")
+ print(f"Odds API team: {odds_team}")
+ print(f"{'=' * 80}")
+
+ # Show suggestions
+ matches = suggest_kenpom_match(odds_team, kenpom_teams)
+ print("\nSuggested KenPom matches:")
+ for i, (kenpom_team, score) in enumerate(matches):
+ print(f" {i + 1}. {kenpom_team} (similarity: {score:.3f})")
+
+ print(" s. Skip this team")
+ print(" q. Quit and save")
+
+ choice = input("\nEnter choice (1-5, s, q): ").strip().lower()
+
+ if choice == "q":
+ break
+ elif choice == "s":
+ continue
+ elif choice.isdigit() and 1 <= int(choice) <= 5:
+ kenpom_name = matches[int(choice) - 1][0]
+ print(f"\nMapping: {odds_team} -> {kenpom_name}")
+ confirm = input("Confirm? (y/n): ").strip().lower()
+ if confirm == "y":
+ new_mappings.append(
+ {
+ "kenpom_name": kenpom_name,
+ "odds_api_name": odds_team,
+ "espn_name": "", # Will need to be filled manually
+ }
+ )
+ logger.info(f"Added mapping: {odds_team} -> {kenpom_name}")
+
+ if new_mappings:
+ new_df = pd.DataFrame(new_mappings)
+ updated_df = pd.concat([mapping_df, new_df], ignore_index=True)
+ logger.info(f"Added {len(new_mappings)} manual mappings")
+ return updated_df
+ else:
+ logger.info("No new mappings added")
+ return mapping_df
+
+
+def main() -> None:
+ """Fix team mapping issues."""
+ parser = argparse.ArgumentParser(description="Fix team mapping issues")
+ parser.add_argument(
+ "--odds-db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to Odds API database",
+ )
+ parser.add_argument(
+ "--team-mapping",
+ type=Path,
+ default=Path("data/staging/mappings/team_mapping.parquet"),
+ help="Path to team mapping file",
+ )
+ parser.add_argument(
+ "--kenpom-path",
+ type=Path,
+ default=Path("data/kenpom"),
+ help="Path to KenPom data directory",
+ )
+ parser.add_argument(
+ "--list",
+ action="store_true",
+ help="List unmapped teams",
+ )
+ parser.add_argument(
+ "--auto-fix",
+ action="store_true",
+ help="Automatically fix high-confidence matches",
+ )
+ parser.add_argument(
+ "--interactive",
+ action="store_true",
+ help="Interactive mapping mode",
+ )
+ parser.add_argument(
+ "--threshold",
+ type=float,
+ default=0.8,
+ help="Similarity threshold for auto-fix (0-1)",
+ )
+
+ args = parser.parse_args()
+
+ # Get unmapped teams
+ unmapped, mapper = get_unmapped_teams(args.odds_db, args.team_mapping)
+
+ if len(unmapped) == 0:
+ logger.info("[OK] All teams are mapped!")
+ return
+
+ logger.info(f"Found {len(unmapped)} unmapped teams")
+
+ # Load KenPom teams
+ kenpom_ratings_path = args.kenpom_path / "ratings" / "season" / "ratings_2026.parquet"
+ if not kenpom_ratings_path.exists():
+ logger.error(f"KenPom ratings not found: {kenpom_ratings_path}")
+ return
+
+ kenpom_df = pd.read_parquet(kenpom_ratings_path)
+ kenpom_teams = kenpom_df["TeamName"].unique().tolist()
+
+ # List mode
+ if args.list:
+ logger.info("\nUnmapped Odds API teams:")
+ for team in sorted(unmapped):
+ # Show suggestions
+ matches = suggest_kenpom_match(team, kenpom_teams)
+ best_match, score = matches[0]
+ logger.info(f" {team:40s} -> {best_match} (score: {score:.3f})")
+ return
+
+ # Load current mapping
+ mapping_df = pd.read_parquet(args.team_mapping)
+
+ # Auto-fix or interactive mode (identical save path)
+ if args.auto_fix or args.interactive:
+ if args.auto_fix:
+ updated_df = auto_fix_mappings(unmapped, kenpom_teams, mapping_df, args.threshold)
+ else:
+ updated_df = interactive_mapping(unmapped, kenpom_teams, mapping_df)
+ if len(updated_df) > len(mapping_df):
+ # Save backup of the previous mapping before overwriting
+ backup_path = args.team_mapping.with_suffix(".backup.parquet")
+ mapping_df.to_parquet(backup_path)
+ logger.info(f"Backup saved to: {backup_path}")
+
+ # Save updated mapping
+ updated_df.to_parquet(args.team_mapping)
+ logger.info(f"Updated mapping saved to: {args.team_mapping}")
+ logger.info(f"Added {len(updated_df) - len(mapping_df)} new mappings")
+ return
+
+ # Default: just list
+ logger.info("\nUnmapped Odds API teams:")
+ for team in sorted(unmapped):
+ logger.info(f" {team}")
+
+ logger.info(
+ "\nUse --list to see suggestions, --auto-fix to fix automatically, "
+ "or --interactive for manual mapping"
+ )
+
+
+if __name__ == "__main__":
+ main()
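The script above relies on `suggest_kenpom_match`, which is defined elsewhere in the file. A minimal sketch of such a matcher, using only the standard library's `difflib` (the function name, signature, and the five-suggestion cutoff are assumptions to mirror the prompt above):

```python
from difflib import SequenceMatcher


def suggest_kenpom_match(
    odds_team: str, kenpom_teams: list[str], top_n: int = 5
) -> list[tuple[str, float]]:
    """Return the top_n KenPom names most similar to odds_team.

    Similarity is difflib's ratio on casefolded names, in [0, 1].
    """
    scored = [
        (name, SequenceMatcher(None, odds_team.casefold(), name.casefold()).ratio())
        for name in kenpom_teams
    ]
    # Highest similarity first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


matches = suggest_kenpom_match("UConn Huskies", ["Connecticut", "Duke", "UConn"])
```

`SequenceMatcher.ratio()` rewards long common substrings, so it tends to rank abbreviations and suffixed variants ("St." vs "State") reasonably, but it is a sketch, not the project's actual matcher.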
diff --git a/scripts/archive/2026-02-deprecated/force_update_views.py b/scripts/archive/2026-02-deprecated/force_update_views.py
new file mode 100644
index 000000000..70bd31831
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/force_update_views.py
@@ -0,0 +1,85 @@
+"""Force update SQL views in Odds API database.
+
+This script drops and recreates all views to ensure they use latest definitions.
+Run this after updating sql/create_normalized_views.sql.
+
+Usage:
+ uv run python scripts/force_update_views.py
+"""
+
+import sqlite3
+from pathlib import Path
+
+# Connect to database
+db_path = Path("data/odds_api/odds_api.sqlite3")
+conn = sqlite3.connect(str(db_path))
+
+# Enable write-ahead logging checkpoint
+conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
+
+# Drop all views in reverse dependency order
+views = [
+ "ml_line_features",
+ "bookmaker_consensus",
+ "spread_movements",
+ "canonical_moneylines",
+ "canonical_totals",
+ "canonical_spreads",
+]
+
+print("Dropping existing views...")
+for view in views:
+ try:
+ conn.execute(f"DROP VIEW IF EXISTS {view}")
+ print(f" [OK] Dropped {view}")
+ except Exception as e:
+ print(f" [ERROR] Could not drop {view}: {e}")
+
+conn.commit()
+
+# Recreate views from SQL file
+print("\nRecreating views...")
+views_sql = Path("sql/create_normalized_views.sql")
+
+if not views_sql.exists():
+ print(f"[ERROR] SQL file not found: {views_sql}")
+ raise SystemExit(1)
+
+with open(views_sql) as f:
+ sql = f.read()
+
+# Check for STDEV (should not be present)
+if "STDEV" in sql:
+ print("[WARNING] SQL file contains STDEV - this won't work in SQLite!")
+ print("Please replace STDEV with (MAX - MIN), which SQLite supports, as a dispersion proxy")
+ raise SystemExit(1)
+
+try:
+ conn.executescript(sql)
+ conn.commit()
+ print("[OK] All views recreated successfully")
+except Exception as e:
+ print(f"[ERROR] Failed to create views: {e}")
+ raise SystemExit(1)
+
+# Verify views were created
+print("\nVerifying views...")
+for view in views:
+ result = conn.execute(
+ "SELECT sql FROM sqlite_master WHERE type='view' AND name=?", (view,)
+ ).fetchone()
+
+ if result:
+ sql_def = result[0]
+ if "STDEV" in sql_def:
+ print(f" [ERROR] {view} still contains STDEV!")
+ else:
+ print(f" [OK] {view} created successfully")
+ else:
+ print(f" [ERROR] {view} not found in database")
+
+# Checkpoint and close
+conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
+conn.close()
+
+print("\n[OK] Database views updated successfully!")
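The script above refuses to run when the view SQL contains `STDEV`, since SQLite ships no standard-deviation aggregate. If the `(MAX - MIN)` proxy ever proves too coarse, one alternative (a sketch, not part of this pipeline) is to register a Python aggregate on the connection before creating the views:

```python
import sqlite3


class Stdev:
    """Population standard deviation aggregate for SQLite (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def step(self, value):
        # SQLite passes NULLs through; ignore them like built-in aggregates do
        if value is None:
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        return (self.m2 / self.n) ** 0.5 if self.n else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("STDEV", 1, Stdev)
conn.execute("CREATE TABLE lines (spread REAL)")
conn.executemany("INSERT INTO lines VALUES (?)", [(2.0,), (4.0,), (6.0,)])
result = conn.execute("SELECT STDEV(spread) FROM lines").fetchone()[0]
```

The catch is that the aggregate exists only on connections that register it, which is why keeping views free of `STDEV` (as this script enforces) is the safer default.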
diff --git a/scripts/archive/2026-02-deprecated/live_line_monitor_overtime.py b/scripts/archive/2026-02-deprecated/live_line_monitor_overtime.py
new file mode 100644
index 000000000..6d0122e41
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/live_line_monitor_overtime.py
@@ -0,0 +1,382 @@
+#!/usr/bin/env python3
+"""Live line movement monitor using overtime.ag data.
+
+Tracks all games from overtime.ag parquet files and highlights movements.
+"""
+
+import json
+import logging
+import time
+from datetime import datetime
+from glob import glob
+from pathlib import Path
+
+import pandas as pd
+from rich.console import Console
+from rich.live import Live
+from rich.table import Table
+from rich.text import Text
+
+# Configuration
+OVERTIME_DATA_DIR = Path("data/raw")
+REFRESH_INTERVAL = 15 # seconds
+STEAM_THRESHOLD = 1.0 # Point movement to highlight as "steam"
+EDGE_THRESHOLD = 3.5 # KenPom edge to highlight
+LOG_DIR = Path("data/logs")
+
+# Setup logging
+LOG_DIR.mkdir(parents=True, exist_ok=True)
+log_file = LOG_DIR / f"line_movements_overtime_{datetime.now().date().isoformat()}.log"
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(message)s",
+ handlers=[logging.FileHandler(log_file, mode="a")],
+)
+logger = logging.getLogger(__name__)
+
+
+class OvertimeLineMonitor:
+ """Monitor overtime.ag line movements."""
+
+ def __init__(self):
+ self.console = Console()
+ self.previous_lines: dict[str, dict] = {}
+ self.kenpom_edges: dict[str, float] = {}
+ self.load_kenpom_edges()
+
+ def log_movement(
+ self,
+ game_id: str,
+ away_team: str,
+ home_team: str,
+ game_time: str,
+ market_type: str,
+ old_value: float,
+ new_value: float,
+ movement: float,
+ kenpom_edge: float = 0.0,
+ ):
+ """Log a line movement event to file."""
+ log_entry = {
+ "timestamp": datetime.now().isoformat(),
+ "game_id": game_id,
+ "game": f"{away_team} @ {home_team}",
+ "game_time": game_time,
+ "market_type": market_type,
+ "old_value": old_value,
+ "new_value": new_value,
+ "movement": movement,
+ "is_steam": abs(movement) >= STEAM_THRESHOLD,
+ "kenpom_edge": kenpom_edge,
+ "is_edge_opportunity": kenpom_edge >= EDGE_THRESHOLD,
+ "book": "overtime.ag",
+ }
+ logger.info(json.dumps(log_entry))
+
+ def load_kenpom_edges(self):
+ """Load KenPom edges from today's analysis."""
+ today = datetime.now().date().isoformat()
+ edges_path = Path(f"data/analysis/edge_analysis_{today}.csv")
+
+ if edges_path.exists():
+ df = pd.read_csv(edges_path)
+ for _, row in df.iterrows():
+ # Create multiple key formats for matching
+ key1 = f"{row['away_team']}@{row['home_team']}"
+ key2 = f"{row['away_team']} @ {row['home_team']}"
+ edge = row.get("abs_discrepancy", 0)
+ self.kenpom_edges[key1] = edge
+ self.kenpom_edges[key2] = edge
+
+ def get_latest_overtime_data(self) -> pd.DataFrame | None:
+ """Get latest overtime.ag data from parquet files."""
+ # Find latest parquet file
+ pattern = str(OVERTIME_DATA_DIR / "overtime_lines_*.parquet")
+ files = sorted(glob(pattern), reverse=True)
+
+ if not files:
+ return None
+
+ # Load most recent file
+ df = pd.read_parquet(files[0])
+
+ # Filter to basketball rows (sport column is present in newer snapshots)
+ if "sport" in df.columns:
+ df = df[df["sport"].str.contains("Basketball", case=False, na=False)]
+
+ return df
+
+ def calculate_movements(self, current: pd.DataFrame) -> dict[str, dict]:
+ """Calculate line movements from previous snapshot."""
+ movements = {}
+
+ for game_id in current["game_id"].unique():
+ game_data = current[current["game_id"] == game_id].iloc[0]
+
+ if game_id not in self.previous_lines:
+ continue
+
+ prev = self.previous_lines[game_id]
+ game_movements = {}
+
+ # Get game info for logging
+ away_team = game_data.get("away_team", "Unknown")
+ home_team = game_data.get("home_team", "Unknown")
+ game_time = game_data.get("game_time", "")
+ game_key = f"{away_team}@{home_team}"
+ kenpom_edge = self.kenpom_edges.get(game_key, 0.0)
+
+ # Check spread movement
+ if "home_spread" in game_data and "home_spread" in prev:
+ current_spread = game_data["home_spread"]
+ prev_spread = prev["home_spread"]
+
+ if pd.notna(current_spread) and pd.notna(prev_spread):
+ old_value = float(prev_spread)
+ new_value = float(current_spread)
+ movement = new_value - old_value
+
+ # Log all movements (not just steam)
+ if movement != 0:
+ self.log_movement(
+ game_id=game_id,
+ away_team=away_team,
+ home_team=home_team,
+ game_time=str(game_time),
+ market_type="spread",
+ old_value=old_value,
+ new_value=new_value,
+ movement=movement,
+ kenpom_edge=kenpom_edge,
+ )
+
+ # Track steam for display
+ if abs(movement) >= STEAM_THRESHOLD:
+ game_movements["spread"] = movement
+
+ # Check total movement
+ if "total" in game_data and "total" in prev:
+ current_total = game_data["total"]
+ prev_total = prev["total"]
+
+ if pd.notna(current_total) and pd.notna(prev_total):
+ old_value = float(prev_total)
+ new_value = float(current_total)
+ movement = new_value - old_value
+
+ # Log all movements (not just steam)
+ if movement != 0:
+ self.log_movement(
+ game_id=game_id,
+ away_team=away_team,
+ home_team=home_team,
+ game_time=str(game_time),
+ market_type="total",
+ old_value=old_value,
+ new_value=new_value,
+ movement=movement,
+ kenpom_edge=kenpom_edge,
+ )
+
+ # Track steam for display
+ if abs(movement) >= STEAM_THRESHOLD:
+ game_movements["total"] = movement
+
+ if game_movements:
+ movements[game_id] = game_movements
+
+ return movements
+
+ def store_current_lines(self, current: pd.DataFrame):
+ """Store current lines for movement tracking."""
+ for game_id in current["game_id"].unique():
+ game_data = current[current["game_id"] == game_id].iloc[0]
+
+ self.previous_lines[game_id] = {
+ "home_spread": game_data.get("home_spread"),
+ "total": game_data.get("total"),
+ }
+
+ def create_display_table(self, games: pd.DataFrame, movements: dict[str, dict]) -> Table:
+ """Create rich table for display."""
+ timestamp = datetime.now().strftime("%I:%M:%S %p")
+ table = Table(
+ title=f"[bold]LIVE LINE MONITOR - overtime.ag[/bold] - {timestamp}",
+ show_header=True,
+ header_style="bold cyan",
+ title_style="bold cyan",
+ )
+
+ table.add_column("Time", style="dim", width=8)
+ table.add_column("Game", width=40)
+ table.add_column("Spread", justify="right", width=15)
+ table.add_column("Move", justify="center", width=8)
+ table.add_column("Total", justify="right", width=12)
+ table.add_column("Move", justify="center", width=8)
+ table.add_column("Edge", justify="right", width=8)
+
+ for _, game in games.iterrows():
+ game_id = game["game_id"]
+ away_team = game.get("away_team", "Unknown")
+ home_team = game.get("home_team", "Unknown")
+
+ # Get game time
+ game_time = game.get("game_time", "")
+ if isinstance(game_time, str) and game_time:
+ try:
+ dt = pd.to_datetime(game_time)
+ time_str = dt.strftime("%I:%M %p")
+ except (ValueError, TypeError):
+ time_str = game_time[:8] if len(game_time) >= 8 else game_time
+ else:
+ time_str = "N/A"
+
+ # Get KenPom edge
+ game_key = f"{away_team}@{home_team}"
+ kp_edge = self.kenpom_edges.get(game_key, 0)
+
+ # Format spread
+ home_spread = game.get("home_spread")
+ home_juice = game.get("home_spread_juice")
+ spread_str = "N/A"
+ if pd.notna(home_spread) and pd.notna(home_juice):
+ spread_str = f"{home_spread:+.1f} ({home_juice:+.0f})"
+ elif pd.notna(home_spread):
+ spread_str = f"{home_spread:+.1f}"
+
+ # Format total
+ total = game.get("total")
+ over_juice = game.get("over_juice")
+ total_str = "N/A"
+ if pd.notna(total) and pd.notna(over_juice):
+ total_str = f"{total:.1f} ({over_juice:+.0f})"
+ elif pd.notna(total):
+ total_str = f"{total:.1f}"
+
+ # Check movements
+ spread_move = ""
+ total_move = ""
+ if game_id in movements:
+ mvts = movements[game_id]
+ if "spread" in mvts:
+ move = mvts["spread"]
+ spread_move = Text(
+ f"{move:+.1f}",
+ style="bold red" if abs(move) >= 2 else "yellow",
+ )
+ if "total" in mvts:
+ move = mvts["total"]
+ total_move = Text(
+ f"{move:+.1f}",
+ style="bold red" if abs(move) >= 2 else "yellow",
+ )
+
+ # Format edge
+ edge_str = ""
+ edge_style = "white"
+ if kp_edge >= 7:
+ edge_str = f"{kp_edge:.1f}"
+ edge_style = "bold green"
+ elif kp_edge >= EDGE_THRESHOLD:
+ edge_str = f"{kp_edge:.1f}"
+ edge_style = "green"
+
+ # Add row
+ game_str = f"{away_team}\n@ {home_team}"
+ table.add_row(
+ time_str,
+ game_str,
+ spread_str,
+ spread_move or "-",
+ total_str,
+ total_move or "-",
+ Text(edge_str, style=edge_style) if edge_str else "-",
+ )
+
+ return table
+
+ def run(self):
+ """Run live monitor."""
+ self.console.print("\n[bold cyan]Starting Live Line Monitor - overtime.ag[/bold cyan]")
+ self.console.print(f"Refresh interval: {REFRESH_INTERVAL}s")
+ self.console.print(f"Steam threshold: {STEAM_THRESHOLD} points")
+ self.console.print(f"Edge threshold: {EDGE_THRESHOLD}+ points")
+ self.console.print(f"KenPom edges loaded: {len(self.kenpom_edges) // 2} games")
+ self.console.print(f"Logging to: {log_file}\n")
+
+ # Log monitoring session start
+ logger.info(
+ json.dumps(
+ {
+ "timestamp": datetime.now().isoformat(),
+ "event": "monitor_started",
+ "config": {
+ "refresh_interval": REFRESH_INTERVAL,
+ "steam_threshold": STEAM_THRESHOLD,
+ "edge_threshold": EDGE_THRESHOLD,
+ "book": "overtime.ag",
+ },
+ }
+ )
+ )
+
+ with Live(console=self.console, refresh_per_second=1) as live:
+ while True:
+ try:
+ # Get current lines
+ current = self.get_latest_overtime_data()
+
+ if current is None or len(current) == 0:
+ live.update(
+ Text(
+ "No overtime.ag data found - check collection is running",
+ style="yellow",
+ )
+ )
+ time.sleep(REFRESH_INTERVAL)
+ continue
+
+ # Calculate movements
+ movements = self.calculate_movements(current)
+
+ # Create display
+ table = self.create_display_table(current, movements)
+
+ # Add legend and stats
+ legend = Text("\n")
+ legend.append("Legend: ", style="dim")
+ legend.append("GREEN", style="bold green")
+ legend.append(" = KenPom edge | ", style="dim")
+ legend.append("RED/YELLOW", style="bold red")
+ legend.append(" = Line movement | ", style="dim")
+
+ if movements:
+ legend.append(f"\n{len(movements)} STEAM MOVES DETECTED", style="bold red")
+
+ # Update display
+ live.update(table)
+ live.console.print(legend)
+
+ # Store for next iteration
+ self.store_current_lines(current)
+
+ # Wait
+ time.sleep(REFRESH_INTERVAL)
+
+ except KeyboardInterrupt:
+ self.console.print("\n[yellow]Shutting down monitor...[/yellow]")
+ break
+ except Exception as e:
+ self.console.print(f"[red]Error: {e}[/red]")
+ time.sleep(REFRESH_INTERVAL)
+
+
+def main():
+ """Main entry point."""
+ monitor = OvertimeLineMonitor()
+ monitor.run()
+
+
+if __name__ == "__main__":
+ main()
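The monitor above appends one JSON object per movement (plus a `monitor_started` session event) to a daily log file. A sketch of loading that log back for offline analysis, with field names taken from the `log_movement` payload above (the loader function itself is hypothetical):

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_movement_log(log_file: Path) -> pd.DataFrame:
    """Parse a line-movement JSONL log into a DataFrame, skipping session events."""
    records = []
    for line in log_file.read_text().splitlines():
        entry = json.loads(line)
        if "movement" in entry:  # skip monitor_started and other session events
            records.append(entry)
    return pd.DataFrame(records)


# Tiny demo with the two record shapes the monitor writes
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "line_movements_overtime_demo.log"
    demo.write_text(
        json.dumps({"event": "monitor_started"}) + "\n"
        + json.dumps({"game_id": "g1", "market_type": "spread", "movement": 1.5}) + "\n"
    )
    df = load_movement_log(demo)
```

Because each line is an independent JSON object, a crash mid-write costs at most one record, which suits an always-on monitor better than a single JSON array would.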
diff --git a/scripts/archive/2026-02-deprecated/manual_overtime_entry.py b/scripts/archive/2026-02-deprecated/manual_overtime_entry.py
new file mode 100644
index 000000000..7f8580151
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/manual_overtime_entry.py
@@ -0,0 +1,220 @@
+"""Manual entry script for overtime.ag daily data.
+
+Use this for quick manual data entry until full automation is implemented.
+
+Usage:
+ uv run python scripts/manual_overtime_entry.py
+"""
+
+import asyncio
+import logging
+import sys
+from datetime import datetime
+from decimal import Decimal
+from pathlib import Path
+
+# Add src to path
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+
+from sports_betting_edge.adapters.overtime_scraper import OvertimeScraperAdapter
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.core.tracking.overtime import (
+ AccountBalance,
+ DailyFigure,
+ DailyFiguresSnapshot,
+ OpenBet,
+ OpenBetsSnapshot,
+ OvertimeSnapshot,
+)
+from sports_betting_edge.services.overtime_tracker import OvertimeTrackerService
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+def get_decimal_input(prompt: str) -> Decimal:
+ """Get decimal input from user; empty input is treated as $0."""
+ while True:
+ raw = input(prompt).strip().replace("$", "").replace(",", "")
+ if not raw:
+ return Decimal("0.00")
+ try:
+ return Decimal(raw)
+ except Exception as e:
+ print(f"Invalid input: {e}. Please try again.")
+
+
+def get_int_input(prompt: str) -> int:
+ """Get integer input from user."""
+ while True:
+ try:
+ return int(input(prompt).strip())
+ except Exception:
+ print("Invalid input. Please enter a number.")
+
+
+def manual_entry() -> OvertimeSnapshot:
+ """Manually enter overtime.ag data."""
+ print("\n" + "=" * 80)
+ print("OVERTIME.AG MANUAL DATA ENTRY")
+ print("=" * 80)
+ print("")
+
+ # Account Balance
+ print("[1/3] ACCOUNT BALANCE")
+ print("-" * 40)
+ balance = get_decimal_input("Current Balance: $")
+ credit_limit = get_decimal_input("Credit Limit: $")
+ pending = get_decimal_input("Pending: $")
+ available = get_decimal_input("Available Balance: $")
+ casino = get_decimal_input("Casino Balance (press Enter for $0): $") or Decimal("0.00")
+
+ account_balance = AccountBalance(
+ balance=balance,
+ credit_limit=credit_limit,
+ pending=pending,
+ available_balance=available,
+ casino_balance=casino,
+ )
+
+ # Open Bets
+ print("\n[2/3] OPEN BETS")
+ print("-" * 40)
+ num_bets = get_int_input("Number of open bets: ")
+
+ bets = []
+ for i in range(num_bets):
+ print(f"\nBet #{i + 1}:")
+ ticket = input(" Ticket Number: ").strip()
+ bet_type = input(" Type (Spread/Total Points/Money Line): ").strip()
+ details = input(" Details (full string): ").strip()
+ risk = get_decimal_input(" Risk Amount: $")
+ to_win = get_decimal_input(" To Win: $")
+
+ bet = OpenBet(
+ ticket_number=ticket,
+ accepted_date=datetime.now(), # Use current time
+ bet_type=bet_type,
+ details=details,
+ risk_amount=risk,
+ to_win_amount=to_win,
+ )
+ bets.append(bet)
+
+ # Seed sum() with a Decimal so an empty bet list still yields a Decimal total
+ total_risk = sum((bet.risk_amount for bet in bets), Decimal("0.00"))
+ total_to_win = sum((bet.to_win_amount for bet in bets), Decimal("0.00"))
+
+ open_bets = OpenBetsSnapshot(
+ bets=bets,
+ total_risk=total_risk,
+ total_to_win=total_to_win,
+ )
+
+ # Daily Figures - Current Week
+ print("\n[3/3] DAILY FIGURES - CURRENT WEEK")
+ print("-" * 40)
+ starting_bal = get_decimal_input("Starting Balance: $")
+ monday = get_decimal_input("Monday P&L: $")
+ tuesday = get_decimal_input("Tuesday P&L (press Enter for $0): $") or Decimal("0.00")
+ wednesday = get_decimal_input("Wednesday P&L (press Enter for $0): $") or Decimal("0.00")
+ thursday = get_decimal_input("Thursday P&L (press Enter for $0): $") or Decimal("0.00")
+ friday = get_decimal_input("Friday P&L (press Enter for $0): $") or Decimal("0.00")
+ saturday = get_decimal_input("Saturday P&L (press Enter for $0): $") or Decimal("0.00")
+ sunday = get_decimal_input("Sunday P&L (press Enter for $0): $") or Decimal("0.00")
+
+ week_total = monday + tuesday + wednesday + thursday + friday + saturday + sunday
+ payments = get_decimal_input("Payments (press Enter for $0): $") or Decimal("0.00")
+ ending_bal = starting_bal + week_total + payments
+
+ current_week = DailyFigure(
+ date=datetime.now(),
+ starting_balance=starting_bal,
+ monday=monday,
+ tuesday=tuesday,
+ wednesday=wednesday,
+ thursday=thursday,
+ friday=friday,
+ saturday=saturday,
+ sunday=sunday,
+ week_total=week_total,
+ payments=payments,
+ ending_balance=ending_bal,
+ )
+
+ # For simplicity, create empty last week data
+ last_week = DailyFigure(
+ date=datetime.now(),
+ starting_balance=Decimal("0.00"),
+ week_total=Decimal("0.00"),
+ ending_balance=Decimal("0.00"),
+ )
+
+ daily_figures = DailyFiguresSnapshot(
+ current_week=current_week,
+ last_week=last_week,
+ past_weeks=[],
+ )
+
+ # Create complete snapshot
+ snapshot = OvertimeSnapshot(
+ account_balance=account_balance,
+ open_bets=open_bets,
+ daily_figures=daily_figures,
+ )
+
+ return snapshot
+
+
+def main():
+ """Main execution function."""
+ print("\nThis script allows you to manually enter overtime.ag data")
+ print("and save it to parquet files for analysis.")
+ print("\nPress Ctrl+C at any time to cancel.\n")
+
+ try:
+ snapshot = manual_entry()
+
+ # Setup service
+ project_root = Path(__file__).parent.parent
+ data_dir = project_root / "data" / "overtime"
+
+ # Note: OvertimeScraperAdapter is not needed for manual entry
+ # We create a dummy adapter
+ class DummyClient:
+ pass
+
+ scraper = OvertimeScraperAdapter(DummyClient())
+ service = OvertimeTrackerService(scraper, data_dir)
+
+ # Save snapshot
+ print("\n" + "=" * 80)
+ print("SAVING DATA...")
+ print("=" * 80)
+
+ saved_paths = asyncio.run(service.save_full_snapshot(snapshot))
+
+ print("\n[OK] Data saved successfully!")
+ print("\nFiles created:")
+ for snap_type, path in saved_paths.items():
+ print(f" - {snap_type}: {path}")
+
+ print("\n" + "=" * 80)
+ print("SUMMARY")
+ print("=" * 80)
+ print(f"Account Balance: ${snapshot.account_balance.balance}")
+ print(
+ f"Open Bets: {len(snapshot.open_bets.bets)} (${snapshot.open_bets.total_risk} at risk)"
+ )
+ print(f"Week Total P&L: ${snapshot.daily_figures.current_week.week_total}")
+ print("=" * 80)
+
+ except KeyboardInterrupt:
+ print("\n\n[CANCELLED] Data entry cancelled by user.")
+ sys.exit(0)
+ except Exception as e:
+ logger.error(f"Error during manual entry: {e}", exc_info=True)
+ print(f"\n[ERROR] {e}")
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ import asyncio
+
+ main()
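The monitor in the next file compares model probabilities against the market's implied probability derived from American odds, the same convention the repository's moneyline rules require. A standalone sketch of that conversion (the function name is an assumption; the formulas match the `calculate_edge` logic used by the monitor):

```python
def american_to_implied_prob(american_odds: int) -> float:
    """Convert American odds to the implied win probability, in [0, 1].

    Negative odds (favorites): |odds| / (|odds| + 100).
    Positive odds (underdogs): 100 / (odds + 100).
    """
    if american_odds < 0:
        return abs(american_odds) / (abs(american_odds) + 100)
    return 100 / (american_odds + 100)
```

Note the two implied probabilities of a market sum to more than 1 (the vig), so aggregating implied probabilities, as the canonical-markets rule prescribes, avoids the sign and juice pitfalls of averaging raw odds.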
diff --git a/scripts/archive/2026-02-deprecated/monitor_live_lines.py b/scripts/archive/2026-02-deprecated/monitor_live_lines.py
new file mode 100644
index 000000000..892281f92
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/monitor_live_lines.py
@@ -0,0 +1,433 @@
+"""Monitor live Overtime.ag line movements with ML predictions.
+
+Combines real-time SignalR line streaming with XGBoost model predictions to identify
+value betting opportunities. Displays side-by-side comparison of market lines vs
+model predictions, highlighting edges.
+
+Prerequisites:
+ 1. Chrome with remote debugging:
+ chrome.exe --remote-debugging-port=9222 \
+ --user-data-dir=%USERPROFILE%\.chrome-profiles\overtime-ag
+
+ 2. overtime.ag logged in and navigated to College Basketball
+
+ 3. Today's odds collected:
+ uv run python scripts/collect_daily_data.py
+
+Usage:
+ # Monitor with today's predictions
+ uv run python scripts/monitor_live_lines.py
+
+ # Generate predictions first, then monitor
+ uv run python scripts/predict_today.py && \
+ uv run python scripts/monitor_live_lines.py
+
+ # Monitor for specific duration (1 hour)
+ uv run python scripts/monitor_live_lines.py --duration 3600
+
+ # Custom minimum edge threshold (10%)
+ uv run python scripts/monitor_live_lines.py --min-edge 0.10
+
+Output:
+ Live terminal display showing:
+ - Line movements as they happen
+ - Model predictions for each game
+ - Edge calculation (model prob - market prob)
+ - Value alerts when edge exceeds threshold
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from collections import defaultdict
+from datetime import date
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+# Add project root to path
+PROJECT_ROOT = Path(__file__).parent.parent
+sys.path.insert(0, str(PROJECT_ROOT / "src"))
+
+from sports_betting_edge.adapters.overtime_ag import ( # noqa: E402
+ OvertimeSignalRClient,
+)
+from sports_betting_edge.core.exceptions import ConfigurationError # noqa: E402
+from sports_betting_edge.core.models import MarketType # noqa: E402
+
+# Setup logging
+logging.basicConfig(
+ level=logging.INFO,
+ format="[%(asctime)s] %(levelname)s - %(message)s",
+ datefmt="%H:%M:%S",
+)
+logger = logging.getLogger(__name__)
+
+
+class LiveLineMonitor:
+ """Monitors live lines and compares with model predictions."""
+
+ def __init__(
+ self,
+ predictions_path: Path,
+ min_edge: float = 0.05,
+ team_mapper: dict[str, str] | None = None,
+ ):
+ """Initialize the monitor.
+
+ Args:
+ predictions_path: Path to today's predictions CSV
+ min_edge: Minimum edge to trigger value alert (default: 5%)
+ team_mapper: Overtime team name -> predictions team name mapping
+ """
+ self.predictions_path = predictions_path
+ self.min_edge = min_edge
+ self.team_mapper = team_mapper or {}
+
+ # Load predictions
+ self.predictions = self._load_predictions()
+
+ # Track line history
+ self.line_history: dict[int, list[dict[str, Any]]] = defaultdict(list)
+ self.steam_count: dict[int, int] = defaultdict(int)
+ self.total_changes = 0
+ self.value_alerts = 0
+
+ def _load_predictions(self) -> pd.DataFrame:
+ """Load today's predictions.
+
+ Returns:
+ Predictions DataFrame
+
+ Raises:
+ FileNotFoundError: If predictions file doesn't exist
+ """
+ if not self.predictions_path.exists():
+ raise FileNotFoundError(
+ f"Predictions not found: {self.predictions_path}\n"
+ f"Run: uv run python scripts/predict_today.py"
+ )
+
+ df = pd.read_csv(self.predictions_path)
+ logger.info(f"Loaded predictions for {len(df)} games")
+
+ # Display summary
+ logger.info("\n" + "=" * 80)
+ logger.info("TODAY'S PREDICTIONS LOADED")
+ logger.info("=" * 80)
+
+ for _, game in df.iterrows():
+ logger.info(
+ f"{game['away_team']} @ {game['home_team']} | "
+ f"Spread: {game['favorite_team']} -{game['spread_magnitude']} "
+ f"(Fav: {game['favorite_cover_prob']:.1%}) | "
+ f"Total: {game['total_points']} (Over: {game['over_prob']:.1%})"
+ )
+
+ logger.info("=" * 80 + "\n")
+
+ return df
+
+ def get_prediction_for_team(
+ self, team_name: str, market_type: MarketType
+ ) -> dict[str, Any] | None:
+ """Get prediction for a specific team/game.
+
+ Args:
+ team_name: Team name from Overtime.ag
+ market_type: SPREAD, TOTAL, or MONEYLINE
+
+ Returns:
+ Prediction dict or None if not found
+ """
+ # Map Overtime team name to predictions team name
+ mapped_name = self.team_mapper.get(team_name, team_name)
+
+ # Find game involving this team
+ game = self.predictions[
+ (self.predictions["home_team"] == mapped_name)
+ | (self.predictions["away_team"] == mapped_name)
+ | (self.predictions["favorite_team"] == mapped_name)
+ | (self.predictions["underdog_team"] == mapped_name)
+ ]
+
+ if len(game) == 0:
+ return None
+
+ game = game.iloc[0]
+
+ if market_type == MarketType.SPREAD:
+ return {
+ "favorite_team": game["favorite_team"],
+ "underdog_team": game["underdog_team"],
+ "spread_magnitude": game["spread_magnitude"],
+ "favorite_cover_prob": game["favorite_cover_prob"],
+ "underdog_cover_prob": game["underdog_cover_prob"],
+ "spread_edge": game["spread_edge"],
+ }
+ elif market_type == MarketType.TOTAL:
+ return {
+ "home_team": game["home_team"],
+ "away_team": game["away_team"],
+ "total_points": game["total_points"],
+ "over_prob": game["over_prob"],
+ "under_prob": game["under_prob"],
+ "total_edge": game["total_edge"],
+ }
+
+ return None
+
+ def calculate_edge(
+ self,
+ model_prob: float,
+ american_odds: int,
+ ) -> float:
+ """Calculate edge (model prob - implied prob).
+
+ Args:
+ model_prob: Model probability (0-1)
+ american_odds: American odds (e.g., -110, +150)
+
+ Returns:
+ Edge as decimal (e.g., 0.05 = 5% edge)
+ """
+ # Convert American odds to implied probability
+ if american_odds < 0:
+ implied_prob = abs(american_odds) / (abs(american_odds) + 100)
+ else:
+ implied_prob = 100 / (american_odds + 100)
+
+ # Edge = model probability - implied probability
+ return model_prob - implied_prob
+
+ def format_line_change(self, change: dict[str, Any]) -> str:
+ """Format line change for display.
+
+ Args:
+ change: Line change data
+
+ Returns:
+ Formatted string
+ """
+ team = change.get("team", f"Game#{change['game_num']}")
+ market = change["market_type"].value
+ line = change.get("line_points")
+ line_display = f"{line:+.1f}" if line is not None else "ML"
+ steam_flag = " [STEAM]" if change.get("is_steam") else ""
+
+ return f"{team} {market} {line_display}{steam_flag}"
+
+ async def monitor(self, duration_seconds: int | None = None) -> None:
+ """Start monitoring live lines.
+
+ Args:
+ duration_seconds: How long to monitor (None = indefinite)
+ """
+ logger.info("=" * 80)
+ logger.info("LIVE LINE MONITORING STARTED")
+ logger.info("=" * 80)
+ logger.info(f"Duration: {duration_seconds}s" if duration_seconds else "Indefinite")
+ logger.info(f"Value alert threshold: {self.min_edge:.1%}")
+ logger.info("=" * 80 + "\n")
+
+ try:
+ async with OvertimeSignalRClient() as client:
+ logger.info("[OK] Connected to Overtime.ag SignalR stream\n")
+
+ async for change in client.stream_line_changes(duration_seconds):
+ # Filter out team totals and derivatives
+ if change.line_points:
+ # Skip spread magnitudes > 50 (team totals misclassified as spreads)
+ if change.market_type == MarketType.SPREAD and abs(change.line_points) > 50:
+ continue
+
+ # Skip totals outside normal game total range (110-210)
+ # This filters: team totals (~70-90), 1H totals (~60-80), quarters, etc.
+ if change.market_type == MarketType.TOTAL and (
+ change.line_points < 110 or change.line_points > 210
+ ):
+ continue
+
+ self.total_changes += 1
+ game_num = change.game_num
+
+ # Track steam moves
+ if change.is_steam:
+ self.steam_count[game_num] += 1
+
+ # Store in history
+ self.line_history[game_num].append(
+ {
+ "timestamp": change.timestamp,
+ "market_type": change.market_type,
+ "line_points": change.line_points,
+ "side_role": change.side_role,
+ "money1": change.money1,
+ "money2": change.money2,
+ "is_steam": change.is_steam,
+ }
+ )
+
+ # Display line change
+ line_str = self.format_line_change(change.model_dump())
+ logger.info(f"[LINE] {line_str}")
+
+ # Check for predictions and calculate edge
+ if change.team:
+ prediction = self.get_prediction_for_team(change.team, change.market_type)
+
+ if prediction:
+ # Calculate edge based on market type
+ if change.market_type == MarketType.SPREAD:
+ if change.team == prediction["favorite_team"]:
+ model_prob = prediction["favorite_cover_prob"]
+ odds = change.money1 # Favorite odds
+ else:
+ model_prob = prediction["underdog_cover_prob"]
+ odds = change.money2 # Underdog odds
+
+ edge = self.calculate_edge(model_prob, odds)
+
+ logger.info(
+ f" [MODEL] Fav cover prob: "
+ f"{prediction['favorite_cover_prob']:.1%} | "
+ f"Edge: {edge:+.1%}"
+ )
+
+ # Value alert
+ if abs(edge) >= self.min_edge:
+ self.value_alerts += 1
+ side = "FAVORITE" if edge > 0 else "UNDERDOG"
+ logger.info(
+ f" [VALUE] ** {side} EDGE DETECTED: {edge:+.1%} **"
+ )
+
+ elif change.market_type == MarketType.TOTAL:
+ # Determine if this is over or under side
+ # For totals, money1 is typically over, money2 is under
+ over_prob = prediction["over_prob"]
+ under_prob = prediction["under_prob"]
+
+ over_edge = self.calculate_edge(over_prob, change.money1)
+ under_edge = self.calculate_edge(under_prob, change.money2)
+
+ logger.info(
+ f" [MODEL] Over: {over_prob:.1%} "
+ f"(edge: {over_edge:+.1%}) | "
+ f"Under: {under_prob:.1%} (edge: {under_edge:+.1%})"
+ )
+
+ # Value alert for best edge
+ max_edge = max(abs(over_edge), abs(under_edge))
+ if max_edge >= self.min_edge:
+ self.value_alerts += 1
+ side = "OVER" if abs(over_edge) > abs(under_edge) else "UNDER"
+ edge_val = over_edge if side == "OVER" else under_edge
+ logger.info(
+ f" [VALUE] ** {side} EDGE DETECTED: {edge_val:+.1%} **"
+ )
+
+ logger.info("") # Blank line for readability
+
+ except ConfigurationError as e:
+ logger.error(f"Configuration error: {e}")
+ logger.error(
+ "Make sure Chrome is running with remote debugging and an overtime.ag tab is open"
+ )
+ raise
+ except KeyboardInterrupt:
+ logger.info("\n\nMonitoring stopped by user")
+ except Exception as e:
+ logger.exception(f"Monitoring error: {e}")
+ raise
+
+ # Display summary
+ logger.info("\n" + "=" * 80)
+ logger.info("MONITORING SUMMARY")
+ logger.info("=" * 80)
+ logger.info(f"Total line changes: {self.total_changes}")
+ logger.info(f"Games tracked: {len(self.line_history)}")
+ logger.info(f"Steam moves: {sum(self.steam_count.values())}")
+ logger.info(f"Value alerts: {self.value_alerts}")
+ logger.info("=" * 80)
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Async main function."""
+ # Determine predictions path
+ target_date = date.fromisoformat(args.date) if args.date else date.today()
+ predictions_dir = Path(args.predictions_dir)
+
+ if args.predictions:
+ predictions_path = Path(args.predictions)
+ else:
+ predictions_path = predictions_dir / f"{target_date.isoformat()}.csv"
+
+ # Create monitor
+ monitor = LiveLineMonitor(
+ predictions_path=predictions_path,
+ min_edge=args.min_edge,
+ )
+
+ # Start monitoring
+ await monitor.monitor(duration_seconds=args.duration)
+
+
+def main() -> None:
+ """CLI entry point."""
+ parser = argparse.ArgumentParser(
+ description="Monitor live Overtime.ag lines with ML predictions"
+ )
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=None,
+ help="Target date for predictions (YYYY-MM-DD, default: today)",
+ )
+ parser.add_argument(
+ "--predictions",
+ type=str,
+ default=None,
+ help="Path to predictions CSV (default: data/outputs/predictions/YYYY-MM-DD.csv)",
+ )
+ parser.add_argument(
+ "--predictions-dir",
+ type=Path,
+ default=Path("data/outputs/predictions"),
+ help="Directory containing predictions (default: data/outputs/predictions/)",
+ )
+ parser.add_argument(
+ "--duration",
+ "-d",
+ type=int,
+ default=None,
+ help="Monitoring duration in seconds (default: indefinite)",
+ )
+ parser.add_argument(
+ "--min-edge",
+ type=float,
+ default=0.05,
+ help="Minimum edge for value alerts (default: 0.05 = 5%%)",
+ )
+
+ args = parser.parse_args()
+
+ try:
+ asyncio.run(main_async(args))
+ except ConfigurationError:
+ logger.error("Configuration error - check Chrome and overtime.ag setup")
+ sys.exit(1)
+ except FileNotFoundError as e:
+ logger.error(str(e))
+ sys.exit(1)
+ except Exception as e:
+ logger.exception(f"Monitor failed: {e}")
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/monitor_live_rich.py b/scripts/archive/2026-02-deprecated/monitor_live_rich.py
new file mode 100644
index 000000000..33e3db470
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/monitor_live_rich.py
@@ -0,0 +1,568 @@
+"""Monitor live Overtime.ag line movements with Rich formatting.
+
+Beautiful terminal display with colors, tables, and live updates for monitoring
+sports betting lines alongside ML predictions.
+
+Prerequisites:
+ 1. Chrome with remote debugging (port 9222)
+ 2. overtime.ag logged in, navigated to College Basketball
+ 3. Today's predictions generated
+
+Usage:
+ uv run python scripts/monitor_live_rich.py
+ uv run python scripts/monitor_live_rich.py --min-edge 0.10
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import sys
+from datetime import date, datetime
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from rich.console import Console
+from rich.layout import Layout
+from rich.live import Live
+from rich.panel import Panel
+from rich.table import Table
+from rich.text import Text
+
+# Add the project's src/ directory to the import path
+# (this archived copy sits three levels below the repo root)
+PROJECT_ROOT = Path(__file__).resolve().parents[3]
+sys.path.insert(0, str(PROJECT_ROOT / "src"))
+
+from sports_betting_edge.adapters.overtime_ag import ( # noqa: E402
+ OvertimeSignalRClient,
+)
+from sports_betting_edge.core.exceptions import ConfigurationError # noqa: E402
+from sports_betting_edge.core.types import MarketType # noqa: E402
+
+console = Console()
+
+
+class RichLineMonitor:
+ """Monitors live lines with Rich terminal display."""
+
+ def __init__(
+ self,
+ predictions_path: Path,
+ min_edge: float = 0.05,
+ ):
+ """Initialize the monitor.
+
+ Args:
+ predictions_path: Path to predictions CSV
+ min_edge: Minimum edge for value alerts
+ """
+ self.predictions_path = predictions_path
+ self.min_edge = min_edge
+ self.predictions = self._load_predictions()
+
+ # Tracking
+ self.recent_lines: list[dict[str, Any]] = []
+ self.max_recent = 15
+ self.total_changes = 0
+ self.steam_count = 0
+ self.value_alerts = 0
+ self.value_opportunities: list[dict[str, Any]] = []
+
+ # Track opening and current lines by game
+ # Key: (game_num, market_type)
+ self.opening_lines: dict[tuple[int, MarketType], dict[str, Any]] = {}
+ self.current_lines: dict[tuple[int, MarketType], dict[str, Any]] = {}
+
+ # Game metadata: game_num -> {team, game_time, etc}
+ self.games: dict[int, dict[str, Any]] = {}
+
+ def _load_predictions(self) -> pd.DataFrame:
+ """Load predictions."""
+ if not self.predictions_path.exists():
+ raise FileNotFoundError(
+ f"Predictions not found: {self.predictions_path}\n"
+ f"Run: uv run python scripts/predict_today.py"
+ )
+ return pd.read_csv(self.predictions_path)
+
+ def get_prediction_for_team(
+ self, team_name: str, market_type: MarketType
+ ) -> dict[str, Any] | None:
+ """Get prediction for team/game."""
+ games_df = self.predictions[
+ (self.predictions["home_team"] == team_name)
+ | (self.predictions["away_team"] == team_name)
+ | (self.predictions["favorite_team"] == team_name)
+ | (self.predictions["underdog_team"] == team_name)
+ ]
+
+ if len(games_df) == 0:
+ return None
+
+ game = games_df.iloc[0]
+
+ if market_type == MarketType.SPREAD:
+ return {
+ "favorite_team": game["favorite_team"],
+ "spread_magnitude": game["spread_magnitude"],
+ "favorite_cover_prob": game["favorite_cover_prob"],
+ "spread_edge": game["spread_edge"],
+ }
+ elif market_type == MarketType.TOTAL:
+ return {
+ "total_points": game["total_points"],
+ "over_prob": game["over_prob"],
+ "total_edge": game["total_edge"],
+ }
+ return None
+
+ def calculate_edge(self, model_prob: float, american_odds: int) -> float:
+ """Calculate edge."""
+ if american_odds < 0:
+ implied_prob = abs(american_odds) / (abs(american_odds) + 100)
+ else:
+ implied_prob = 100 / (american_odds + 100)
+ return model_prob - implied_prob
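`calculate_edge` converts American odds to an implied probability (vig included) and subtracts it from the model probability. A standalone sketch of the same arithmetic:

```python
def implied_prob(american_odds: int) -> float:
    """American odds -> implied probability (includes the book's vig)."""
    if american_odds < 0:
        return abs(american_odds) / (abs(american_odds) + 100)
    return 100 / (american_odds + 100)


def edge(model_prob: float, american_odds: int) -> float:
    """Model probability minus the market's implied probability."""
    return model_prob - implied_prob(american_odds)


print(round(implied_prob(-110), 4))  # 0.5238
print(round(implied_prob(150), 4))   # 0.4
print(round(edge(0.58, -110), 4))    # 0.0562
```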
+
+ def _get_market_label(self, change: Any) -> str:
+ """Determine proper market label based on market type, period, and line value.
+
+ Args:
+ change: Line change object
+
+ Returns:
+ Market label string (e.g., "SPREAD", "1H TOTAL", "TEAM TOTAL")
+ """
+ market_type = change.market_type
+ period = getattr(change, "period_number", 0)
+ line_points = change.line_points
+
+ # Period-based classification
+ if period == 1:
+ return "1H SPREAD" if market_type == MarketType.SPREAD else "1H TOTAL"
+ elif period == 2:
+ return "2H SPREAD" if market_type == MarketType.SPREAD else "2H TOTAL"
+
+ # Full game - check for team totals based on line value
+ if market_type == MarketType.SPREAD:
+ if line_points and line_points > 50:
+ return "TEAM TOTAL"
+ return "SPREAD"
+ elif market_type == MarketType.TOTAL:
+ if line_points and (line_points < 110 or line_points > 210):
+ return "TEAM TOTAL"
+ return "TOTAL"
+ elif market_type == MarketType.MONEYLINE:
+ return "ML"
+
+ return market_type.value
+
+ def create_header(self) -> Panel:
+ """Create header panel."""
+ header_text = Text()
+ header_text.append("LIVE LINE MONITOR", style="bold cyan")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Games: {len(self.predictions)}", style="yellow")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Changes: {self.total_changes}", style="green")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Steam: {self.steam_count}", style="red")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Value Alerts: {self.value_alerts}", style="magenta")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Edge: {self.min_edge:.1%}+", style="blue")
+
+ return Panel(
+ header_text,
+ border_style="cyan",
+ )
+
+ def create_recent_lines_table(self) -> Table:
+ """Create table of recent line changes."""
+ table = Table(title="Recent Line Movements", show_header=True, header_style="bold")
+ table.add_column("Time", style="cyan", width=8)
+ table.add_column("Team/Game", style="white", width=22)
+ table.add_column("Market", style="yellow", width=8)
+ table.add_column("Line", style="green", width=12)
+ table.add_column("Odds", style="magenta", width=15)
+ table.add_column("Steam", style="red", width=15)
+
+ for line in reversed(self.recent_lines[-self.max_recent :]):
+ time_str = line["time"]
+ team = line["team"][:22] if len(line["team"]) > 22 else line["team"]
+ market = line["market"]
+ line_str = line["line"]
+
+ # Format odds (both sides)
+ odds_str = ""
+ if "money1" in line and "money2" in line:
+ m1 = line["money1"]
+ m2 = line["money2"]
+ if m1 and m2:
+ odds_str = f"{m1:+d}/{m2:+d}"
+
+ steam_info = ""
+ if line["is_steam"]:
+ changed_by = line.get("changed_by", "AutoMover")
+ steam_info = f"[S] {changed_by[:10]}"
+
+ table.add_row(time_str, team, market, line_str, odds_str, steam_info)
+
+ return table
+
+ def create_value_table(self) -> Table:
+ """Create table of value opportunities."""
+ table = Table(title="Value Opportunities", show_header=True, header_style="bold green")
+ table.add_column("Time", style="cyan", width=8)
+ table.add_column("Game", style="white", width=30)
+ table.add_column("Market", style="yellow", width=10)
+ table.add_column("Side", style="green", width=12)
+ table.add_column("Edge", style="magenta", width=8)
+
+ for opp in reversed(self.value_opportunities[-10:]):
+ table.add_row(
+ opp["time"],
+ opp["game"][:30],
+ opp["market"],
+ opp["side"],
+ f"{opp['edge']:+.1%}",
+ )
+
+ if len(self.value_opportunities) == 0:
+ table.add_row("-", "No value opportunities detected yet", "-", "-", "-")
+
+ return table
+
+ def create_predictions_table(self) -> Table:
+ """Create table of top predictions."""
+ table = Table(
+ title="Today's Top Predictions",
+ show_header=True,
+ header_style="bold blue",
+ caption="Spread = P(Favorite Covers) | Total = P(Over)",
+ caption_style="dim italic",
+ )
+ table.add_column("Game", style="white", width=35)
+ table.add_column("Spread", style="yellow", width=15)
+ table.add_column("Total", style="green", width=15)
+
+ # Show games with strongest predictions
+ for _, game in self.predictions.head(8).iterrows():
+ game_str = f"{game['away_team'][:15]} @ {game['home_team'][:15]}"
+ spread_str = f"{game['favorite_cover_prob']:.0%}"
+ total_str = f"O {game['over_prob']:.0%}"
+
+ table.add_row(game_str, spread_str, total_str)
+
+ return table
+
+ def create_all_games_table(self) -> Table:
+ """Create table showing all tracked games with current and opening odds."""
+ table = Table(
+ title="Live Games - Current vs Opening (Movement)",
+ show_header=True,
+ header_style="bold cyan",
+ )
+ table.add_column("Rot#", style="cyan", width=7)
+ table.add_column("Team", style="white", width=16)
+ table.add_column("Time", style="magenta", width=10)
+ table.add_column("Spread", style="yellow", width=26)
+ table.add_column("Total", style="green", width=26)
+
+ if not self.games:
+ table.add_row("-", "No games tracked yet", "-", "-", "-")
+ return table
+
+ # Show all tracked games sorted by game_num
+ for game_num in sorted(self.games.keys()):
+ game_info = self.games[game_num]
+ team = game_info.get("team", f"Game {game_num}")
+
+ # Get rotation numbers
+ rot1 = game_info.get("team1_rot_num")
+ rot2 = game_info.get("team2_rot_num")
+ rot_str = f"{rot1}/{rot2}" if rot1 and rot2 else "-"
+
+ # Get game time
+ game_time = game_info.get("game_time")
+ time_str = game_time.strftime("%I:%M %p") if game_time else "-"
+
+ # Get current and opening for each market
+ spread_key = (game_num, MarketType.SPREAD)
+ total_key = (game_num, MarketType.TOTAL)
+
+ # Format spread with both sides and movement
+ spread_str = "-"
+ if spread_key in self.current_lines:
+ curr = self.current_lines[spread_key]
+ open_line = self.opening_lines.get(spread_key)
+ curr_pts = curr.get("line_points", 0) or 0
+ curr_m1 = curr.get("money1", 0) or 0
+ curr_m2 = curr.get("money2", 0) or 0
+
+ # Current line
+ spread_str = f"{curr_pts:+.1f} ({curr_m1:+d}/{curr_m2:+d})"
+
+ # Show movement if we have opening line
+ if open_line:
+ open_pts = open_line.get("line_points", 0) or 0
+ pts_diff = curr_pts - open_pts
+
+ if pts_diff != 0:
+ arrow = "↑" if pts_diff > 0 else "↓"
+ spread_str += f" {arrow}{abs(pts_diff):.1f}"
+ else:
+ spread_str += f" (open: {open_pts:+.1f})"
+
+ # Format total with both sides and movement
+ total_str = "-"
+ if total_key in self.current_lines:
+ curr = self.current_lines[total_key]
+ open_line = self.opening_lines.get(total_key)
+ curr_pts = curr.get("line_points", 0) or 0
+ curr_m1 = curr.get("money1", 0) or 0
+ curr_m2 = curr.get("money2", 0) or 0
+
+ # Current line
+ total_str = f"{curr_pts:.1f} (O:{curr_m1:+d}/U:{curr_m2:+d})"
+
+ # Show movement if we have opening line
+ if open_line:
+ open_pts = open_line.get("line_points", 0) or 0
+ pts_diff = curr_pts - open_pts
+
+ if pts_diff != 0:
+ arrow = "↑" if pts_diff > 0 else "↓"
+ total_str += f" {arrow}{abs(pts_diff):.1f}"
+ else:
+ total_str += f" (open: {open_pts:.1f})"
+
+ table.add_row(
+ rot_str,
+ team[:16],
+ time_str,
+ spread_str,
+ total_str,
+ )
+
+ return table
+
+ def generate_layout(self) -> Layout:
+ """Generate rich layout."""
+ layout = Layout()
+
+ layout.split_column(
+ Layout(name="header", size=3),
+ Layout(name="main"),
+ )
+
+ layout["header"].update(self.create_header())
+
+ layout["main"].split_column(
+ Layout(name="top", ratio=2),
+ Layout(name="bottom", ratio=3),
+ )
+
+ # Top: All games table
+ layout["top"].update(Panel(self.create_all_games_table()))
+
+ # Bottom: Recent changes and value opportunities
+ layout["bottom"].split_row(
+ Layout(name="left", ratio=2),
+ Layout(name="right", ratio=1),
+ )
+
+ layout["bottom"]["left"].split_column(
+ Layout(name="recent", ratio=2),
+ Layout(name="value", ratio=1),
+ )
+
+ layout["bottom"]["left"]["recent"].update(Panel(self.create_recent_lines_table()))
+ layout["bottom"]["left"]["value"].update(Panel(self.create_value_table()))
+ layout["bottom"]["right"].update(Panel(self.create_predictions_table()))
+
+ return layout
+
+ async def monitor(self, duration_seconds: int | None = None) -> None:
+ """Start monitoring with Rich live display."""
+ console.clear()
+ console.print("[bold cyan]Starting Live Monitor...[/bold cyan]")
+ console.print(f"Loaded predictions for {len(self.predictions)} games")
+ console.print(f"Value threshold: {self.min_edge:.1%}\n")
+
+ with Live(self.generate_layout(), refresh_per_second=2, console=console) as live:
+ try:
+ async with OvertimeSignalRClient() as client:
+ async for change in client.stream_line_changes(duration_seconds):
+ # Filter out team totals and derivatives
+ if change.line_points:
+ # Skip spreads > 50 (team totals misclassified as spreads)
+ if change.market_type == MarketType.SPREAD and change.line_points > 50:
+ continue
+
+ # Skip totals outside normal game total range (110-210)
+ # This filters: team totals (~70-90), 1H totals (~60-80), quarters, etc.
+ if change.market_type == MarketType.TOTAL and (
+ change.line_points < 110 or change.line_points > 210
+ ):
+ continue
+
+ self.total_changes += 1
+
+ if change.is_steam:
+ self.steam_count += 1
+
+ # Track game metadata
+ if change.game_num not in self.games:
+ self.games[change.game_num] = {
+ "team": change.team or f"Game {change.game_num}",
+ "game_time": change.game_time,
+ "team1_rot_num": change.team1_rot_num,
+ "team2_rot_num": change.team2_rot_num,
+ }
+
+ # Track opening and current lines
+ line_key = (change.game_num, change.market_type)
+ line_data = {
+ "line_points": change.line_points,
+ "money1": change.money1,
+ "money2": change.money2,
+ "timestamp": change.timestamp,
+ }
+
+ # Set opening line if first time seeing this game/market
+ if line_key not in self.opening_lines:
+ self.opening_lines[line_key] = line_data
+
+ # Always update current line
+ self.current_lines[line_key] = line_data
+
+ # Determine market display label
+ market_label = self._get_market_label(change)
+
+ # Add to recent lines
+ team_display = change.team or f"Game#{change.game_num}"
+ line_display = f"{change.line_points:+.1f}" if change.line_points else "ML"
+
+ self.recent_lines.append(
+ {
+ "time": datetime.now().strftime("%H:%M:%S"),
+ "team": team_display,
+ "market": market_label,
+ "line": line_display,
+ "money1": change.money1,
+ "money2": change.money2,
+ "is_steam": change.is_steam,
+ "changed_by": change.changed_by,
+ }
+ )
+
+ # Check for value
+ if change.team:
+ prediction = self.get_prediction_for_team(
+ change.team, change.market_type
+ )
+
+ if prediction:
+ edge = 0.0
+ side = ""
+ game_str = ""
+
+ if change.market_type == MarketType.SPREAD:
+ if change.team == prediction["favorite_team"] and change.money1:
+ edge = self.calculate_edge(
+ prediction["favorite_cover_prob"],
+ change.money1,
+ )
+ side = "Favorite"
+ elif change.money2:
+ edge = self.calculate_edge(
+ 1 - prediction["favorite_cover_prob"],
+ change.money2,
+ )
+ side = "Underdog"
+ game_str = f"{change.team} {change.line_points:+.1f}"
+
+ elif change.market_type == MarketType.TOTAL:
+ if change.money1 and change.money2:
+ over_edge = self.calculate_edge(
+ prediction["over_prob"], change.money1
+ )
+ under_edge = self.calculate_edge(
+ 1 - prediction["over_prob"], change.money2
+ )
+
+ if abs(over_edge) > abs(under_edge):
+ edge = over_edge
+ side = "Over"
+ else:
+ edge = under_edge
+ side = "Under"
+ game_str = f"{change.team} {prediction['total_points']}"
+
+ # Value alert
+ if abs(edge) >= self.min_edge:
+ self.value_alerts += 1
+ self.value_opportunities.append(
+ {
+ "time": datetime.now().strftime("%H:%M:%S"),
+ "game": game_str,
+ "market": change.market_type.value,
+ "side": side,
+ "edge": edge,
+ }
+ )
+
+ # Update display
+ live.update(self.generate_layout())
+
+ except ConfigurationError as e:
+ console.print(f"[bold red]Error:[/bold red] {e}")
+ console.print("Make sure Chrome and overtime.ag are running")
+ except KeyboardInterrupt:
+ console.print("\n[yellow]Monitoring stopped[/yellow]")
+
+ # Final summary
+ console.print("\n[bold cyan]Monitoring Summary[/bold cyan]")
+ console.print(f"Total changes: {self.total_changes}")
+ console.print(f"Steam moves: {self.steam_count}")
+ console.print(f"Value alerts: {self.value_alerts}")
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Async main."""
+ target_date = date.fromisoformat(args.date) if args.date else date.today()
+ predictions_path = (
+ Path(args.predictions)
+ if args.predictions
+ else Path("data/outputs/predictions") / f"{target_date.isoformat()}.csv"
+ )
+
+ monitor = RichLineMonitor(
+ predictions_path=predictions_path,
+ min_edge=args.min_edge,
+ )
+
+ await monitor.monitor(duration_seconds=args.duration)
+
+
+def main() -> None:
+ """CLI entry point."""
+ parser = argparse.ArgumentParser(description="Live monitor with Rich formatting")
+ parser.add_argument("--date", type=str, default=None)
+ parser.add_argument("--predictions", type=str, default=None)
+ parser.add_argument("--duration", "-d", type=int, default=None)
+ parser.add_argument("--min-edge", type=float, default=0.05)
+
+ args = parser.parse_args()
+
+ try:
+ asyncio.run(main_async(args))
+ except Exception as e:
+ console.print(f"[bold red]Error:[/bold red] {e}", style="red")
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/monitor_live_with_logging.py b/scripts/archive/2026-02-deprecated/monitor_live_with_logging.py
new file mode 100644
index 000000000..575a335a1
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/monitor_live_with_logging.py
@@ -0,0 +1,603 @@
+"""Monitor live Overtime.ag line movements with Rich display AND file logging.
+
+This version writes all line movements to a log file while also displaying
+the Rich terminal interface. Perfect for running in the background or reviewing later.
+
+Prerequisites:
+ 1. Chrome with remote debugging (port 9222)
+ 2. overtime.ag logged in, navigated to College Basketball
+ 3. Today's predictions generated
+
+Usage:
+ uv run python scripts/monitor_live_with_logging.py
+ uv run python scripts/monitor_live_with_logging.py --min-edge 0.10 --log-file my_lines.log
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import sys
+from datetime import date, datetime
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from rich.console import Console
+from rich.layout import Layout
+from rich.live import Live
+from rich.panel import Panel
+from rich.table import Table
+from rich.text import Text
+
+# Add the project's src/ directory to the import path
+# (this archived copy sits three levels below the repo root)
+PROJECT_ROOT = Path(__file__).resolve().parents[3]
+sys.path.insert(0, str(PROJECT_ROOT / "src"))
+
+from sports_betting_edge.adapters.overtime_ag import ( # noqa: E402
+ OvertimeSignalRClient,
+)
+from sports_betting_edge.core.exceptions import ConfigurationError # noqa: E402
+from sports_betting_edge.core.types import MarketType # noqa: E402
+
+console = Console()
+
+
+class LoggingLineMonitor:
+ """Monitors live lines with Rich terminal display AND file logging."""
+
+ def __init__(
+ self,
+ predictions_path: Path,
+ min_edge: float = 0.05,
+ log_file: Path | None = None,
+ ):
+ """Initialize the monitor.
+
+ Args:
+ predictions_path: Path to predictions CSV
+ min_edge: Minimum edge for value alerts
+ log_file: Path to log file (default: logs/line_movements_{date}.log)
+ """
+ self.predictions_path = predictions_path
+ self.min_edge = min_edge
+ self.predictions = self._load_predictions()
+
+ # Set up log file
+ if log_file is None:
+ log_dir = PROJECT_ROOT / "data" / "logs"
+ log_dir.mkdir(parents=True, exist_ok=True)
+ log_file = log_dir / f"line_movements_{date.today().isoformat()}.log"
+ self.log_file = Path(log_file)
+ self.log_file.parent.mkdir(parents=True, exist_ok=True)
+
+ # Initialize log file with header
+ with open(self.log_file, "w") as f:
+ f.write(f"LINE MOVEMENT LOG - {datetime.now().isoformat()}\n")
+ f.write(f"Predictions: {predictions_path}\n")
+ f.write(f"Min Edge: {min_edge:.1%}\n")
+ f.write("=" * 100 + "\n\n")
+
+ # Tracking
+ self.recent_lines: list[dict[str, Any]] = []
+ self.max_recent = 15
+ self.total_changes = 0
+ self.steam_count = 0
+ self.value_alerts = 0
+ self.value_opportunities: list[dict[str, Any]] = []
+
+ # Track opening and current lines by game
+ self.opening_lines: dict[tuple[int, MarketType], dict[str, Any]] = {}
+ self.current_lines: dict[tuple[int, MarketType], dict[str, Any]] = {}
+ self.games: dict[int, dict[str, Any]] = {}
+
+ def _load_predictions(self) -> pd.DataFrame:
+ """Load predictions."""
+ if not self.predictions_path.exists():
+ raise FileNotFoundError(
+ f"Predictions not found: {self.predictions_path}\n"
+ f"Run: uv run python scripts/predict_today.py"
+ )
+ return pd.read_csv(self.predictions_path)
+
+ def _log_line_change(self, change: Any, edge: float = 0.0, side: str = "") -> None:
+ """Log line change to file."""
+ timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+ game_id = change.game_num
+ team = change.team or f"Game#{game_id}"
+ market = change.market_type.value if change.market_type else "unknown"
+ line = f"{change.line_points:+.1f}" if change.line_points else "ML"
+ odds = f"{change.money1}/{change.money2}" if change.money1 and change.money2 else "-"
+ steam = "[STEAM]" if change.is_steam else ""
+
+ log_line = f"{timestamp} | {team:30} | {market:8} | {line:7} | {odds:12} | {steam:8}"
+
+ if abs(edge) >= self.min_edge:
+ log_line += f" | VALUE: {side} {edge:+.1%}"
+
+ with open(self.log_file, "a") as f:
+ f.write(log_line + "\n")
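The records written above are pipe-delimited with fixed-width fields, so the log stays easy to scan or grep. With hypothetical values, a record is built like this (field widths 30/8/7/12/8 match the f-string in `_log_line_change`):

```python
# Hypothetical values illustrating the record layout written by the monitor.
timestamp = "2026-02-01 18:45:12"
team, market, line = "Duke", "spread", "-6.5"
odds, steam = "-110/-105", "[STEAM]"

# Fixed-width, pipe-delimited record; one line per movement.
record = f"{timestamp} | {team:30} | {market:8} | {line:7} | {odds:12} | {steam:8}"
print(record)
```

Fixed widths keep columns aligned across lines, so tools like `awk -F'|'` can split fields reliably.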
+
+ def _log_summary(self, message: str) -> None:
+ """Log summary message to file."""
+ timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+ with open(self.log_file, "a") as f:
+ f.write(f"\n{timestamp} | {message}\n")
+
+ def get_prediction_for_team(
+ self, team_name: str, market_type: MarketType
+ ) -> dict[str, Any] | None:
+ """Get prediction for team/game."""
+ games_df = self.predictions[
+ (self.predictions["home_team"] == team_name)
+ | (self.predictions["away_team"] == team_name)
+ | (self.predictions["favorite_team"] == team_name)
+ | (self.predictions["underdog_team"] == team_name)
+ ]
+
+ if len(games_df) == 0:
+ return None
+
+ game = games_df.iloc[0]
+
+ if market_type == MarketType.SPREAD:
+ return {
+ "favorite_team": game["favorite_team"],
+ "spread_magnitude": game["spread_magnitude"],
+ "favorite_cover_prob": game["favorite_cover_prob"],
+ "spread_edge": game["spread_edge"],
+ }
+ elif market_type == MarketType.TOTAL:
+ return {
+ "total_points": game["total_points"],
+ "over_prob": game["over_prob"],
+ "total_edge": game["total_edge"],
+ }
+ return None
+
+ def calculate_edge(self, model_prob: float, american_odds: int) -> float:
+ """Calculate edge."""
+ if american_odds < 0:
+ implied_prob = abs(american_odds) / (abs(american_odds) + 100)
+ else:
+ implied_prob = 100 / (american_odds + 100)
+ return model_prob - implied_prob
+
+ def _get_market_label(self, change: Any) -> str:
+ """Determine proper market label based on market type, period, and line value."""
+ market_type = change.market_type
+ period = getattr(change, "period_number", 0)
+ line_points = change.line_points
+
+ if period == 1:
+ return "1H SPREAD" if market_type == MarketType.SPREAD else "1H TOTAL"
+ elif period == 2:
+ return "2H SPREAD" if market_type == MarketType.SPREAD else "2H TOTAL"
+
+ if market_type == MarketType.SPREAD:
+ if line_points and line_points > 50:
+ return "TEAM TOTAL"
+ return "SPREAD"
+ elif market_type == MarketType.TOTAL:
+ if line_points and (line_points < 110 or line_points > 210):
+ return "TEAM TOTAL"
+ return "TOTAL"
+ elif market_type == MarketType.MONEYLINE:
+ return "ML"
+
+ return market_type.value
+
+ def create_header(self) -> Panel:
+ """Create header panel."""
+ header_text = Text()
+ header_text.append("LIVE LINE MONITOR", style="bold cyan")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Games: {len(self.predictions)}", style="yellow")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Changes: {self.total_changes}", style="green")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Steam: {self.steam_count}", style="red")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Value Alerts: {self.value_alerts}", style="magenta")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Edge: {self.min_edge:.1%}+", style="blue")
+ header_text.append(" | ", style="white")
+ header_text.append(f"Log: {self.log_file.name}", style="dim")
+
+ return Panel(header_text, border_style="cyan")
+
+ def create_recent_lines_table(self) -> Table:
+ """Create table of recent line changes."""
+ table = Table(title="Recent Line Movements", show_header=True, header_style="bold")
+ table.add_column("Time", style="cyan", width=8)
+ table.add_column("Team/Game", style="white", width=22)
+ table.add_column("Market", style="yellow", width=8)
+ table.add_column("Line", style="green", width=12)
+ table.add_column("Odds", style="magenta", width=15)
+ table.add_column("Steam", style="red", width=15)
+
+ for line in reversed(self.recent_lines[-self.max_recent :]):
+ time_str = line["time"]
+ team = line["team"][:22] if len(line["team"]) > 22 else line["team"]
+ market = line["market"]
+ line_str = line["line"]
+
+ odds_str = ""
+ if "money1" in line and "money2" in line:
+ m1 = line["money1"]
+ m2 = line["money2"]
+ if m1 and m2:
+ odds_str = f"{m1:+d}/{m2:+d}"
+
+ steam_info = ""
+ if line["is_steam"]:
+ changed_by = line.get("changed_by", "AutoMover")
+ steam_info = f"[S] {changed_by[:10]}"
+
+ table.add_row(time_str, team, market, line_str, odds_str, steam_info)
+
+ return table
+
+ def create_value_table(self) -> Table:
+ """Create table of value opportunities."""
+ table = Table(title="Value Opportunities", show_header=True, header_style="bold green")
+ table.add_column("Time", style="cyan", width=8)
+ table.add_column("Game", style="white", width=30)
+ table.add_column("Market", style="yellow", width=10)
+ table.add_column("Side", style="green", width=12)
+ table.add_column("Edge", style="magenta", width=8)
+
+ for opp in reversed(self.value_opportunities[-10:]):
+ table.add_row(
+ opp["time"],
+ opp["game"][:30],
+ opp["market"],
+ opp["side"],
+ f"{opp['edge']:+.1%}",
+ )
+
+ if len(self.value_opportunities) == 0:
+ table.add_row("-", "No value opportunities detected yet", "-", "-", "-")
+
+ return table
+
+ def create_predictions_table(self) -> Table:
+ """Create table of top predictions."""
+ table = Table(
+ title="Today's Top Predictions",
+ show_header=True,
+ header_style="bold blue",
+ caption="Spread = P(Favorite Covers) | Total = P(Over)",
+ caption_style="dim italic",
+ )
+ table.add_column("Game", style="white", width=35)
+ table.add_column("Spread", style="yellow", width=15)
+ table.add_column("Total", style="green", width=15)
+
+ for _, game in self.predictions.head(8).iterrows():
+ game_str = f"{game['away_team'][:15]} @ {game['home_team'][:15]}"
+ spread_str = f"{game['favorite_cover_prob']:.0%}"
+ total_str = f"O {game['over_prob']:.0%}"
+
+ table.add_row(game_str, spread_str, total_str)
+
+ return table
+
+ def create_all_games_table(self) -> Table:
+ """Create table showing all tracked games with current and opening odds."""
+ table = Table(
+ title="Live Games - Current vs Opening (Movement)",
+ show_header=True,
+ header_style="bold cyan",
+ )
+ table.add_column("Rot#", style="cyan", width=7)
+ table.add_column("Team", style="white", width=16)
+ table.add_column("Time", style="magenta", width=10)
+ table.add_column("Spread", style="yellow", width=26)
+ table.add_column("Total", style="green", width=26)
+
+ if not self.games:
+ table.add_row("-", "No games tracked yet", "-", "-", "-")
+ return table
+
+ for game_num in sorted(self.games.keys()):
+ game_info = self.games[game_num]
+ team = game_info.get("team", f"Game {game_num}")
+
+ rot1 = game_info.get("team1_rot_num")
+ rot2 = game_info.get("team2_rot_num")
+ rot_str = f"{rot1}/{rot2}" if rot1 and rot2 else "-"
+
+ game_time = game_info.get("game_time")
+ time_str = game_time.strftime("%I:%M %p") if game_time else "-"
+
+ spread_key = (game_num, MarketType.SPREAD)
+ total_key = (game_num, MarketType.TOTAL)
+
+ spread_str = "-"
+ if spread_key in self.current_lines:
+ curr = self.current_lines[spread_key]
+ open_line = self.opening_lines.get(spread_key)
+ curr_pts = curr.get("line_points", 0) or 0
+ curr_m1 = curr.get("money1", 0) or 0
+ curr_m2 = curr.get("money2", 0) or 0
+
+ spread_str = f"{curr_pts:+.1f} ({curr_m1:+d}/{curr_m2:+d})"
+
+ if open_line:
+ open_pts = open_line.get("line_points", 0) or 0
+ pts_diff = curr_pts - open_pts
+
+ if pts_diff != 0:
+ arrow = "UP" if pts_diff > 0 else "DN"
+ spread_str += f" {arrow}{abs(pts_diff):.1f}"
+ else:
+ spread_str += f" (open: {open_pts:+.1f})"
+
+ total_str = "-"
+ if total_key in self.current_lines:
+ curr = self.current_lines[total_key]
+ open_line = self.opening_lines.get(total_key)
+ curr_pts = curr.get("line_points", 0) or 0
+ curr_m1 = curr.get("money1", 0) or 0
+ curr_m2 = curr.get("money2", 0) or 0
+
+ total_str = f"{curr_pts:.1f} (O:{curr_m1:+d}/U:{curr_m2:+d})"
+
+ if open_line:
+ open_pts = open_line.get("line_points", 0) or 0
+ pts_diff = curr_pts - open_pts
+
+ if pts_diff != 0:
+ arrow = "UP" if pts_diff > 0 else "DN"
+ total_str += f" {arrow}{abs(pts_diff):.1f}"
+ else:
+ total_str += f" (open: {open_pts:.1f})"
+
+ table.add_row(
+ rot_str,
+ team[:16],
+ time_str,
+ spread_str,
+ total_str,
+ )
+
+ return table
+
+ def generate_layout(self) -> Layout:
+ """Generate rich layout."""
+ layout = Layout()
+
+ layout.split_column(
+ Layout(name="header", size=3),
+ Layout(name="main"),
+ )
+
+ layout["header"].update(self.create_header())
+
+ layout["main"].split_column(
+ Layout(name="top", ratio=2),
+ Layout(name="bottom", ratio=3),
+ )
+
+ layout["top"].update(Panel(self.create_all_games_table()))
+
+ layout["bottom"].split_row(
+ Layout(name="left", ratio=2),
+ Layout(name="right", ratio=1),
+ )
+
+ layout["bottom"]["left"].split_column(
+ Layout(name="recent", ratio=2),
+ Layout(name="value", ratio=1),
+ )
+
+ layout["bottom"]["left"]["recent"].update(Panel(self.create_recent_lines_table()))
+ layout["bottom"]["left"]["value"].update(Panel(self.create_value_table()))
+ layout["bottom"]["right"].update(Panel(self.create_predictions_table()))
+
+ return layout
+
+ async def monitor(self, duration_seconds: int | None = None) -> None:
+ """Start monitoring with Rich live display and file logging."""
+ console.clear()
+ console.print("[bold cyan]Starting Live Monitor with Logging...[/bold cyan]")
+ console.print(f"Loaded predictions for {len(self.predictions)} games")
+ console.print(f"Value threshold: {self.min_edge:.1%}")
+ console.print(f"[bold green]Logging to: {self.log_file}[/bold green]\n")
+
+ self._log_summary(
+ f"Monitor started | Games: {len(self.predictions)} | Min Edge: {self.min_edge:.1%}"
+ )
+
+ with Live(self.generate_layout(), refresh_per_second=2, console=console) as live:
+ try:
+ async with OvertimeSignalRClient() as client:
+ async for change in client.stream_line_changes(duration_seconds):
+ # Skip team totals and derivative markets: full-game spreads stay under 50 points and game totals within 110-210
+ if change.line_points:
+ if change.market_type == MarketType.SPREAD and change.line_points > 50:
+ continue
+ if change.market_type == MarketType.TOTAL and (
+ change.line_points < 110 or change.line_points > 210
+ ):
+ continue
+
+ self.total_changes += 1
+
+ if change.is_steam:
+ self.steam_count += 1
+
+ # Track game metadata
+ if change.game_num not in self.games:
+ self.games[change.game_num] = {
+ "team": change.team or f"Game {change.game_num}",
+ "game_time": change.game_time,
+ "team1_rot_num": change.team1_rot_num,
+ "team2_rot_num": change.team2_rot_num,
+ }
+
+ # Track opening and current lines
+ line_key = (change.game_num, change.market_type)
+ line_data = {
+ "line_points": change.line_points,
+ "money1": change.money1,
+ "money2": change.money2,
+ "timestamp": change.timestamp,
+ }
+
+ if line_key not in self.opening_lines:
+ self.opening_lines[line_key] = line_data
+
+ self.current_lines[line_key] = line_data
+
+ market_label = self._get_market_label(change)
+ team_display = change.team or f"Game#{change.game_num}"
+ line_display = f"{change.line_points:+.1f}" if change.line_points else "ML"
+
+ self.recent_lines.append(
+ {
+ "time": datetime.now().strftime("%H:%M:%S"),
+ "team": team_display,
+ "market": market_label,
+ "line": line_display,
+ "money1": change.money1,
+ "money2": change.money2,
+ "is_steam": change.is_steam,
+ "changed_by": change.changed_by,
+ }
+ )
+
+ # Check for value and log
+ edge = 0.0
+ side = ""
+ if change.team:
+ prediction = self.get_prediction_for_team(
+ change.team, change.market_type
+ )
+
+ if prediction:
+ if change.market_type == MarketType.SPREAD:
+ if change.team == prediction["favorite_team"] and change.money1:
+ edge = self.calculate_edge(
+ prediction["favorite_cover_prob"],
+ change.money1,
+ )
+ side = "Favorite"
+ elif change.money2:
+ edge = self.calculate_edge(
+ 1 - prediction["favorite_cover_prob"],
+ change.money2,
+ )
+ side = "Underdog"
+
+ elif (
+ change.market_type == MarketType.TOTAL
+ and change.money1
+ and change.money2
+ ):
+ over_edge = self.calculate_edge(
+ prediction["over_prob"], change.money1
+ )
+ under_edge = self.calculate_edge(
+ 1 - prediction["over_prob"], change.money2
+ )
+
+ if abs(over_edge) > abs(under_edge):
+ edge = over_edge
+ side = "Over"
+ else:
+ edge = under_edge
+ side = "Under"
+
+ if abs(edge) >= self.min_edge:
+ self.value_alerts += 1
+ game_str = (
+ f"{change.team} {change.line_points:+.1f}"
+ if change.market_type == MarketType.SPREAD
+ else f"{change.team} {prediction.get('total_points', 0)}"
+ )
+ self.value_opportunities.append(
+ {
+ "time": datetime.now().strftime("%H:%M:%S"),
+ "game": game_str,
+ "market": change.market_type.value,
+ "side": side,
+ "edge": edge,
+ }
+ )
+
+ # Log to file
+ self._log_line_change(change, edge, side)
+
+ # Update display
+ live.update(self.generate_layout())
+
+ except ConfigurationError as e:
+ console.print(f"[bold red]Error:[/bold red] {e}")
+ console.print("Make sure Chrome and overtime.ag are running")
+ self._log_summary(f"ERROR: {e}")
+ except KeyboardInterrupt:
+ console.print("\n[yellow]Monitoring stopped[/yellow]")
+ self._log_summary("Monitor stopped by user")
+
+ # Final summary
+ console.print("\n[bold cyan]Monitoring Summary[/bold cyan]")
+ console.print(f"Total changes: {self.total_changes}")
+ console.print(f"Steam moves: {self.steam_count}")
+ console.print(f"Value alerts: {self.value_alerts}")
+ console.print(f"[bold green]Log saved to: {self.log_file}[/bold green]")
+
+ self._log_summary(
+ f"Monitor ended | Changes: {self.total_changes} | "
+ f"Steam: {self.steam_count} | Value: {self.value_alerts}"
+ )
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Async main."""
+ target_date = date.fromisoformat(args.date) if args.date else date.today()
+ predictions_path = (
+ Path(args.predictions)
+ if args.predictions
+ else Path("predictions") / f"{target_date.isoformat()}.csv"
+ )
+
+ log_file = Path(args.log_file) if args.log_file else None
+
+ monitor = LoggingLineMonitor(
+ predictions_path=predictions_path,
+ min_edge=args.min_edge,
+ log_file=log_file,
+ )
+
+ await monitor.monitor(duration_seconds=args.duration)
+
+
+def main() -> None:
+ """CLI entry point."""
+ parser = argparse.ArgumentParser(
+ description="Live monitor with Rich formatting and file logging"
+ )
+ parser.add_argument("--date", type=str, default=None, help="Date for predictions")
+ parser.add_argument("--predictions", type=str, default=None, help="Path to predictions CSV")
+ parser.add_argument("--duration", "-d", type=int, default=None, help="Duration in seconds")
+ parser.add_argument("--min-edge", type=float, default=0.05, help="Minimum edge threshold")
+ parser.add_argument(
+ "--log-file",
+ type=str,
+ default=None,
+ help="Path to log file (default: data/logs/line_movements_{date}.log)",
+ )
+
+ args = parser.parse_args()
+
+ try:
+ asyncio.run(main_async(args))
+ except Exception as e:
+ console.print(f"[bold red]Error:[/bold red] {e}", style="red")
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
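`calculate_edge` is called throughout the monitor loop but is defined outside this hunk. A minimal sketch consistent with its two-argument call sites (model probability, American odds) and with the repo rule of converting American odds to implied probability for comparison; the bodies below are assumptions, not the project's actual implementation:

```python
def american_to_implied_prob(odds: int) -> float:
    """Convert American odds to the implied win probability (vig included)."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def calculate_edge(model_prob: float, odds: int) -> float:
    """Edge = model probability minus the market's implied probability."""
    return model_prob - american_to_implied_prob(odds)
```

For example, a 55% model probability against a -110 price (implied ~52.4%) yields roughly a 2.6% edge, just above the default `--min-edge 0.05` threshold only if doubled.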
diff --git a/scripts/archive/2026-02-deprecated/overtime_ag_scraper.py b/scripts/archive/2026-02-deprecated/overtime_ag_scraper.py
new file mode 100644
index 000000000..4ba48c0a0
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/overtime_ag_scraper.py
@@ -0,0 +1,471 @@
+"""
+Overtime.ag Sports Betting Data Scraper
+
+Demonstrates two approaches for scraping overtime.ag:
+1. REST API - Simple HTTP requests to get current odds
+2. WebSocket - Real-time updates via SignalR
+
+SignalR Hub Details (discovered via chrome-devtools MCP server):
+- Hub name: "gbshub" (lowercase in connection, gbsHub in proxy)
+- Available methods: subscribeSport, subscribeSports, getGame, getGameLines, etc.
+- Subscription format: { SportType, SportSubType, Store, Type }
+- Type: 1 = sport updates, 2 = contest updates
+"""
+
+import asyncio
+import json
+import logging
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+from urllib.parse import quote
+
+import httpx
+import pandas as pd
+import websockets
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+# Data storage paths
+DATA_DIR = Path("data/overtime_ag")
+SNAPSHOTS_DIR = DATA_DIR / "snapshots"
+LIVE_UPDATES_DIR = DATA_DIR / "live_updates"
+
+
+def ensure_data_dirs() -> None:
+ """Create data directories if they don't exist."""
+ SNAPSHOTS_DIR.mkdir(parents=True, exist_ok=True)
+ LIVE_UPDATES_DIR.mkdir(parents=True, exist_ok=True)
+ logger.info(f"Data directories ready: {DATA_DIR}")
+
+
+def save_snapshot(data: dict[str, Any], sport_type: str, sport_subtype: str) -> Path | None:
+ """Save REST API snapshot to Parquet; return None when there are no game lines."""
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+ game_lines = data.get("GameLines", [])
+
+ if not game_lines:
+ logger.warning("No game lines to save")
+ return None
+
+ # Convert to DataFrame
+ df = pd.DataFrame(game_lines)
+
+ # Add metadata columns
+ df["captured_at"] = datetime.now()
+ df["sport_type"] = sport_type
+ df["sport_subtype"] = sport_subtype
+ df["source"] = "overtime_ag_rest"
+
+ # Save to Parquet
+ filename = f"{sport_subtype.replace(' ', '_')}_{timestamp}.parquet"
+ filepath = SNAPSHOTS_DIR / filename
+ df.to_parquet(filepath, index=False)
+
+ logger.info(f"Saved {len(df)} games to {filepath}")
+ return filepath
+
+
+class LiveUpdateWriter:
+ """Handles writing live updates to JSONL file for a session."""
+
+ def __init__(self) -> None:
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+ filename = f"live_updates_{timestamp}.jsonl"
+ self.filepath = LIVE_UPDATES_DIR / filename
+ self.update_count = 0
+
+ def write(self, update: dict[str, Any], sport_type: str, sport_subtype: str) -> None:
+ """Append update to JSONL file."""
+ # Add metadata
+ update["captured_at"] = datetime.now().isoformat()
+ update["sport_type"] = sport_type
+ update["sport_subtype"] = sport_subtype
+ update["source"] = "overtime_ag_websocket"
+
+ # Append to JSONL
+ with open(self.filepath, "a") as f:
+ f.write(json.dumps(update) + "\n")
+
+ self.update_count += 1
+ logger.debug(f"Appended update #{self.update_count} to {self.filepath}")
+
+
+class OvertimeAgRESTScraper:
+ """Scraper using REST API endpoints."""
+
+ BASE_URL = "https://www.overtime.ag"
+ API_BASE = f"{BASE_URL}/sports/Api"
+
+ def __init__(self) -> None:
+ self.client = httpx.AsyncClient(
+ headers={
+ "Accept": "application/json, text/plain, */*",
+ "Content-Type": "application/json",
+ "Origin": self.BASE_URL,
+ "Referer": f"{self.BASE_URL}/sports",
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+ }
+ )
+
+ async def get_sport_offering(
+ self,
+ sport_type: str,
+ sport_subtype: str,
+ wager_type: str = "Straight Bet",
+ ) -> dict[str, Any]:
+ """
+ Get current odds for a sport.
+
+ Args:
+ sport_type: Sport category (e.g., "Basketball", "Football")
+ sport_subtype: Specific league (e.g., "College Basketball", "NFL")
+ wager_type: Type of wager (default: "Straight Bet")
+
+ Returns:
+ Dictionary containing game lines and odds data
+ """
+ url = f"{self.API_BASE}/Offering.asmx/GetSportOffering"
+
+ payload = {
+ "sportType": sport_type,
+ "sportSubType": sport_subtype,
+ "wagerType": wager_type,
+ "hoursAdjustment": 0,
+ "periodNumber": None,
+ "gameNum": None,
+ "parentGameNum": None,
+ "teaserName": "",
+ "requestMode": None,
+ }
+
+ logger.info(f"Fetching {sport_type} - {sport_subtype} odds...")
+ response = await self.client.post(url, json=payload)
+ response.raise_for_status()
+
+ data = response.json()
+ return data.get("d", {}).get("Data", {})
+
+ async def get_available_sports(self) -> list[dict[str, Any]]:
+ """Get list of available sports."""
+ url = f"{self.API_BASE}/Offering.asmx/GetSports"
+ response = await self.client.post(url, json={})
+ response.raise_for_status()
+ return response.json().get("d", {}).get("Data", [])
+
+ async def close(self) -> None:
+ """Close HTTP client."""
+ await self.client.aclose()
+
+
+class OvertimeAgWebSocketScraper:
+ """Scraper using SignalR WebSocket for real-time updates."""
+
+ SIGNALR_BASE = "https://ws.ticosports.com/signalr"
+ WS_BASE = "wss://ws.ticosports.com/signalr"
+
+ def __init__(self) -> None:
+ self.connection_token: str | None = None
+ self.connection_id: str | None = None
+ self.client = httpx.AsyncClient(
+ headers={
+ "Accept": "text/plain, */*; q=0.01",
+ "Origin": "https://www.overtime.ag",
+ "Referer": "https://www.overtime.ag/",
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+ }
+ )
+
+ async def negotiate(self) -> dict[str, Any]:
+ """Negotiate SignalR connection and get connection token."""
+ url = f"{self.SIGNALR_BASE}/negotiate"
+ params = {
+ "clientProtocol": "1.5",
+ "connectionData": '[{"name":"gbshub"}]',
+ "_": str(int(datetime.now().timestamp() * 1000)),
+ }
+
+ logger.info("Negotiating SignalR connection...")
+ response = await self.client.get(url, params=params)
+ response.raise_for_status()
+
+ data = response.json()
+ self.connection_token = data["ConnectionToken"]
+ self.connection_id = data["ConnectionId"]
+
+ logger.info(f"Connection ID: {self.connection_id}")
+ return data
+
+ async def start_connection(self) -> dict[str, Any]:
+ """Start SignalR connection."""
+ if not self.connection_token:
+ raise ValueError("Must call negotiate() first")
+
+ url = f"{self.SIGNALR_BASE}/start"
+ params = {
+ "transport": "webSockets",
+ "clientProtocol": "1.5",
+ "connectionToken": self.connection_token,
+ "connectionData": '[{"name":"gbshub"}]',
+ "_": str(int(datetime.now().timestamp() * 1000)),
+ }
+
+ logger.info("Starting SignalR connection...")
+ response = await self.client.get(url, params=params)
+ response.raise_for_status()
+
+ return response.json()
+
+ async def connect_websocket(
+ self, subscriptions: list[dict[str, Any]], writer: LiveUpdateWriter
+ ) -> dict[str, int]:
+ """
+ Connect to WebSocket and listen for real-time updates.
+
+ Args:
+ subscriptions: List of subscription dicts with SportType, SportSubType, etc.
+ writer: LiveUpdateWriter instance for saving updates
+
+ Returns:
+ Dict with message and update counts
+ """
+ if not self.connection_token:
+ raise ValueError("Must call negotiate() and start_connection() first")
+
+ # Properly URL-encode parameters
+ connection_data = '[{"name":"gbshub"}]'
+ ws_url = (
+ f"{self.WS_BASE}/connect"
+ f"?transport=webSockets"
+ f"&clientProtocol=1.5"
+ f"&connectionToken={quote(self.connection_token, safe='')}"
+ f"&connectionData={quote(connection_data, safe='')}"
+ f"&tid=7"
+ )
+
+ logger.info("Connecting to WebSocket...")
+ logger.debug(f"WebSocket URL: {ws_url}")
+
+ # Add headers that SignalR expects
+ headers = {
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+ "Origin": "https://www.overtime.ag",
+ }
+
+ async with websockets.connect(ws_url, additional_headers=headers) as websocket:
+ logger.info("WebSocket connected! Listening for updates...")
+
+ # Send a subscribeSport request for each caller-provided subscription
+ # Format: subscribeSport({ SportType, SportSubType, Store, Type })
+ for idx, sub in enumerate(subscriptions, start=1):
+ subscribe_msg = json.dumps(
+ {"H": "gbshub", "M": "subscribeSport", "A": [sub], "I": idx}
+ )
+ await websocket.send(subscribe_msg)
+ logger.info(f"Sent subscription request for {sub['SportSubType']}...")
+ await asyncio.sleep(0.2) # Small delay between subscriptions
+
+ # Wait for subscription confirmations
+ await asyncio.sleep(1)
+
+ # Listen for messages
+ message_count = 0
+ odds_updates = 0
+ subscriptions_confirmed = set()
+
+ async for message in websocket:
+ message_count += 1
+
+ # Parse message
+ try:
+ data = json.loads(message)
+
+ # Check for subscription confirmation
+ if "I" in data and "E" not in data and data.get("I") in ["1", "2"]:
+ sub_id = data["I"]
+ subscriptions_confirmed.add(sub_id)
+ logger.info(f"✓ Subscription {sub_id} confirmed!")
+ if len(subscriptions_confirmed) == len(subscriptions):
+ logger.info("All subscriptions active. Waiting for odds updates...")
+ continue
+
+ # Check for errors
+ if "E" in data:
+ logger.error(f"SignalR Error: {data['E']}")
+ continue
+
+ # Check for SignalR hub messages (odds updates)
+ if "M" in data and data["M"]:
+ odds_updates += 1
+ logger.info(f"\n[Odds Update {odds_updates}]")
+ for msg in data["M"]:
+ hub_method = msg.get("M")
+ hub_data = msg.get("A", [])
+
+ logger.info(f" Hub: {msg.get('H')}, Method: {hub_method}")
+ logger.info(f" Data: {json.dumps(hub_data, indent=4)}")
+
+ # Save update to storage
+ update_record = {
+ "hub": msg.get("H"),
+ "method": hub_method,
+ "data": hub_data,
+ "raw_message": data,
+ }
+ # Save with first subscription's metadata (could be enhanced)
+ writer.write(
+ update_record,
+ subscriptions[0]["SportType"],
+ subscriptions[0]["SportSubType"],
+ )
+
+ # Log keepalive messages quietly
+ elif data == {}:
+ logger.debug(f"[Keepalive {message_count}]")
+
+ # Log connection messages
+ elif "C" in data:
+ logger.debug(f"[Connection message: {data.get('C')}]")
+
+ # Unknown message format
+ else:
+ logger.info(f"\n[Message {message_count}] {json.dumps(data, indent=2)}")
+
+ except json.JSONDecodeError:
+ logger.warning(f"Could not parse as JSON: {message[:200]}...")
+
+ # Return stats
+ return {"messages": message_count, "odds_updates": odds_updates}
+
+ async def close(self) -> None:
+ """Close HTTP client."""
+ await self.client.aclose()
+
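The `async for` loop above dispatches on the standard SignalR 1.5 frame shapes: invocation results (`I`), errors (`E`), hub messages (`M`), empty keepalives, and connection frames (`C`). The same dispatch as a standalone sketch (a hypothetical helper, not part of the scraper):

```python
import json


def classify_frame(raw: str) -> str:
    """Classify a SignalR 1.5 frame by its top-level keys."""
    msg = json.loads(raw)
    if msg == {}:
        return "keepalive"
    if "E" in msg:
        return "error"
    if msg.get("M"):  # non-empty list of hub invocations (odds updates)
        return "hub_message"
    if "I" in msg:
        return "invocation_result"  # e.g. a subscription confirmation
    if "C" in msg:
        return "connection"
    return "unknown"
```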
+
+async def demo_rest_scraper() -> None:
+ """Demonstrate REST API scraping."""
+ scraper = OvertimeAgRESTScraper()
+
+ try:
+ # Get College Basketball odds
+ sport_type = "Basketball"
+ sport_subtype = "College Basketball"
+ ncaa_data = await scraper.get_sport_offering(
+ sport_type=sport_type,
+ sport_subtype=sport_subtype,
+ )
+
+ game_lines = ncaa_data.get("GameLines", [])
+ logger.info(f"\n{'=' * 80}")
+ logger.info(f"Found {len(game_lines)} College Basketball games")
+ logger.info(f"{'=' * 80}\n")
+
+ # Save snapshot to Parquet
+ filepath = save_snapshot(ncaa_data, sport_type, sport_subtype)
+
+ # Display first game as example
+ if game_lines:
+ game = game_lines[0]
+ logger.info(f"Game: {game['Team1ID']} @ {game['Team2ID']}")
+ logger.info(f"Time: {game['GameDateTimeString']}")
+ logger.info(f"Spread: {game['Team1ID']} {game['Spread1']} ({game['SpreadAdj1']})")
+ logger.info(f" {game['Team2ID']} {game['Spread2']} ({game['SpreadAdj2']})")
+ logger.info(f"Moneyline: {game['Team1ID']} {game['MoneyLine1']}")
+ logger.info(f" {game['Team2ID']} {game['MoneyLine2']}")
+ logger.info(
+ f"Total: {game['TotalPoints']} (O/U: {game['TtlPtsAdj1']}/{game['TtlPtsAdj2']})"
+ )
+ logger.info(f"Rotation: {game['Team1RotNum']} / {game['Team2RotNum']}")
+
+ if filepath:
+ logger.info(f"\n[OK] Snapshot saved: {filepath}")
+
+ finally:
+ await scraper.close()
+
+
+async def demo_websocket_scraper(duration_seconds: int = 300) -> None:
+ """
+ Demonstrate WebSocket scraping.
+
+ Args:
+ duration_seconds: How long to listen for updates
+ """
+ scraper = OvertimeAgWebSocketScraper()
+ stats = {"messages": 0, "odds_updates": 0}
+
+ # Define subscriptions
+ subscriptions = [
+ {"SportType": "Basketball", "SportSubType": "College Basketball", "Store": "", "Type": 1},
+ {"SportType": "Basketball", "SportSubType": "College Extra", "Store": "", "Type": 1},
+ ]
+
+ # Create writer for this session
+ writer = LiveUpdateWriter()
+ logger.info(f"Live updates will be saved to: {writer.filepath}")
+
+ try:
+ # Establish connection
+ await scraper.negotiate()
+ await scraper.start_connection()
+
+ # Listen for updates with timeout
+ try:
+ stats = await asyncio.wait_for(
+ scraper.connect_websocket(subscriptions, writer),
+ timeout=duration_seconds,
+ )
+ except asyncio.TimeoutError:
+ # wait_for cancels connect_websocket on timeout, so stats still holds its initial values here
+ logger.info(f"\n{'=' * 80}")
+ logger.info(f"Stopped after {duration_seconds} seconds")
+ logger.info(f"Messages received: {stats['messages']}")
+ logger.info(f"Odds updates: {stats['odds_updates']}")
+ if stats["odds_updates"] == 0:
+ logger.info("(No odds changed during monitoring period)")
+ else:
+ logger.info(f"[OK] Updates saved to {LIVE_UPDATES_DIR}")
+ logger.info(f"{'=' * 80}")
+
+ finally:
+ await scraper.close()
+
+
+async def main() -> None:
+ """Run both scrapers."""
+ # Initialize data storage
+ ensure_data_dirs()
+
+ logger.info("=" * 80)
+ logger.info("Overtime.ag Scraper Demo")
+ logger.info("=" * 80)
+
+ # Demo 1: REST API
+ logger.info("\n1. REST API Scraper")
+ logger.info("-" * 80)
+ await demo_rest_scraper()
+
+ # Demo 2: WebSocket (5 minutes to catch live updates)
+ logger.info("\n\n2. WebSocket Scraper (Real-time updates)")
+ logger.info("-" * 80)
+ await demo_websocket_scraper(duration_seconds=300)
+
+
+if __name__ == "__main__":
+ asyncio.run(main())
diff --git a/scripts/archive/2026-02-deprecated/overtime_scraper_adapter.py b/scripts/archive/2026-02-deprecated/overtime_scraper_adapter.py
new file mode 100644
index 000000000..aea133837
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/overtime_scraper_adapter.py
@@ -0,0 +1,240 @@
+"""Adapter for scraping overtime.ag betting data using chrome-devtools MCP."""
+
+import logging
+import re
+from datetime import datetime
+from decimal import Decimal
+from typing import Any
+
+from sports_betting_edge.core.tracking.overtime import (
+ AccountBalance,
+ DailyFigure,
+ DailyFiguresSnapshot,
+ OpenBet,
+ OpenBetsSnapshot,
+ OvertimeSnapshot,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class OvertimeScraperAdapter:
+ """Adapter for scraping overtime.ag data via chrome-devtools MCP."""
+
+ BASE_URL = "https://overtime.ag/sports#/"
+ OPEN_BETS_URL = "https://overtime.ag/sports#/openBets"
+ DAILY_FIGURES_URL = "https://overtime.ag/sports#/dailyFigures"
+
+ def __init__(self, chrome_devtools_client: object) -> None:
+ """Initialize the scraper with a chrome-devtools MCP client.
+
+ Args:
+ chrome_devtools_client: The MCP client for chrome-devtools operations
+ """
+ self.client = chrome_devtools_client
+
+ def _parse_currency(self, value: str) -> Decimal:
+ """Parse a currency string to Decimal.
+
+ Args:
+ value: Currency string like "$1,821.01" or "$-120.00"
+
+ Returns:
+ Decimal value
+ """
+ # Remove $ and commas, keep negative sign
+ cleaned = value.replace("$", "").replace(",", "").strip()
+ return Decimal(cleaned)
+
+ def _parse_bet_details(self, details: str) -> dict[str, str | None]:
+ """Parse bet details string into components.
+
+ Args:
+ details: String like "Basketball - 306513 Houston Christian +9½ -120 for Game"
+
+ Returns:
+ Dict with rotation_number, team, line, odds
+ """
+ # Pattern: "Sport - RotNum Team Line Odds for Game"
+ pattern = r"^.*?-\s*(\d+)\s+(.+?)\s+([\+\-][\d½\.]+)\s+([\+\-]\d+)"
+ match = re.search(pattern, details)
+
+ if match:
+ return {
+ "rotation_number": match.group(1),
+ "team": match.group(2).strip(),
+ "line": match.group(3),
+ "odds": match.group(4),
+ }
+
+ # For totals: "Basketball - 871 Syracuse/North Carolina over 158 -110 for Game"
+ total_pattern = r"^.*?-\s*(\d+)\s+(.+?)\s+(over|under)\s+([\d\.]+)\s+([\+\-]\d+)"
+ total_match = re.search(total_pattern, details)
+
+ if total_match:
+ return {
+ "rotation_number": total_match.group(1),
+ "team": total_match.group(2).strip(),
+ "line": f"{total_match.group(3)} {total_match.group(4)}",
+ "odds": total_match.group(5),
+ }
+
+ return {
+ "rotation_number": None,
+ "team": None,
+ "line": None,
+ "odds": None,
+ }
+
+ def _parse_date_time(self, date_str: str, time_str: str) -> datetime:
+ """Parse date and time strings into datetime.
+
+ Args:
+ date_str: Date string like "MON 2/2"
+ time_str: Time string like "7:17 PM"
+
+ Returns:
+ Datetime object (uses current year)
+ """
+ # Extract month and day from "MON 2/2"
+ date_match = re.search(r"(\d+)/(\d+)", date_str)
+ if not date_match:
+ raise ValueError(f"Could not parse date: {date_str}")
+
+ month = int(date_match.group(1))
+ day = int(date_match.group(2))
+ current_year = datetime.now().year
+
+ # Parse time "7:17 PM"
+ time_obj = datetime.strptime(time_str, "%I:%M %p")
+
+ return datetime(
+ current_year,
+ month,
+ day,
+ time_obj.hour,
+ time_obj.minute,
+ )
+
+ async def scrape_account_balance(self, snapshot_data: dict[str, str]) -> AccountBalance:
+ """Extract account balance from menu snapshot data.
+
+ Args:
+ snapshot_data: Dict with balance, credit_limit, pending, etc.
+
+ Returns:
+ AccountBalance model
+ """
+ return AccountBalance(
+ balance=self._parse_currency(snapshot_data["balance"]),
+ credit_limit=self._parse_currency(snapshot_data["credit_limit"]),
+ pending=self._parse_currency(snapshot_data["pending"]),
+ available_balance=self._parse_currency(snapshot_data["available_balance"]),
+ casino_balance=self._parse_currency(snapshot_data.get("casino_balance", "$0.00")),
+ )
+
+ async def scrape_open_bets(self, snapshot_data: list[dict[str, str]]) -> OpenBetsSnapshot:
+ """Extract open bets from page snapshot data.
+
+ Args:
+ snapshot_data: List of dicts with bet information
+
+ Returns:
+ OpenBetsSnapshot model
+ """
+ bets: list[OpenBet] = []
+
+ for bet_data in snapshot_data:
+ # Parse date and time
+ accepted_date = self._parse_date_time(bet_data["date"], bet_data["time"])
+
+ # Parse bet details
+ details_parsed = self._parse_bet_details(bet_data["details"])
+
+ bet = OpenBet(
+ ticket_number=bet_data["ticket_number"],
+ accepted_date=accepted_date,
+ bet_type=bet_data["bet_type"],
+ details=bet_data["details"],
+ rotation_number=details_parsed["rotation_number"],
+ team=details_parsed["team"],
+ line=details_parsed["line"],
+ odds=details_parsed["odds"],
+ risk_amount=self._parse_currency(bet_data["risk"]),
+ to_win_amount=self._parse_currency(bet_data["to_win"]),
+ )
+ bets.append(bet)
+
+ total_risk = sum((bet.risk_amount for bet in bets), Decimal("0.00"))
+ total_to_win = sum((bet.to_win_amount for bet in bets), Decimal("0.00"))
+
+ return OpenBetsSnapshot(
+ bets=bets,
+ total_risk=total_risk,
+ total_to_win=total_to_win,
+ )
+
+ async def scrape_daily_figures(self, snapshot_data: dict[str, Any]) -> DailyFiguresSnapshot:
+ """Extract daily figures from page snapshot data.
+
+ Args:
+ snapshot_data: Dict with current_week, last_week, past_weeks data
+
+ Returns:
+ DailyFiguresSnapshot model
+ """
+
+ def parse_week(week_data: dict[str, str]) -> DailyFigure:
+ # Parse starting date "02/02/2026"
+ date_obj = datetime.strptime(week_data["starting_date"], "%m/%d/%Y")
+
+ return DailyFigure(
+ date=date_obj,
+ starting_balance=self._parse_currency(week_data["starting_balance"]),
+ monday=self._parse_currency(week_data.get("monday", "$0.00")),
+ tuesday=self._parse_currency(week_data.get("tuesday", "$0.00")),
+ wednesday=self._parse_currency(week_data.get("wednesday", "$0.00")),
+ thursday=self._parse_currency(week_data.get("thursday", "$0.00")),
+ friday=self._parse_currency(week_data.get("friday", "$0.00")),
+ saturday=self._parse_currency(week_data.get("saturday", "$0.00")),
+ sunday=self._parse_currency(week_data.get("sunday", "$0.00")),
+ week_total=self._parse_currency(week_data["week_total"]),
+ payments=self._parse_currency(week_data.get("payments", "$0.00")),
+ ending_balance=self._parse_currency(week_data["balance"]),
+ )
+
+ current_week = parse_week(snapshot_data["current_week"])
+ last_week = parse_week(snapshot_data["last_week"])
+ past_weeks = [parse_week(week) for week in snapshot_data.get("past_weeks", [])]
+
+ return DailyFiguresSnapshot(
+ current_week=current_week,
+ last_week=last_week,
+ past_weeks=past_weeks,
+ )
+
+ async def scrape_full_snapshot(
+ self,
+ balance_data: dict[str, str],
+ open_bets_data: list[dict[str, str]],
+ daily_figures_data: dict[str, dict[str, str]],
+ ) -> OvertimeSnapshot:
+ """Create a complete snapshot from all scraped data.
+
+ Args:
+ balance_data: Account balance information
+ open_bets_data: Open bets information
+ daily_figures_data: Daily figures information
+
+ Returns:
+ Complete OvertimeSnapshot
+ """
+ account_balance = await self.scrape_account_balance(balance_data)
+ open_bets = await self.scrape_open_bets(open_bets_data)
+ daily_figures = await self.scrape_daily_figures(daily_figures_data)
+
+ return OvertimeSnapshot(
+ account_balance=account_balance,
+ open_bets=open_bets,
+ daily_figures=daily_figures,
+ )
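The two regexes in `_parse_bet_details` can be exercised standalone against the docstring examples; the pattern constants below are verbatim copies of the adapter's inline patterns:

```python
import re

SIDE_PATTERN = re.compile(r"^.*?-\s*(\d+)\s+(.+?)\s+([\+\-][\d½\.]+)\s+([\+\-]\d+)")
TOTAL_PATTERN = re.compile(r"^.*?-\s*(\d+)\s+(.+?)\s+(over|under)\s+([\d\.]+)\s+([\+\-]\d+)")

# Side bet: "Sport - RotNum Team Line Odds for Game"
side = SIDE_PATTERN.search("Basketball - 306513 Houston Christian +9½ -120 for Game")

# Total bet: "Sport - RotNum Team1/Team2 over Total Odds for Game"
total = TOTAL_PATTERN.search("Basketball - 871 Syracuse/North Carolina over 158 -110 for Game")
```

For the side bet this yields rotation `306513`, team `Houston Christian`, line `+9½`, odds `-120`; for the total, `over` `158` at `-110`. Note the `½` character in the side pattern's class, which is how overtime.ag renders half-point lines.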
diff --git a/scripts/archive/2026-02-deprecated/overtime_tracker_service.py b/scripts/archive/2026-02-deprecated/overtime_tracker_service.py
new file mode 100644
index 000000000..df1fbdeb4
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/overtime_tracker_service.py
@@ -0,0 +1,275 @@
+"""Service for tracking overtime.ag betting data."""
+
+import logging
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import (
+ append_to_parquet,
+ ensure_dir,
+ read_parquet_df,
+ write_json,
+)
+from sports_betting_edge.adapters.overtime_scraper import OvertimeScraperAdapter
+from sports_betting_edge.core.tracking.overtime import OvertimeSnapshot
+
+logger = logging.getLogger(__name__)
+
+
+class OvertimeTrackerService:
+ """Service for tracking and persisting overtime.ag betting data."""
+
+ def __init__(
+ self,
+ scraper: OvertimeScraperAdapter,
+ data_dir: Path,
+ ) -> None:
+ """Initialize the tracker service.
+
+ Args:
+ scraper: The overtime.ag scraper adapter
+ data_dir: Base directory to store tracker parquet files (uses data_dir/tracker)
+ """
+ self.scraper = scraper
+ self.data_dir = ensure_dir(data_dir)
+ self.tracker_dir = ensure_dir(self.data_dir / "tracker")
+
+ def _get_snapshot_dir(self, snapshot_type: str, date: datetime | None = None) -> Path:
+ """Get the directory for a snapshot file.
+
+ Args:
+ snapshot_type: Type of snapshot (account_balance, open_bets, daily_figures, full)
+ date: Date for the snapshot (defaults to today)
+
+ Returns:
+ Path to the snapshot directory
+ """
+ if date is None:
+ date = datetime.now()
+
+ date_partition = date.strftime("%Y-%m-%d")
+ return self.tracker_dir / snapshot_type / date_partition
+
+ def _get_snapshot_path(
+ self,
+ snapshot_type: str,
+ date: datetime | None = None,
+ filename: str | None = None,
+ ) -> Path:
+ """Get the path for a snapshot file."""
+ base_dir = self._get_snapshot_dir(snapshot_type, date)
+ name = filename or f"{snapshot_type}.parquet"
+ return base_dir / name
+
+ def _save_to_parquet(
+ self,
+ data: list[dict[str, object]],
+ snapshot_type: str,
+ date: datetime | None = None,
+ ) -> Path:
+ """Save data to parquet file (append if exists).
+
+ Args:
+ data: List of dicts to save
+ snapshot_type: Type of snapshot
+ date: Date for the snapshot
+
+ Returns:
+ Path to saved file
+ """
+ path = self._get_snapshot_path(snapshot_type, date)
+ ensure_dir(path.parent) # dated partition directory may not exist yet
+ append_to_parquet(data, path)
+ logger.info(f"Saved {len(data)} records to {path}")
+
+ return path
+
+ async def save_full_snapshot(self, snapshot: OvertimeSnapshot) -> dict[str, Path]:
+ """Save a complete snapshot to parquet files.
+
+ Args:
+ snapshot: Complete overtime snapshot
+
+ Returns:
+ Dict mapping snapshot type to saved file path
+ """
+ saved_paths: dict[str, Path] = {}
+
+ # Save account balance
+ account_data = [snapshot.account_balance.model_dump()]
+ saved_paths["account"] = self._save_to_parquet(
+ account_data, "account_balance", snapshot.snapshot_time
+ )
+
+ # Save open bets
+ if snapshot.open_bets.bets:
+ open_bets_data = [
+ {
+ "snapshot_time": snapshot.snapshot_time,
+ **bet.model_dump(),
+ }
+ for bet in snapshot.open_bets.bets
+ ]
+ saved_paths["open_bets"] = self._save_to_parquet(
+ open_bets_data, "open_bets", snapshot.snapshot_time
+ )
+
+ # Save daily figures - current week
+ current_week_data = [
+ {
+ "snapshot_time": snapshot.snapshot_time,
+ "period": "current_week",
+ **snapshot.daily_figures.current_week.model_dump(),
+ }
+ ]
+ saved_paths["daily_current"] = self._save_to_parquet(
+ current_week_data, "daily_figures", snapshot.snapshot_time
+ )
+
+ # Save daily figures - last week
+ last_week_data = [
+ {
+ "snapshot_time": snapshot.snapshot_time,
+ "period": "last_week",
+ **snapshot.daily_figures.last_week.model_dump(),
+ }
+ ]
+ saved_paths["daily_last"] = self._save_to_parquet(
+ last_week_data, "daily_figures", snapshot.snapshot_time
+ )
+
+ # Save daily figures - past weeks
+ if snapshot.daily_figures.past_weeks:
+ past_weeks_data = [
+ {
+ "snapshot_time": snapshot.snapshot_time,
+ "period": "past_week",
+ **week.model_dump(),
+ }
+ for week in snapshot.daily_figures.past_weeks
+ ]
+ saved_paths["daily_past"] = self._save_to_parquet(
+ past_weeks_data, "daily_figures", snapshot.snapshot_time
+ )
+
+ # Save complete snapshot as JSON for reference
+ full_snapshot_dir = ensure_dir(
+ self._get_snapshot_dir("full_snapshot", snapshot.snapshot_time)
+ )
+ timestamp = snapshot.snapshot_time.strftime("%H-%M-%S")
+ full_snapshot_path = full_snapshot_dir / f"full_snapshot_{timestamp}.json"
+ write_json(snapshot.model_dump(mode="json"), full_snapshot_path)
+ saved_paths["full"] = full_snapshot_path
+
+ logger.info(f"Saved complete snapshot with {len(saved_paths)} files")
+ return saved_paths
+
+ def get_latest_balance(self) -> dict[str, object] | None:
+ """Get the most recent account balance.
+
+ Returns:
+ Latest balance data or None if no data exists
+ """
+ balance_dir = self.tracker_dir / "account_balance"
+ if not balance_dir.exists():
+ return None
+
+ # Find most recent file
+ files = sorted(balance_dir.rglob("account_balance.parquet"))
+ if not files:
+ return None
+
+ df = read_parquet_df(files[-1])
+ if df.empty:
+ return None
+
+ result = df.iloc[-1].to_dict()
+ return {str(k): v for k, v in result.items()}
+
+ def get_open_bets_summary(self, date: datetime | None = None) -> pd.DataFrame | None:
+ """Get summary of open bets for a date.
+
+ Args:
+ date: Date to get bets for (defaults to today)
+
+ Returns:
+ DataFrame with open bets or None
+ """
+ bets_path = self._get_snapshot_path("open_bets", date)
+ if not bets_path.exists():
+ return None
+
+ return read_parquet_df(bets_path)
+
+ def get_performance_summary(self, weeks: int = 4) -> pd.DataFrame | None:
+ """Get performance summary for recent weeks.
+
+ Args:
+ weeks: Number of weeks to include
+
+ Returns:
+ DataFrame with weekly performance
+ """
+ figures_dir = self.tracker_dir / "daily_figures"
+ if not figures_dir.exists():
+ return None
+
+ # Get most recent file
+ files = sorted(figures_dir.rglob("daily_figures.parquet"))
+ if not files:
+ return None
+
+ df = read_parquet_df(files[-1])
+
+ # Filter to recent weeks and calculate metrics
+ df["win_rate"] = df.apply(
+ lambda row: (
+ sum(
+ 1
+ for day in [
+ "monday",
+ "tuesday",
+ "wednesday",
+ "thursday",
+ "friday",
+ "saturday",
+ "sunday",
+ ]
+ if row.get(day, 0) > 0
+ )
+ / sum(
+ 1
+ for day in [
+ "monday",
+ "tuesday",
+ "wednesday",
+ "thursday",
+ "friday",
+ "saturday",
+ "sunday",
+ ]
+ if row.get(day, 0) != 0
+ )
+ if sum(
+ 1
+ for day in [
+ "monday",
+ "tuesday",
+ "wednesday",
+ "thursday",
+ "friday",
+ "saturday",
+ "sunday",
+ ]
+ if row.get(day, 0) != 0
+ )
+ > 0
+ else 0
+ ),
+ axis=1,
+ )
+
+ df["roi"] = (df["week_total"] / df["starting_balance"]) * 100
+
+ return df.head(weeks)
diff --git a/scripts/archive/2026-02-deprecated/plot_overtime_line_movement.py b/scripts/archive/2026-02-deprecated/plot_overtime_line_movement.py
new file mode 100644
index 000000000..517a174de
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/plot_overtime_line_movement.py
@@ -0,0 +1,421 @@
+#!/usr/bin/env python3
+"""Plot overtime.ag line movements and save to artifacts/.
+
+Usage:
+ uv run python scripts/plot_overtime_line_movement.py --list-games
+ uv run python scripts/plot_overtime_line_movement.py --list-games --date 2026-02-03
+ uv run python scripts/plot_overtime_line_movement.py --plot-all --date 2026-02-04
+ uv run python scripts/plot_overtime_line_movement.py --away "Kansas" --home "Texas Tech"
+ uv run python scripts/plot_overtime_line_movement.py --away "Kansas" --home "Texas Tech" \
+ --date 2026-02-03
+ uv run python scripts/plot_overtime_line_movement.py --game-num 601
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import re
+import sqlite3
+from collections.abc import Iterable
+from dataclasses import dataclass
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import ensure_dir, save_matplotlib_figure
+
+logger = logging.getLogger(__name__)
+
+DEFAULT_DB_PATH = Path("data/overtime/overtime_lines.db")
+DEFAULT_RAW_DIR = Path("data/raw")
+DEFAULT_ARTIFACTS_DIR = Path("artifacts")
+
+
+@dataclass(frozen=True)
+class PlotConfig:
+ output_path: Path
+ title: str
+ away_team: str | None = None
+ home_team: str | None = None
+
+
+def _slugify(value: str) -> str:
+ value = value.lower().strip()
+ value = re.sub(r"[^a-z0-9]+", "_", value)
+ return value.strip("_") or "plot"
+
+
+def _build_output_path(label: str, output: Path | None) -> Path:
+ if output is not None:
+ return output
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+ filename = f"overtime_line_movement_{_slugify(label)}_{timestamp}.png"
+ return DEFAULT_ARTIFACTS_DIR / filename
+
+
+def _build_output_path_for_matchup(
+ output_dir: Path,
+ date_str: str,
+ away: str,
+ home: str,
+) -> Path:
+ base = f"{_slugify(away)}_at_{_slugify(home)}_{date_str}"
+ candidate = output_dir / f"overtime_line_movement_{base}.png"
+ if not candidate.exists():
+ return candidate
+ suffix = 2
+ while True:
+ candidate = output_dir / f"overtime_line_movement_{base}_{suffix}.png"
+ if not candidate.exists():
+ return candidate
+ suffix += 1
+
+
+def _load_snapshots(
+ db_path: Path, away: str, home: str, date_str: str | None = None
+) -> pd.DataFrame:
+ query = """
+ SELECT captured_at, spread_magnitude, total_points,
+ spread_favorite_price, total_over_price, favorite_team
+ FROM overtime_line_snapshots
+ WHERE away_team = ? AND home_team = ?
+ ORDER BY captured_at
+ """
+ params: tuple[str, str] | tuple[str, str, str]
+ if date_str:
+ query = query.replace("ORDER BY", "AND date(captured_at) = ? ORDER BY")
+ params = (away, home, date_str)
+ else:
+ params = (away, home)
+
+ with sqlite3.connect(db_path) as conn:
+ df = pd.read_sql_query(query, conn, params=params)
+ return df
+
+
+def _get_latest_snapshot_date(db_path: Path) -> str | None:
+ with sqlite3.connect(db_path) as conn:
+ row = conn.execute("SELECT date(max(captured_at)) FROM overtime_line_snapshots").fetchone()
+ return row[0] if row and row[0] else None
+
+
+def _list_games(db_path: Path, date_str: str | None) -> tuple[str | None, list[tuple[str, str]]]:
+ effective_date = date_str or _get_latest_snapshot_date(db_path)
+ if effective_date is None:
+ return None, []
+
+ query = """
+ SELECT away_team, home_team
+ FROM overtime_line_snapshots
+ WHERE date(captured_at) = ?
+ GROUP BY away_team, home_team
+ ORDER BY away_team, home_team
+ """
+ with sqlite3.connect(db_path) as conn:
+ rows = conn.execute(query, (effective_date,)).fetchall()
+ return effective_date, [(row[0], row[1]) for row in rows]
+
+
+def _validate_date(date_str: str | None) -> str | None:
+ if date_str is None:
+ return None
+ try:
+ datetime.strptime(date_str, "%Y-%m-%d")
+ except ValueError as exc:
+ raise SystemExit("Invalid --date format. Use YYYY-MM-DD.") from exc
+ return date_str
+
+
+def _find_latest_parquet(raw_dir: Path) -> Path | None:
+ files = sorted(raw_dir.glob("overtime_lines_*.parquet"))
+ return files[-1] if files else None
+
+
+def _signed_spread_series(df: pd.DataFrame) -> pd.Series:
+    if "raw_line" in df.columns and df["raw_line"].notna().any():
+        return df["raw_line"].astype(float)
+    if "side_role" not in df.columns:
+        # Without side information we cannot sign the line; return magnitudes as-is.
+        return df["line_points"].astype(float)
+    sign = df["side_role"].map({"FAVORITE": -1, "UNDERDOG": 1})
+    return df["line_points"].astype(float) * sign.astype(float)
+
+
+def _resolve_favorite_team(df: pd.DataFrame) -> str | None:
+ if "favorite_team" not in df.columns:
+ return None
+ values = df["favorite_team"].dropna().astype(str).str.strip()
+ values = values[values != ""]
+ if values.empty:
+ return None
+ return values.value_counts().idxmax()
+
+
+def _plot_snapshot_movements(df: pd.DataFrame, config: PlotConfig) -> Path:
+ try:
+ import matplotlib.pyplot as plt # type: ignore
+ except ImportError as exc:
+ raise RuntimeError("matplotlib not installed. Install with: uv add matplotlib") from exc
+
+ df = df.copy()
+ df["captured_at"] = pd.to_datetime(df["captured_at"])
+
+ fig, ax1 = plt.subplots(figsize=(10, 5))
+ spread_values = pd.to_numeric(df["spread_magnitude"], errors="coerce")
+ favorite = _resolve_favorite_team(df)
+ away = config.away_team
+ home = config.home_team
+ underdog: str | None = None
+ if favorite and away and home:
+ if favorite == away:
+ underdog = home
+ elif favorite == home:
+ underdog = away
+
+ if favorite and underdog:
+ ax1.plot(
+ df["captured_at"],
+ -spread_values,
+ label=f"Spread (favorite: {favorite})",
+ color="tab:blue",
+ )
+ ax1.plot(
+ df["captured_at"],
+ spread_values,
+ label=f"Spread (underdog: {underdog})",
+ color="tab:green",
+ )
+ else:
+ ax1.plot(
+ df["captured_at"],
+ spread_values,
+ label="Spread (magnitude)",
+ color="tab:blue",
+ )
+ ax1.set_ylabel("Spread (pts)")
+
+ ax2 = ax1.twinx()
+ total_values = pd.to_numeric(df["total_points"], errors="coerce")
+ ax2.plot(
+ df["captured_at"],
+ total_values,
+ label="Total line",
+ color="tab:orange",
+ )
+ ax2.set_ylabel("Total (pts)")
+
+ ax1.set_title(config.title)
+ handles1, labels1 = ax1.get_legend_handles_labels()
+ handles2, labels2 = ax2.get_legend_handles_labels()
+ ax1.legend(handles1 + handles2, labels1 + labels2, loc="best")
+ fig.autofmt_xdate()
+ fig.tight_layout()
+
+ return save_matplotlib_figure(fig, config.output_path, dpi=200, bbox_inches="tight")
+
+
+def _plot_signalr_movements(df: pd.DataFrame, config: PlotConfig) -> Path:
+ try:
+ import matplotlib.pyplot as plt # type: ignore
+ except ImportError as exc:
+ raise RuntimeError("matplotlib not installed. Install with: uv add matplotlib") from exc
+
+ df = df.copy()
+ df["timestamp"] = pd.to_datetime(df["timestamp"])
+
+ spreads = df[df["market_type"] == "SPREAD"].copy()
+ totals = df[df["market_type"].isin(["TOTAL", "L", "T"])].copy()
+
+ fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
+
+ if not spreads.empty:
+ spreads["signed_line"] = _signed_spread_series(spreads)
+ spread_label = "Spread line"
+ if "team" in spreads.columns:
+ teams = [t for t in spreads["team"].dropna().unique() if str(t).strip()]
+ if len(teams) == 1:
+ spread_label = f"Spread (team: {teams[0]})"
+ elif len(teams) >= 2:
+ spread_label = f"Spread (teams: {teams[0]} / {teams[1]})"
+ axes[0].plot(
+ spreads["timestamp"],
+ spreads["signed_line"],
+ marker="o",
+ linewidth=1,
+ label=spread_label,
+ color="tab:blue",
+ )
+ axes[0].invert_yaxis()
+ axes[0].legend(loc="best")
+ axes[0].set_ylabel("Spread")
+ axes[0].set_title(config.title)
+
+ if not totals.empty:
+ totals["line_points"] = totals["line_points"].astype(float)
+ axes[1].plot(
+ totals["timestamp"],
+ totals["line_points"],
+ marker="o",
+ linewidth=1,
+ label="Total line",
+ color="tab:orange",
+ )
+ axes[1].legend(loc="best")
+ axes[1].set_ylabel("Total")
+ axes[1].set_xlabel("Time")
+
+ fig.autofmt_xdate()
+ fig.tight_layout()
+
+ return save_matplotlib_figure(fig, config.output_path, dpi=200, bbox_inches="tight")
+
+
+def _validate_pair(values: Iterable[str | None]) -> bool:
+    # Materialize once so a one-shot iterator is not consumed twice; the
+    # original `all(values) and any(values)` reduces to `all` for a non-empty input.
+    items = list(values)
+    return bool(items) and all(items)
+
+
+def main() -> None:
+ logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+
+ parser = argparse.ArgumentParser(description="Plot Overtime.ag line movements to PNG.")
+ parser.add_argument("--away", type=str, default=None, help="Away team name")
+ parser.add_argument("--home", type=str, default=None, help="Home team name")
+ parser.add_argument("--game-num", type=int, default=None, help="Overtime game number")
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=None,
+ help="Snapshot date filter (YYYY-MM-DD). Applies to --list-games and matchup plots.",
+ )
+ parser.add_argument(
+ "--list-games",
+ action="store_true",
+ help="List matchups from the snapshots DB and exit",
+ )
+ parser.add_argument(
+ "--plot-all",
+ action="store_true",
+ help="Plot all matchups for the date (or latest date) and exit",
+ )
+ parser.add_argument("--output", type=Path, default=None, help="PNG output path")
+ parser.add_argument("--db", type=Path, default=DEFAULT_DB_PATH, help="SQLite DB path")
+ parser.add_argument(
+ "--raw-dir",
+ type=Path,
+ default=DEFAULT_RAW_DIR,
+ help="Directory containing overtime_lines_*.parquet",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=Path,
+ default=None,
+ help="Directory for --plot-all output (default: artifacts/)",
+ )
+
+ args = parser.parse_args()
+ date_str = _validate_date(args.date)
+
+ if args.plot_all and args.game_num:
+ raise SystemExit("--plot-all cannot be combined with --game-num.")
+
+ if args.list_games:
+ if not args.db.exists():
+ raise SystemExit(f"DB not found: {args.db}")
+ latest_date, games = _list_games(args.db, date_str)
+ if not games:
+ raise SystemExit("No snapshot games found to list.")
+ label = latest_date if date_str else f"{latest_date} (latest snapshot date)"
+ print(f"Matchups for {label}:")
+ for away, home in games:
+ print(f"- {away} @ {home}")
+ return
+
+ if args.plot_all:
+ if not args.db.exists():
+ raise SystemExit(f"DB not found: {args.db}")
+ effective_date, games = _list_games(args.db, date_str)
+ if not games or effective_date is None:
+ raise SystemExit("No snapshot games found to plot.")
+
+ output_dir = args.output_dir or DEFAULT_ARTIFACTS_DIR
+ ensure_dir(output_dir)
+ logger.info(
+ "Plotting %d matchups for %s into %s",
+ len(games),
+ effective_date,
+ output_dir,
+ )
+
+ successes: list[Path] = []
+ failures: list[str] = []
+ for away, home in games:
+ df = _load_snapshots(args.db, away, home, date_str=effective_date)
+ if df.empty:
+ failures.append(f"{away} @ {home} (no snapshots)")
+ continue
+
+ output_path = _build_output_path_for_matchup(output_dir, effective_date, away, home)
+ config = PlotConfig(
+ output_path=output_path,
+ title=f"Overtime.ag Line Movement: {away} @ {home}",
+ away_team=away,
+ home_team=home,
+ )
+ try:
+ saved = _plot_snapshot_movements(df, config)
+ successes.append(saved)
+ except Exception as exc:
+ failures.append(f"{away} @ {home} ({exc})")
+
+ logger.info("Saved %d plots.", len(successes))
+ if failures:
+ logger.warning("Failed to plot %d matchups.", len(failures))
+ for failure in failures:
+ logger.warning(" %s", failure)
+ return
+
+ if args.game_num and (args.away or args.home):
+ raise SystemExit("Provide either --game-num or both --away/--home (not both).")
+ if args.game_num is None and not _validate_pair([args.away, args.home]):
+ raise SystemExit("Provide both --away and --home, or use --game-num.")
+
+ if args.game_num is not None:
+ latest = _find_latest_parquet(args.raw_dir)
+ if latest is None:
+ raise SystemExit(f"No SignalR parquet files found in {args.raw_dir}")
+ df = pd.read_parquet(latest)
+ df = df[df["game_num"] == args.game_num].copy()
+ if df.empty:
+ raise SystemExit(f"No line changes found for game_num={args.game_num} in {latest}")
+
+ label = f"game_{args.game_num}"
+ output_path = _build_output_path(label, args.output)
+ ensure_dir(output_path.parent)
+ config = PlotConfig(
+ output_path=output_path,
+ title=f"Overtime.ag Line Changes (Game {args.game_num})",
+ )
+ saved = _plot_signalr_movements(df, config)
+ else:
+ if not args.db.exists():
+ raise SystemExit(f"DB not found: {args.db}")
+ df = _load_snapshots(args.db, args.away, args.home, date_str=date_str)
+ if df.empty:
+ date_label = f" on {date_str}" if date_str else ""
+ raise SystemExit(f"No snapshots found for {args.away} @ {args.home}{date_label}")
+
+ label = f"{args.away}_at_{args.home}"
+ output_path = _build_output_path(label, args.output)
+ ensure_dir(output_path.parent)
+ config = PlotConfig(
+ output_path=output_path,
+ title=f"Overtime.ag Line Movement: {args.away} @ {args.home}",
+ away_team=args.away,
+ home_team=args.home,
+ )
+ saved = _plot_snapshot_movements(df, config)
+
+ logger.info("Saved plot to %s", saved)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/predict_today.py b/scripts/archive/2026-02-deprecated/predict_today.py
new file mode 100644
index 000000000..018b41919
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/predict_today.py
@@ -0,0 +1,996 @@
+"""Generate daily predictions for today's NCAAB games.
+
+Uses trained XGBoost score regression models to predict game scores and derives
+mathematically consistent spread/total probabilities from those predictions.
+
+Output includes:
+- Predicted scores for home and away teams
+- Probability of favorite covering spread (derived from score predictions)
+- Probability of game going over total (derived from score predictions)
+- Expected value vs closing lines (for bankroll management)
+
+Usage:
+ uv run python scripts/predict_today.py
+ uv run python scripts/predict_today.py --date 2026-02-01
+ uv run python scripts/predict_today.py --output data/outputs/predictions/2026-02-01.csv
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from datetime import date
+from pathlib import Path
+
+import pandas as pd
+import xgboost as xgb
+from scipy import stats
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_csv
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.core.team_mapper import TeamMapper
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+# Model uncertainty constants (from training performance)
+HOME_SCORE_MAE = 5.38 # Mean absolute error for home score predictions
+AWAY_SCORE_MAE = 5.00 # Mean absolute error for away score predictions
+COMBINED_STDDEV = 7.6  # Combined error scale for probabilities; sqrt(5.38^2 + 5.00^2) ≈ 7.35
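+
+# Illustrative derivation (hypothetical variable names, not from training code):
+# assuming normally distributed score errors, the probability of the favorite
+# covering follows from the predicted margin and the spread magnitude:
+#
+#     edge = pred_margin - spread_magnitude
+#     cover_prob = stats.norm.cdf(edge / COMBINED_STDDEV)
+#
+# e.g. pred_margin = 8.0, spread_magnitude = 5.5 -> cdf(2.5 / 7.6) ≈ 0.63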
+
+# Spreads features from staging layer (50 features) - ORDER MATTERS!
+SPREADS_FEATURES = [
+ # Favorite team efficiency metrics
+ "fav_adj_em",
+ "fav_pythag",
+ "fav_adj_o",
+ "fav_adj_d",
+ "fav_adj_t",
+ "fav_luck",
+ "fav_sos",
+ "fav_height",
+ # Favorite team Four Factors
+ "fav_efg_pct",
+ "fav_to_pct",
+ "fav_or_pct",
+ "fav_ft_rate",
+ # Underdog team efficiency metrics
+ "dog_adj_em",
+ "dog_pythag",
+ "dog_adj_o",
+ "dog_adj_d",
+ "dog_adj_t",
+ "dog_luck",
+ "dog_sos",
+ "dog_height",
+ # Underdog team Four Factors
+ "dog_efg_pct",
+ "dog_to_pct",
+ "dog_or_pct",
+ "dog_ft_rate",
+ # Matchup differentials
+ "em_diff",
+ "pythag_diff",
+ "adj_o_diff",
+ "adj_d_diff",
+ "tempo_diff",
+ "luck_diff",
+ "sos_diff",
+ "height_diff",
+ # Matchup interaction features (offense vs defense)
+ "fav_offensive_efficiency",
+ "dog_offensive_efficiency",
+ "offensive_efficiency_diff",
+ "expected_margin",
+ # Line features
+ "opening_spread",
+ "closing_spread",
+ "line_movement",
+ # Rest & situational features
+ "fav_rest_days",
+ "dog_rest_days",
+ "fav_back_to_back",
+ "dog_back_to_back",
+ "fav_short_rest",
+ "dog_short_rest",
+ "fav_road_streak",
+ "dog_road_streak",
+ "fav_days_on_road",
+ "dog_days_on_road",
+ "rest_advantage",
+]
+
+# Totals features from staging layer (45 features)
+TOTALS_FEATURES = [
+ # Home team efficiency metrics
+ "home_adj_em",
+ "home_pythag",
+ "home_adj_o",
+ "home_adj_d",
+ "home_adj_t",
+ "home_luck",
+ "home_sos",
+ "home_height",
+ # Home team Four Factors
+ "home_efg_pct",
+ "home_to_pct",
+ "home_or_pct",
+ "home_ft_rate",
+ # Away team efficiency metrics
+ "away_adj_em",
+ "away_pythag",
+ "away_adj_o",
+ "away_adj_d",
+ "away_adj_t",
+ "away_luck",
+ "away_sos",
+ "away_height",
+ # Away team Four Factors
+ "away_efg_pct",
+ "away_to_pct",
+ "away_or_pct",
+ "away_ft_rate",
+ # Combined features
+ "total_offense",
+ "avg_tempo",
+ "avg_luck",
+ "height_diff",
+ "home_expected_pts",
+ "away_expected_pts",
+ "expected_total",
+ # Line features
+ "opening_total",
+ "closing_total",
+ "total_movement",
+ # Rest & situational features
+ "home_rest_days",
+ "away_rest_days",
+ "home_back_to_back",
+ "away_back_to_back",
+ "home_short_rest",
+ "away_short_rest",
+ "away_road_streak",
+ "away_days_on_road",
+ "rest_advantage",
+ "total_back_to_back",
+ "total_short_rest",
+]
+
+
+def load_models(
+ models_dir: Path, use_tuned: bool = True
+) -> tuple[xgb.XGBClassifier, xgb.XGBClassifier]:
+ """Load trained XGBoost models.
+
+ Args:
+ models_dir: Directory containing model files
+        use_tuned: Currently unused; the optimized v2 models are always loaded.
+
+ Returns:
+ (spreads_model, totals_model)
+ """
+ import pickle
+
+ # Use optimized models from Feb 5 2026 hyperparameter tuning
+ # spreads_2026_optimized_v2.pkl: 29 features, AUC 0.6598
+ # totals_2026_optimized_v2.pkl: 31 features, AUC 0.6819
+ spreads_filename = "spreads_2026_optimized_v2.pkl"
+ totals_filename = "totals_2026_optimized_v2.pkl"
+
+ spreads_path = models_dir / spreads_filename
+ totals_path = models_dir / totals_filename
+
+ if not spreads_path.exists():
+ raise FileNotFoundError(f"Spreads model not found: {spreads_path}")
+ if not totals_path.exists():
+ raise FileNotFoundError(f"Totals model not found: {totals_path}")
+
+ # Load pickle models
+ with open(spreads_path, "rb") as f:
+ spreads_model = pickle.load(f)
+
+ with open(totals_path, "rb") as f:
+ totals_model = pickle.load(f)
+
+ logger.info(f"Loaded spreads model from {spreads_path}")
+ logger.info(f"Loaded totals model from {totals_path}")
+
+ return spreads_model, totals_model
+
+
+def load_score_models(
+ models_dir: Path,
+) -> tuple[xgb.XGBRegressor, xgb.XGBRegressor]:
+ """Load trained score prediction models (REQUIRED).
+
+ These models are the primary prediction source. All probabilities are
+ mathematically derived from score predictions.
+
+ Args:
+ models_dir: Directory containing model files
+
+ Returns:
+ (home_model, away_model)
+ """
+ import pickle
+
+ home_path = models_dir / "home_score_2026.pkl"
+ away_path = models_dir / "away_score_2026.pkl"
+
+ if not home_path.exists():
+ raise FileNotFoundError(f"Home score model not found: {home_path}")
+ if not away_path.exists():
+ raise FileNotFoundError(f"Away score model not found: {away_path}")
+
+ with open(home_path, "rb") as f:
+ home_model = pickle.load(f)
+ with open(away_path, "rb") as f:
+ away_model = pickle.load(f)
+
+ logger.info(f"Loaded score models from {models_dir}")
+
+ return home_model, away_model
+
+
+def load_today_odds(odds_dir: Path, target_date: date) -> tuple[pd.DataFrame, pd.DataFrame]:
+ """Load today's odds from parquet files.
+
+ Args:
+ odds_dir: Directory containing daily odds files
+ target_date: Date to load odds for
+
+ Returns:
+ (spreads_df, totals_df)
+ """
+ date_str = target_date.isoformat()
+ spreads_path = odds_dir / f"{date_str}_spreads.parquet"
+ totals_path = odds_dir / f"{date_str}_totals.parquet"
+
+ if not spreads_path.exists():
+ raise FileNotFoundError(f"Spreads odds not found: {spreads_path}")
+ if not totals_path.exists():
+ raise FileNotFoundError(f"Totals odds not found: {totals_path}")
+
+ spreads = read_parquet_df(str(spreads_path))
+ totals = read_parquet_df(str(totals_path))
+
+ logger.info(f"Loaded {len(spreads)} spread records for {date_str}")
+ logger.info(f"Loaded {len(totals)} total records for {date_str}")
+
+ return spreads, totals
+
+
+def load_kenpom_data(
+ kenpom_dir: Path, season: int = 2026
+) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+ """Load KenPom ratings, four factors, and height data.
+
+ Args:
+ kenpom_dir: Directory containing KenPom data
+ season: Season year
+
+ Returns:
+ (ratings_df, four_factors_df, height_df)
+ """
+ ratings_path = kenpom_dir / "ratings" / "season" / f"ratings_{season}.parquet"
+ ff_path = kenpom_dir / "four-factors" / "season" / f"four-factors_{season}.parquet"
+ height_path = kenpom_dir / "height" / "season" / f"height_{season}.parquet"
+
+ if not ratings_path.exists():
+ raise FileNotFoundError(f"KenPom ratings not found: {ratings_path}")
+ if not ff_path.exists():
+ raise FileNotFoundError(f"Four factors not found: {ff_path}")
+ if not height_path.exists():
+ raise FileNotFoundError(f"Height data not found: {height_path}")
+
+ ratings = read_parquet_df(str(ratings_path))
+ four_factors = read_parquet_df(str(ff_path))
+ height = read_parquet_df(str(height_path))
+
+ logger.info(f"Loaded KenPom data for {len(ratings)} teams")
+
+ return ratings, four_factors, height
+
+
+def get_team_features(
+ team_name: str,
+ kenpom_ratings: pd.DataFrame,
+ kenpom_ff: pd.DataFrame,
+ kenpom_height: pd.DataFrame,
+ team_mapper: TeamMapper | None,
+ prefix: str = "",
+) -> dict[str, float]:
+ """Extract features for a single team.
+
+ Args:
+ team_name: Team name from odds data
+ kenpom_ratings: KenPom ratings DataFrame
+ kenpom_ff: KenPom four factors DataFrame
+ kenpom_height: KenPom height DataFrame
+ team_mapper: Team name mapper
+ prefix: Feature prefix (e.g., "home_" or "away_")
+
+ Returns:
+ Dictionary of features
+ """
+ features: dict[str, float] = {}
+
+ # Map team name to KenPom
+ if team_mapper:
+ kenpom_name = team_mapper.get_kenpom_name(team_name, source="odds_api")
+ else:
+ kenpom_name = team_name
+
+ # Get KenPom ratings (note: staging layer uses adj_o/adj_d/adj_t naming)
+ # For compatibility with KenPom API naming, support both conventions
+ team_ratings = kenpom_ratings[
+ kenpom_ratings.get("TeamName", kenpom_ratings.get("kenpom_name")) == kenpom_name
+ ]
+ if len(team_ratings) > 0:
+ rating = team_ratings.iloc[0]
+ features[f"{prefix}adj_em"] = rating.get("AdjEM", rating.get("adj_em", 0.0))
+ features[f"{prefix}pythag"] = rating.get("Pythag", rating.get("pythag", 0.0))
+ features[f"{prefix}adj_o"] = rating.get("AdjOE", rating.get("adj_o", 0.0))
+ features[f"{prefix}adj_d"] = rating.get("AdjDE", rating.get("adj_d", 0.0))
+ features[f"{prefix}adj_t"] = rating.get("AdjTempo", rating.get("adj_t", 0.0))
+ features[f"{prefix}luck"] = rating.get("Luck", rating.get("luck", 0.0))
+ features[f"{prefix}sos"] = rating.get("SOS", rating.get("sos", 0.0))
+ else:
+ logger.warning(f"No KenPom data for {kenpom_name} (from {team_name})")
+ features[f"{prefix}adj_em"] = 0.0
+ features[f"{prefix}pythag"] = 0.0
+ features[f"{prefix}adj_o"] = 0.0
+ features[f"{prefix}adj_d"] = 0.0
+ features[f"{prefix}adj_t"] = 0.0
+ features[f"{prefix}luck"] = 0.0
+ features[f"{prefix}sos"] = 0.0
+
+ # Get four factors
+ team_ff = kenpom_ff[kenpom_ff["TeamName"] == kenpom_name]
+ if len(team_ff) > 0:
+ ff = team_ff.iloc[0]
+ features[f"{prefix}efg_pct"] = ff.get("eFG_Pct", 0.0)
+ features[f"{prefix}to_pct"] = ff.get("TO_Pct", 0.0)
+ features[f"{prefix}or_pct"] = ff.get("OR_Pct", 0.0)
+ features[f"{prefix}ft_rate"] = ff.get("FT_Rate", 0.0)
+ features[f"{prefix}defg_pct"] = ff.get("DeFG_Pct", 0.0)
+ features[f"{prefix}dto_pct"] = ff.get("DTO_Pct", 0.0)
+ else:
+ features[f"{prefix}efg_pct"] = 0.0
+ features[f"{prefix}to_pct"] = 0.0
+ features[f"{prefix}or_pct"] = 0.0
+ features[f"{prefix}ft_rate"] = 0.0
+ features[f"{prefix}defg_pct"] = 0.0
+ features[f"{prefix}dto_pct"] = 0.0
+
+ # Get height
+ team_height = kenpom_height[kenpom_height["TeamName"] == kenpom_name]
+ if len(team_height) > 0:
+ height = team_height.iloc[0]
+ features[f"{prefix}height"] = height.get("HgtEff", 0.0)
+ else:
+ features[f"{prefix}height"] = 0.0
+
+ return features
+
+
+def build_prediction_features(
+ spreads: pd.DataFrame,
+ totals: pd.DataFrame,
+ kenpom_ratings: pd.DataFrame,
+ kenpom_ff: pd.DataFrame,
+ kenpom_height: pd.DataFrame,
+ team_mapper: TeamMapper | None,
+ odds_db_path: Path | None = None,
+ include_market_features: bool = True,
+ include_bookmaker_features: bool = True,
+) -> pd.DataFrame:
+ """Build feature matrix for predictions.
+
+ Args:
+ spreads: Spreads odds DataFrame (normalized)
+ totals: Totals odds DataFrame (normalized)
+ kenpom_ratings: KenPom ratings
+ kenpom_ff: KenPom four factors
+ kenpom_height: KenPom height data
+        team_mapper: Team name mapper
+        odds_db_path: Optional SQLite odds DB used for opening/closing line movement
+        include_market_features: Whether to include market-derived features
+        include_bookmaker_features: Whether to include per-bookmaker features
+
+ Returns:
+ DataFrame with features for each game
+ """
+ games = []
+
+ # Get unique games from spreads (use FanDuel as canonical bookmaker)
+ unique_games = spreads[spreads["bookmaker_key"] == "fanduel"].copy()
+
+ # Load rest features from staging events
+ staging_events_path = Path("data/staging/events.parquet")
+ rest_features_map = {}
+ if staging_events_path.exists():
+ events = read_parquet_df(str(staging_events_path))
+ # Index by event_id for fast lookup
+ rest_features_map = events.set_index("event_id")[
+ [
+ "home_rest_days",
+ "away_rest_days",
+ "home_back_to_back",
+ "away_back_to_back",
+ "home_short_rest",
+ "away_short_rest",
+ "away_road_streak",
+ "away_days_on_road",
+ ]
+ ].to_dict("index")
+ logger.info(f"Loaded rest features for {len(rest_features_map)} events")
+ else:
+ logger.warning(f"Staging events not found at {staging_events_path}")
+
+ odds_db: OddsAPIDatabase | None = None
+
+ if odds_db_path and odds_db_path.exists():
+ odds_db = OddsAPIDatabase(odds_db_path)
+
+ for _, game in unique_games.iterrows():
+ event_id = game["event_id"]
+ home_team = game["home_team"]
+ away_team = game["away_team"]
+ favorite_team = game["favorite_team"]
+ underdog_team = game["underdog_team"]
+ spread_magnitude = game["spread_magnitude"]
+
+ # Get KenPom features for favorite/underdog (for spreads model)
+ fav_features = get_team_features(
+ favorite_team, kenpom_ratings, kenpom_ff, kenpom_height, team_mapper, prefix="fav_"
+ )
+ dog_features = get_team_features(
+ underdog_team, kenpom_ratings, kenpom_ff, kenpom_height, team_mapper, prefix="dog_"
+ )
+
+ # Get KenPom features for home/away (for totals model)
+ home_features = get_team_features(
+ home_team, kenpom_ratings, kenpom_ff, kenpom_height, team_mapper, prefix="home_"
+ )
+ away_features = get_team_features(
+ away_team, kenpom_ratings, kenpom_ff, kenpom_height, team_mapper, prefix="away_"
+ )
+
+ # Get total for this game
+ game_totals = totals[
+ (totals["event_id"] == event_id) & (totals["bookmaker_key"] == "fanduel")
+ ]
+ total_points = game_totals.iloc[0]["total"] if len(game_totals) > 0 else None
+
+ # Spread line movement features
+ opening_spread = spread_magnitude
+ closing_spread = spread_magnitude
+ line_movement_points = 0.0
+
+ # Total line movement features
+ opening_total = total_points if total_points is not None else 0.0
+ closing_total = total_points if total_points is not None else 0.0
+ total_movement = 0.0
+
+ if odds_db:
+ spreads_db = odds_db.get_canonical_spreads(event_id=event_id, book_key="fanduel")
+ if len(spreads_db) > 0:
+ opening_spread = spreads_db.iloc[0]["spread_magnitude"]
+ closing_spread = spreads_db.iloc[-1]["spread_magnitude"]
+ line_movement_points = closing_spread - opening_spread
+
+ totals_db = odds_db.get_canonical_totals(event_id=event_id, book_key="fanduel")
+ if len(totals_db) > 0:
+ opening_total = totals_db.iloc[0]["total"]
+ closing_total = totals_db.iloc[-1]["total"]
+ total_movement = closing_total - opening_total
+
+ # Spreads derived features
+ fav_em = fav_features.get("fav_adj_em", 0.0)
+ dog_em = dog_features.get("dog_adj_em", 0.0)
+ em_diff = fav_em - dog_em
+
+ fav_pythag = fav_features.get("fav_pythag", 0.0)
+ dog_pythag = dog_features.get("dog_pythag", 0.0)
+ pythag_diff = fav_pythag - dog_pythag
+
+ fav_adj_o = fav_features.get("fav_adj_o", 0.0)
+ dog_adj_o = dog_features.get("dog_adj_o", 0.0)
+ adj_o_diff = fav_adj_o - dog_adj_o
+
+ fav_adj_d = fav_features.get("fav_adj_d", 0.0)
+ dog_adj_d = dog_features.get("dog_adj_d", 0.0)
+ adj_d_diff = fav_adj_d - dog_adj_d
+
+ fav_adj_t = fav_features.get("fav_adj_t", 0.0)
+ dog_adj_t = dog_features.get("dog_adj_t", 0.0)
+ tempo_diff = fav_adj_t - dog_adj_t
+
+ fav_luck = fav_features.get("fav_luck", 0.0)
+ dog_luck = dog_features.get("dog_luck", 0.0)
+ luck_diff = fav_luck - dog_luck
+
+ fav_sos = fav_features.get("fav_sos", 0.0)
+ dog_sos = dog_features.get("dog_sos", 0.0)
+ sos_diff = fav_sos - dog_sos
+
+ # Height differential (same for both spreads and totals, based on home vs away)
+ home_height = home_features.get("home_height", 0.0)
+ away_height = away_features.get("away_height", 0.0)
+ height_diff = home_height - away_height
+
+ # Totals derived features
+ home_luck = home_features.get("home_luck", 0.0)
+ away_luck = away_features.get("away_luck", 0.0)
+ avg_luck = (home_luck + away_luck) / 2.0
+
+ home_o = home_features.get("home_adj_o", 0.0)
+ away_o = away_features.get("away_adj_o", 0.0)
+ total_offense = home_o + away_o
+
+ home_t = home_features.get("home_adj_t", 0.0)
+ away_t = away_features.get("away_adj_t", 0.0)
+ avg_tempo = (home_t + away_t) / 2.0
+
+ # Expected points (KenPom formula)
+ home_d = home_features.get("home_adj_d", 0.0)
+ away_d = away_features.get("away_adj_d", 0.0)
+ home_expected_pts = (home_o * away_d / 100) * (home_t / 100)
+ away_expected_pts = (away_o * home_d / 100) * (away_t / 100)
+ expected_total = home_expected_pts + away_expected_pts
+
+ # Matchup interaction features (offense vs defense)
+ fav_offensive_efficiency = fav_adj_o * dog_adj_d / 100
+ dog_offensive_efficiency = dog_adj_o * fav_adj_d / 100
+ offensive_efficiency_diff = fav_offensive_efficiency - dog_offensive_efficiency
+ expected_margin = offensive_efficiency_diff * avg_tempo / 100
+
+ # Rest & situational features
+ rest_feats = rest_features_map.get(event_id, {})
+ home_rest_days = rest_feats.get("home_rest_days", 3)
+ away_rest_days = rest_feats.get("away_rest_days", 3)
+ home_back_to_back = rest_feats.get("home_back_to_back", False)
+ away_back_to_back = rest_feats.get("away_back_to_back", False)
+ home_short_rest = rest_feats.get("home_short_rest", False)
+ away_short_rest = rest_feats.get("away_short_rest", False)
+ away_road_streak = rest_feats.get("away_road_streak", 0)
+ away_days_on_road = rest_feats.get("away_days_on_road", 0)
+
+ # Map home/away rest features to favorite/underdog (for spreads model)
+ is_fav_home = favorite_team == home_team
+ fav_rest_days = home_rest_days if is_fav_home else away_rest_days
+ dog_rest_days = away_rest_days if is_fav_home else home_rest_days
+ fav_back_to_back = home_back_to_back if is_fav_home else away_back_to_back
+ dog_back_to_back = away_back_to_back if is_fav_home else home_back_to_back
+ fav_short_rest = home_short_rest if is_fav_home else away_short_rest
+ dog_short_rest = away_short_rest if is_fav_home else home_short_rest
+ fav_road_streak = 0 if is_fav_home else away_road_streak
+ dog_road_streak = away_road_streak if is_fav_home else 0
+ fav_days_on_road = 0 if is_fav_home else away_days_on_road
+ dog_days_on_road = away_days_on_road if is_fav_home else 0
+ rest_advantage = fav_rest_days - dog_rest_days
+
+ # Combined fatigue for totals
+ total_back_to_back = int(home_back_to_back or away_back_to_back)
+ total_short_rest = int(home_short_rest or away_short_rest)
+
+ # Combine all features (matching staging layer)
+ game_features = {
+ # Metadata (not used in models)
+ "event_id": event_id,
+ "commence_time": game["commence_time"],
+ "home_team": home_team,
+ "away_team": away_team,
+ "favorite_team": favorite_team,
+ "underdog_team": underdog_team,
+ "spread_magnitude": spread_magnitude,
+ "total_points": total_points,
+ # Spreads model features (32 features from staging layer)
+ **fav_features,
+ **dog_features,
+ "em_diff": em_diff,
+ "pythag_diff": pythag_diff,
+ "adj_o_diff": adj_o_diff,
+ "adj_d_diff": adj_d_diff,
+ "tempo_diff": tempo_diff,
+ "luck_diff": luck_diff,
+ "sos_diff": sos_diff,
+ "height_diff": height_diff,
+ "fav_offensive_efficiency": fav_offensive_efficiency,
+ "dog_offensive_efficiency": dog_offensive_efficiency,
+ "offensive_efficiency_diff": offensive_efficiency_diff,
+ "expected_margin": expected_margin,
+ "opening_spread": opening_spread,
+ "closing_spread": closing_spread,
+ "line_movement": line_movement_points,
+ # Rest & situational features (spreads model)
+ "fav_rest_days": fav_rest_days,
+ "dog_rest_days": dog_rest_days,
+ "fav_back_to_back": fav_back_to_back,
+ "dog_back_to_back": dog_back_to_back,
+ "fav_short_rest": fav_short_rest,
+ "dog_short_rest": dog_short_rest,
+ "fav_road_streak": fav_road_streak,
+ "dog_road_streak": dog_road_streak,
+ "fav_days_on_road": fav_days_on_road,
+ "dog_days_on_road": dog_days_on_road,
+ "rest_advantage": rest_advantage,
+ # Totals model features (45 features from staging layer)
+ **home_features,
+ **away_features,
+ "total_offense": total_offense,
+ "avg_tempo": avg_tempo,
+ "avg_luck": avg_luck,
+ "home_expected_pts": home_expected_pts,
+ "away_expected_pts": away_expected_pts,
+ "expected_total": expected_total,
+ "opening_total": opening_total,
+ "closing_total": closing_total,
+ "total_movement": total_movement,
+ # Rest & situational features (totals model)
+ "home_rest_days": home_rest_days,
+ "away_rest_days": away_rest_days,
+ "home_back_to_back": home_back_to_back,
+ "away_back_to_back": away_back_to_back,
+ "home_short_rest": home_short_rest,
+ "away_short_rest": away_short_rest,
+ "away_road_streak": away_road_streak,
+ "away_days_on_road": away_days_on_road,
+ "total_back_to_back": total_back_to_back,
+ "total_short_rest": total_short_rest,
+ }
+
+ games.append(game_features)
+
+ if odds_db:
+ odds_db.close()
+
+ return pd.DataFrame(games)
+
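The expected-points arithmetic above can be checked in isolation; `expected_points` is a hypothetical standalone helper mirroring the in-line formula (per-100-possession KenPom-style ratings, tempo treated as possessions), not part of the pipeline:

```python
def expected_points(adj_o: float, opp_adj_d: float, adj_t: float) -> float:
    """Expected points: offense scaled by opponent defense (per 100 possessions)
    times the team's possessions (tempo / 100), as in the formula above."""
    return (adj_o * opp_adj_d / 100.0) * (adj_t / 100.0)

# e.g. a 110 offense against a 100 defense at tempo 70 -> 77 expected points
pts = expected_points(110.0, 100.0, 70.0)
```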
+
+def make_predictions(
+ features: pd.DataFrame,
+ home_score_model: xgb.XGBRegressor,
+ away_score_model: xgb.XGBRegressor,
+ spreads_model: xgb.XGBClassifier | None = None,
+ totals_model: xgb.XGBClassifier | None = None,
+) -> pd.DataFrame:
+ """Generate predictions for today's games.
+
+ Predictions are based on score regression models. Spread and total probabilities
+ are mathematically derived from score predictions using prediction uncertainty.
+
+ Formula:
+ For spreads: P(cover) = CDF((predicted_margin - spread) / stddev)
+ For totals: P(over) = CDF((predicted_total - line) / stddev)
+
+ Where:
+ stddev = sqrt(home_MAE^2 + away_MAE^2) = 7.6 points
+ CDF = Cumulative distribution function of standard normal distribution
+
+ Args:
+ features: Feature matrix
+ home_score_model: Trained home score regression model
+ away_score_model: Trained away score regression model
+ spreads_model: (Deprecated) Not used
+ totals_model: (Deprecated) Not used
+
+ Returns:
+ DataFrame with predictions
+ """
+ # Metadata columns
+ metadata_cols = [
+ "event_id",
+ "commence_time",
+ "home_team",
+ "away_team",
+ "favorite_team",
+ "underdog_team",
+ "spread_magnitude",
+ "total_points",
+ ]
+
+ # Prepare features (home/away features for score models)
+ X_totals = features[TOTALS_FEATURES].fillna(0.0)
+
+ # Predict scores using regression models
+ home_scores = home_score_model.predict(X_totals)
+ away_scores = away_score_model.predict(X_totals)
+
+ # Build results DataFrame
+ results = features[metadata_cols].copy()
+ results["predicted_home_score"] = home_scores.round(1)
+ results["predicted_away_score"] = away_scores.round(1)
+ results["predicted_margin"] = (home_scores - away_scores).round(1)
+ results["predicted_total"] = (home_scores + away_scores).round(1)
+
+ # Calculate spread probabilities from score predictions
+ # Determine effective margin (positive if favorite wins)
+ effective_margins = []
+ for _, row in results.iterrows():
+ margin = row["predicted_margin"]
+ if row["favorite_team"] == row["home_team"]:
+ # Home is favorite, margin is already correct (+ = home wins)
+ effective_margins.append(margin)
+ else:
+ # Away is favorite, flip margin sign (+ = away wins)
+ effective_margins.append(-margin)
+
+ effective_margins = pd.Series(effective_margins)
+
+ # Calculate cushion: how much better than spread
+ spread_cushions = effective_margins - results["spread_magnitude"]
+
+ # Convert to probability using normal distribution
+ z_scores_spread = spread_cushions / COMBINED_STDDEV
+ spread_proba = z_scores_spread.apply(stats.norm.cdf)
+
+ results["favorite_cover_prob"] = spread_proba
+ results["underdog_cover_prob"] = 1 - spread_proba
+ results["spread_edge"] = spread_proba - 0.524 # 0.524 = 110/(110+100)
+
+ # Calculate total probabilities from score predictions
+ total_cushions = results["predicted_total"] - results["total_points"]
+ z_scores_total = total_cushions / COMBINED_STDDEV
+ total_proba = z_scores_total.apply(stats.norm.cdf)
+
+ results["over_prob"] = total_proba
+ results["under_prob"] = 1 - total_proba
+ results["total_edge"] = total_proba - 0.524
+
+ return results
+
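The docstring's derivation can be sketched standalone; this uses `math.erf` for the standard normal CDF to avoid the scipy dependency, and the 7.6-point stddev is the value quoted in the docstring:

```python
from math import erf, sqrt

def cover_prob(predicted_margin: float, spread: float, stddev: float = 7.6) -> float:
    # P(cover) = Phi((predicted_margin - spread) / stddev), Phi = standard normal CDF
    z = (predicted_margin - spread) / stddev
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# When the predicted margin exactly equals the spread, the cover probability is 50%,
# i.e. a negative edge against the ~52.4% break-even rate at -110 pricing.
p = cover_prob(7.0, 7.0)
edge = p - 0.524
```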
+
+def main() -> None:
+ """Generate predictions for today's games."""
+ parser = argparse.ArgumentParser(description="Generate daily NCAAB predictions")
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=None,
+ help="Target date (YYYY-MM-DD, default: today)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=None,
+ help="Output CSV path (default: data/outputs/predictions/YYYY-MM-DD.csv)",
+ )
+ parser.add_argument(
+ "--models-dir",
+ type=Path,
+ default=Path("models"),
+ help="Directory containing trained models",
+ )
+ parser.add_argument(
+ "--odds-dir",
+ type=Path,
+ default=Path("data/odds_api/daily"),
+ help="Directory containing daily odds files",
+ )
+ parser.add_argument(
+ "--kenpom-dir",
+ type=Path,
+ default=Path("data/kenpom"),
+ help="Directory containing KenPom data",
+ )
+ parser.add_argument(
+ "--odds-db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Odds API SQLite database for market/book features",
+ )
+ parser.add_argument(
+ "--include-market-features",
+ dest="include_market_features",
+ action="store_true",
+ default=True,
+ help="Include market (line movement) features",
+ )
+ parser.add_argument(
+ "--no-market-features",
+ dest="include_market_features",
+ action="store_false",
+ help="Disable market (line movement) features",
+ )
+ parser.add_argument(
+ "--include-bookmaker-features",
+ dest="include_bookmaker_features",
+ action="store_true",
+ default=True,
+ help="Include bookmaker divergence features",
+ )
+ parser.add_argument(
+ "--no-bookmaker-features",
+ dest="include_bookmaker_features",
+ action="store_false",
+ help="Disable bookmaker divergence features",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+ )
+ parser.add_argument(
+ "--min-edge",
+ type=float,
+ default=0.05,
+ help="Minimum edge to highlight (default: 5%%)",
+ )
+ parser.add_argument(
+ "--use-tuned",
+ dest="use_tuned",
+ action="store_true",
+ default=True,
+ help="Use tuned models (default: True)",
+ )
+ parser.add_argument(
+ "--use-baseline",
+ dest="use_tuned",
+ action="store_false",
+ help="Use baseline models instead of tuned models",
+ )
+
+ args = parser.parse_args()
+
+ # Determine target date
+ target_date = date.fromisoformat(args.date) if args.date else date.today()
+ logger.info(f"Generating predictions for {target_date}")
+
+ # Determine output path
+ if args.output:
+ output_path = args.output
+ else:
+        output_dir = Path("data/outputs/predictions")
+ output_dir.mkdir(parents=True, exist_ok=True)
+ output_path = output_dir / f"{target_date.isoformat()}.csv"
+
+ try:
+ # Load score models (REQUIRED - primary prediction source)
+ home_score_model, away_score_model = load_score_models(args.models_dir)
+
+ # Load classification models (DEPRECATED - not used for predictions)
+ # Kept for backward compatibility only
+ try:
+ spreads_model, totals_model = load_models(args.models_dir, args.use_tuned)
+ logger.warning(
+ "Classification models loaded but NOT USED. "
+ "All predictions derived from score models."
+ )
+ except FileNotFoundError:
+ spreads_model, totals_model = None, None
+ logger.info("Classification models not found (not needed)")
+
+ # Load today's odds
+ spreads, totals = load_today_odds(args.odds_dir, target_date)
+
+ # Load KenPom data
+ kenpom_ratings, kenpom_ff, kenpom_height = load_kenpom_data(args.kenpom_dir, args.season)
+
+ # Load team mapper
+ try:
+ team_mapper = TeamMapper()
+ except FileNotFoundError:
+ logger.warning("Team mapping not found, using direct name matching")
+ team_mapper = None
+
+ # Build features
+ logger.info("Building prediction features...")
+ features = build_prediction_features(
+ spreads,
+ totals,
+ kenpom_ratings,
+ kenpom_ff,
+ kenpom_height,
+ team_mapper,
+ odds_db_path=args.odds_db,
+ include_market_features=args.include_market_features,
+ include_bookmaker_features=args.include_bookmaker_features,
+ )
+ logger.info(f"Built features for {len(features)} games")
+
+ # Make predictions (using score models only)
+ logger.info("Generating predictions from score models...")
+ predictions = make_predictions(
+ features,
+ home_score_model,
+ away_score_model,
+ spreads_model, # Not used
+ totals_model, # Not used
+ )
+
+ # Save predictions
+ write_csv(predictions, str(output_path), index=False)
+ logger.info(f"[OK] Saved predictions to {output_path}")
+
+ # Display predictions with edge
+ logger.info("\n" + "=" * 80)
+ logger.info(f"PREDICTIONS FOR {target_date}")
+ logger.info("=" * 80)
+
+ for _, game in predictions.iterrows():
+ game_time = pd.to_datetime(game["commence_time"]).strftime("%I:%M %p ET")
+ logger.info(f"\n{game['away_team']} @ {game['home_team']} ({game_time})")
+
+ # Score predictions (if available)
+ if "predicted_home_score" in game and pd.notna(game["predicted_home_score"]):
+ logger.info(
+ f" Predicted Score: {game['home_team']} {game['predicted_home_score']:.0f}, "
+ f"{game['away_team']} {game['predicted_away_score']:.0f}"
+ )
+ logger.info(
+ f" Predicted Margin: {abs(game['predicted_margin']):.1f} "
+ f"({'Home' if game['predicted_margin'] > 0 else 'Away'})"
+ )
+ logger.info(f" Predicted Total: {game['predicted_total']:.0f}")
+
+ # Spread predictions
+ logger.info(f" Spread: {game['favorite_team']} -{game['spread_magnitude']}")
+ logger.info(
+ f" Favorite cover prob: {game['favorite_cover_prob']:.1%} "
+ f"(edge: {game['spread_edge']:+.1%})"
+ )
+
+ # Total predictions
+            # NaN-safe check: a None total becomes NaN once stored in the DataFrame
+            if pd.notna(game["total_points"]):
+ logger.info(f" Total: {game['total_points']}")
+ logger.info(
+ f" Over prob: {game['over_prob']:.1%} (edge: {game['total_edge']:+.1%})"
+ )
+
+ # Highlight best edges
+ logger.info("\n" + "=" * 80)
+ logger.info("BEST OPPORTUNITIES")
+ logger.info("=" * 80)
+
+ # Best spread edges
+ spread_cols = [
+ "favorite_team",
+ "spread_magnitude",
+ "favorite_cover_prob",
+ "spread_edge",
+ ]
+ spread_edges = predictions[spread_cols].copy()
+ spread_edges = spread_edges[spread_edges["spread_edge"].abs() >= args.min_edge]
+ spread_edges = spread_edges.sort_values("spread_edge", ascending=False, key=abs)
+
+ if len(spread_edges) > 0:
+ logger.info("\nSpread opportunities:")
+            for _, edge in spread_edges.iterrows():
+                # spread_edge is computed from the favorite side; a negative value
+                # only approximates the underdog's edge (the vig is not symmetric).
+ side = "Favorite" if edge["spread_edge"] > 0 else "Underdog"
+ logger.info(f" {side}: {edge['favorite_team']} (edge: {edge['spread_edge']:+.1%})")
+ else:
+ logger.info("\nNo significant spread edges found")
+
+ # Best total edges
+ total_cols = [
+ "home_team",
+ "away_team",
+ "total_points",
+ "over_prob",
+ "total_edge",
+ ]
+ total_edges = predictions[total_cols].copy()
+ total_edges = total_edges[total_edges["total_edge"].abs() >= args.min_edge]
+ total_edges = total_edges.sort_values("total_edge", ascending=False, key=abs)
+
+ if len(total_edges) > 0:
+ logger.info("\nTotal opportunities:")
+ for _, edge in total_edges.iterrows():
+ side = "Over" if edge["total_edge"] > 0 else "Under"
+ logger.info(
+ f" {side} {edge['total_points']}: "
+ f"{edge['home_team']} vs {edge['away_team']} "
+ f"(edge: {edge['total_edge']:+.1%})"
+ )
+ else:
+ logger.info("\nNo significant total edges found")
+
+ logger.info("\n" + "=" * 80)
+
+ except Exception as e:
+ logger.error(f"Prediction failed: {e}", exc_info=True)
+ raise SystemExit(1) from e
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/rebuild_team_mapping.py b/scripts/archive/2026-02-deprecated/rebuild_team_mapping.py
new file mode 100644
index 000000000..96bb7236e
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/rebuild_team_mapping.py
@@ -0,0 +1,638 @@
+"""Rebuild team name mapping with proper normalization.
+
+Handles common team name variations:
+1. St/St./Saint/State disambiguation
+2. Mascot name removal (ESPN adds mascots)
+3. Special character normalization (apostrophes, periods, hyphens)
+4. Abbreviation expansion (CSUN, CSU, etc.)
+5. Case-insensitive matching
+
+Usage:
+ uv run python scripts/rebuild_team_mapping.py
+
+ # Review matches before saving
+ uv run python scripts/rebuild_team_mapping.py --review
+
+ # Save to custom path
+ uv run python scripts/rebuild_team_mapping.py --output data/team_mapping_v2.parquet
+"""
+
+import argparse
+import json
+import logging
+import re
+import sqlite3
+from difflib import SequenceMatcher
+from pathlib import Path
+
+import pandas as pd
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def normalize_team_name(name: str, remove_mascot: bool = False) -> str:
+ """Normalize team name for matching.
+
+ Args:
+ name: Raw team name
+ remove_mascot: If True, remove common mascot words
+
+ Returns:
+ Normalized name (lowercase, no special chars, standardized abbreviations)
+ """
+ if not name:
+ return ""
+
+ # Convert to lowercase
+ normalized = name.lower()
+
+ # Remove possessive apostrophes: "st. john's" -> "st. johns"
+ normalized = normalized.replace("'s ", "s ").replace("'s", "s")
+
+ # Standardize St./Saint/State
+ # Rule: "st." or "st" at start of name = Saint, otherwise = State
+ # "st. john" -> "saint john", "kansas st" -> "kansas state"
+ if normalized.startswith("st. ") or normalized.startswith("st "):
+ normalized = "saint " + normalized[normalized.index(" ") + 1 :]
+ else:
+ # State abbreviations (not at start)
+ normalized = re.sub(r"\bst\.?\s", " state ", normalized)
+ normalized = re.sub(r"\bst\.?$", " state", normalized)
+
+ # Expand common abbreviations
+ abbreviations = {
+ r"\bcsun\b": "cal state northridge",
+ r"\bcsu\s": "cal state ",
+ r"\bcal\s": "california ",
+ r"\bunc\b": "north carolina",
+ r"\bu\.?c\.?\s": "university of california ",
+ r"\bucf\b": "central florida",
+ r"\busc\b": "southern california",
+ r"\busf\b": "south florida",
+ r"\buab\b": "alabama birmingham",
+ r"\butep\b": "texas el paso",
+ r"\butsa\b": "texas san antonio",
+ r"\bfiu\b": "florida international",
+ r"\bliu\b": "long island",
+ r"\bvcu\b": "virginia commonwealth",
+ r"\bsmu\b": "southern methodist",
+ r"\btcu\b": "texas christian",
+ r"\bbyu\b": "brigham young",
+ r"\blsu\b": "louisiana state",
+ r"\bnjit\b": "new jersey tech",
+ r"\bsiue\b": "southern illinois edwardsville",
+ r"\bumbc\b": "maryland baltimore county",
+ r"\bumkc\b": "missouri kansas city",
+ r"\bul\s": "louisiana ",
+ r"\bgw\b": "george washington",
+ r"\bappalachian\b": "app",
+ }
+
+ for abbr, expansion in abbreviations.items():
+ normalized = re.sub(abbr, expansion, normalized)
+
+ # Remove "university" and "college" (redundant institutional words)
+ normalized = re.sub(r"\buniversity\b", "", normalized)
+ normalized = re.sub(r"\bcollege\b", "", normalized)
+ normalized = re.sub(r"\s+", " ", normalized).strip()
+
+ # Remove common mascot words (if requested)
+ if remove_mascot:
+ mascots = [
+ # Compound mascots (must come before single-word mascots)
+ "black knights",
+ "golden lions",
+ "golden eagles",
+ "red wolves",
+ "blue hens",
+ "great danes",
+ "sea wolves",
+ "river hawks",
+ "mountain hawks",
+ "rainbow warriors",
+ "nittany lions",
+ "crimson tide",
+ "fighting illini",
+ "tar heels",
+ "blue devils",
+ "golden gophers",
+ "scarlet knights",
+ "demon deacons",
+ "yellow jackets",
+ "red raiders",
+ "horned frogs",
+ "runnin rebels",
+ "fighting camels",
+ "golden griffins",
+ "purple aces",
+ "big red",
+ "big green",
+ "red foxes",
+ "thundering herd",
+ "black bears",
+ "blue demons",
+ # Common animal mascots
+ "wildcats",
+ "bulldogs",
+ "eagles",
+ "tigers",
+ "bears",
+ "cougars",
+ "panthers",
+ "hawks",
+ "huskies",
+ "aggies",
+ "cardinals",
+ "pirates",
+ "terriers",
+ "warriors",
+ "raiders",
+ "spartans",
+ "trojans",
+ "badgers",
+ "wolverines",
+ "bruins",
+ "ducks",
+ "buffaloes",
+ "utes",
+ "falcons",
+ "owls",
+ "lions",
+ "rams",
+ "broncos",
+ "mustangs",
+ "jaguars",
+ "leopards",
+ "bobcats",
+ "grizzlies",
+ "phoenix",
+ "seawolves",
+ "seahawks",
+ "redhawks",
+ "blackbirds",
+ "roadrunners",
+ "retrievers",
+ "greyhounds",
+ "peacocks",
+ "penguins",
+ "griffins",
+ "mastodons",
+ "bison",
+ "antelopes",
+ "jackrabbits",
+ "bearcats",
+ "braves",
+ "bulls",
+ "bluejays",
+ "dragons",
+ "dukes",
+ "gators",
+ "vandals",
+ "flames",
+ "lancers",
+ "jaspers",
+ "terrapins",
+ "ospreys",
+ "bengals",
+ "paladins",
+ # Unique/creative mascots
+ "volunteers",
+ "mountaineers",
+ "orange",
+ "hoosiers",
+ "jayhawks",
+ "sooners",
+ "buckeyes",
+ "boilermakers",
+ "cornhuskers",
+ "razorbacks",
+ "gamecocks",
+ "seminoles",
+ "cavaliers",
+ "hokies",
+ "hurricanes",
+ "longhorns",
+ "cyclones",
+ "cowboys",
+ "sun devils",
+ "matadors",
+ "revolutionaries",
+ "zips",
+ "flyers",
+ "crimson",
+ "minutemen",
+ "racers",
+ "miners",
+ "hilltoppers",
+ "governors",
+ "colonels",
+ "privateers",
+ "explorers",
+ "navigators",
+ "crusaders",
+ "friars",
+ "musketeers",
+ "rebels",
+ "ramblers",
+ "redbirds",
+ "royals",
+ "statesmen",
+ "vikings",
+ "gaels",
+ "saints",
+ "pioneers",
+ "knights",
+ "49ers",
+ "lightning",
+ "thunder",
+ "fire",
+ "pride",
+ "lumberjacks",
+ "bearkats",
+ "chanticleers",
+ "catamounts",
+ "blazers",
+ "wave",
+ ]
+
+ for mascot in mascots:
+ normalized = normalized.replace(f" {mascot}", "")
+
+ # Remove special characters and extra spaces
+ normalized = re.sub(r"[^a-z0-9\s]", " ", normalized)
+ normalized = re.sub(r"\s+", " ", normalized).strip()
+
+ return normalized
+
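The St./Saint/State rule documented above can be illustrated in isolation; `disambiguate_st` is a hypothetical, trimmed-down sketch of just that branch, not the full normalizer:

```python
import re

def disambiguate_st(name: str) -> str:
    # Leading "St."/"St" means Saint; anywhere else it means State.
    n = name.lower()
    if n.startswith(("st. ", "st ")):
        return "saint " + n[n.index(" ") + 1:]
    n = re.sub(r"\bst\.?\s", " state ", n)
    n = re.sub(r"\bst\.?$", " state", n)
    return re.sub(r"\s+", " ", n).strip()
```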
+
+def similarity_score(name1: str, name2: str) -> float:
+ """Calculate similarity between two team names.
+
+ Args:
+ name1: First team name (normalized)
+ name2: Second team name (normalized)
+
+ Returns:
+ Similarity score (0.0 to 1.0)
+ """
+ return SequenceMatcher(None, name1, name2).ratio()
+
+
+def find_best_match(
+ target_name: str,
+ candidates: list[tuple[str, str]],
+ threshold: float = 0.75,
+) -> tuple[str, float] | None:
+ """Find best matching team name from candidates.
+
+ Args:
+ target_name: Team name to match (from source A)
+ candidates: List of (original_name, normalized_name) tuples (from source B)
+ threshold: Minimum similarity score to accept
+
+ Returns:
+ (matched_name, score) or None if no good match
+ """
+ target_normalized = normalize_team_name(target_name)
+ target_no_mascot = normalize_team_name(target_name, remove_mascot=True)
+
+ best_match = None
+ best_score = 0.0
+
+ for original, normalized in candidates:
+ # Try exact match first
+ if target_normalized == normalized:
+ return (original, 1.0)
+
+ # Try without mascot
+ normalized_no_mascot = normalize_team_name(original, remove_mascot=True)
+ if target_no_mascot == normalized_no_mascot:
+ return (original, 0.95)
+
+ # Calculate similarity
+ score1 = similarity_score(target_normalized, normalized)
+ score2 = similarity_score(target_no_mascot, normalized_no_mascot)
+ score = max(score1, score2)
+
+ if score > best_score:
+ best_score = score
+ best_match = original
+
+ if best_score >= threshold:
+ return (best_match, best_score)
+
+ return None
+
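For a rough feel of the fuzzy-match scores, `SequenceMatcher` ratios for near-miss spellings land well above the 0.75 default threshold while unrelated names fall well below it (example names are illustrative):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Same measure as similarity_score above: 2 * matched chars / total length.
    return SequenceMatcher(None, a, b).ratio()

close = similarity("saint johns", "st johns")    # near-miss spelling, high ratio
far = similarity("saint johns", "kansas state")  # unrelated teams, low ratio
```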
+
+def load_kenpom_teams(kenpom_path: Path) -> pd.DataFrame:
+ """Load KenPom team names.
+
+ Args:
+ kenpom_path: Path to KenPom data directory
+
+ Returns:
+ DataFrame with TeamName column
+ """
+ ratings_file = kenpom_path / "ratings" / "season" / "ratings_2026.parquet"
+ if not ratings_file.exists():
+ raise FileNotFoundError(f"KenPom ratings not found: {ratings_file}")
+
+ df = pd.read_parquet(ratings_file)
+ return df[["TeamName"]].drop_duplicates().sort_values("TeamName")
+
+
+def load_odds_api_teams(db_path: Path) -> pd.DataFrame:
+ """Load Odds API team names from database.
+
+ Args:
+ db_path: Path to Odds API database
+
+ Returns:
+ DataFrame with team_name column
+ """
+ conn = sqlite3.connect(str(db_path))
+ try:
+ query = """
+ SELECT DISTINCT home_team as team_name FROM events
+ UNION
+ SELECT DISTINCT away_team as team_name FROM events
+ ORDER BY team_name
+ """
+ return pd.read_sql_query(query, conn)
+ finally:
+ conn.close()
+
+
+def load_espn_teams(team_logos_dir: Path) -> pd.DataFrame:
+ """Load ESPN team names from team logo filenames.
+
+ Args:
+ team_logos_dir: Path to ESPN team logos directory
+
+ Returns:
+ DataFrame with espn_name column
+ """
+ if not team_logos_dir.exists():
+ logger.warning(f"ESPN team logos directory not found: {team_logos_dir}")
+ return pd.DataFrame(columns=["espn_name"])
+
+ # Extract team names from logo filenames
+ # Format: "{team-slug}-{mascot}.png" -> "Team Name Mascot"
+ espn_names = []
+ for logo_file in team_logos_dir.glob("*.png"):
+ # Remove .png extension
+ filename = logo_file.stem
+
+ # Convert kebab-case to Title Case
+ # "abilene-christian-wildcats" -> "Abilene Christian Wildcats"
+ team_name = filename.replace("-", " ").title()
+ espn_names.append(team_name)
+
+ logger.info(f" Extracted {len(espn_names)} team names from logo filenames")
+ return pd.DataFrame({"espn_name": sorted(espn_names)})
+
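The filename-to-name conversion above reduces to a stem transform; a minimal sketch using the example from the in-line comment:

```python
# Hypothetical logo filename stem -> display name, as in load_espn_teams above
filename = "abilene-christian-wildcats"  # stem of "abilene-christian-wildcats.png"
team_name = filename.replace("-", " ").title()
```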
+
+def load_manual_overrides(overrides_path: Path) -> dict[str, dict[str, str]]:
+ """Load manual team name mapping overrides.
+
+ Args:
+ overrides_path: Path to JSON file with manual overrides
+
+ Returns:
+ Dictionary mapping kenpom_name to {odds_api_name, espn_name}
+ """
+ if not overrides_path.exists():
+ logger.warning(f"No overrides file found at {overrides_path}")
+ return {}
+
+ with open(overrides_path, encoding="utf-8") as f:
+ data = json.load(f)
+
+ overrides = data.get("overrides", {})
+ logger.info(f"Loaded {len(overrides)} manual overrides")
+ return overrides
+
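A minimal sketch of the overrides file shape this function expects; the team names are hypothetical examples, not entries from the real overrides file:

```python
# In-memory stand-in for the JSON consumed by load_manual_overrides above
example = {
    "overrides": {
        "Saint Mary's": {  # keyed by kenpom_name
            "odds_api_name": "Saint Mary's Gaels",
            "espn_name": "Saint Marys Gaels",
        }
    }
}
overrides = example.get("overrides", {})
```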
+
+def build_team_mapping(
+ kenpom_df: pd.DataFrame,
+ odds_api_df: pd.DataFrame,
+ espn_df: pd.DataFrame,
+ manual_overrides: dict[str, dict[str, str]] | None = None,
+ threshold: float = 0.75,
+) -> pd.DataFrame:
+ """Build comprehensive team name mapping.
+
+ Args:
+ kenpom_df: KenPom teams (TeamName column)
+ odds_api_df: Odds API teams (team_name column)
+ espn_df: ESPN teams (espn_name column)
+ manual_overrides: Optional manual mappings to apply first
+ threshold: Minimum similarity score for matching
+
+ Returns:
+ DataFrame with kenpom_name, odds_api_name, espn_name, match scores
+ """
+ kenpom_teams = kenpom_df["TeamName"].tolist()
+
+ # Prepare candidate lists with normalized names
+ odds_candidates = [
+ (name, normalize_team_name(name)) for name in odds_api_df["team_name"].tolist()
+ ]
+ espn_candidates = [(name, normalize_team_name(name)) for name in espn_df["espn_name"].tolist()]
+
+ if manual_overrides is None:
+ manual_overrides = {}
+
+ mappings = []
+
+ for kenpom_name in kenpom_teams:
+ logger.debug(f"Matching: {kenpom_name}")
+
+ # Check manual overrides first
+ if kenpom_name in manual_overrides:
+ override = manual_overrides[kenpom_name]
+ odds_name = override.get("odds_api_name", "")
+ espn_name = override.get("espn_name", "")
+ odds_score = 1.0 if odds_name else 0.0
+ espn_score = 1.0 if espn_name else 0.0
+ logger.debug(f" Applied manual override: odds={odds_name}, espn={espn_name}")
+ else:
+ # Find best Odds API match
+ odds_match = find_best_match(kenpom_name, odds_candidates, threshold)
+ odds_name = odds_match[0] if odds_match else ""
+ odds_score = odds_match[1] if odds_match else 0.0
+
+ # Find best ESPN match
+ espn_match = find_best_match(kenpom_name, espn_candidates, threshold)
+ espn_name = espn_match[0] if espn_match else ""
+ espn_score = espn_match[1] if espn_match else 0.0
+
+ mappings.append(
+ {
+ "kenpom_name": kenpom_name,
+ "odds_api_name": odds_name,
+ "odds_api_score": round(odds_score, 3),
+ "espn_name": espn_name,
+ "espn_score": round(espn_score, 3),
+ }
+ )
+
+ if odds_name:
+ logger.debug(f" Odds API: {odds_name} (score: {odds_score:.3f})")
+ if espn_name:
+ logger.debug(f" ESPN: {espn_name} (score: {espn_score:.3f})")
+
+ return pd.DataFrame(mappings)
+
+
+def main() -> None:
+ """Rebuild team mapping with proper normalization."""
+ parser = argparse.ArgumentParser(description="Rebuild team name mapping")
+ parser.add_argument(
+ "--kenpom-path",
+ type=Path,
+ default=Path("data/kenpom"),
+ help="Path to KenPom data directory",
+ )
+ parser.add_argument(
+ "--odds-db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to Odds API database",
+ )
+ parser.add_argument(
+ "--espn-logos",
+ type=Path,
+ default=Path("data/espn/team_logos"),
+ help="Path to ESPN team logos directory",
+ )
+ parser.add_argument(
+ "--overrides",
+ type=Path,
+ default=Path("data/staging/mappings/team_mapping_overrides.json"),
+ help="Path to manual overrides JSON file",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("data/team_mapping_fixed.parquet"),
+ help="Output path for new mapping",
+ )
+ parser.add_argument(
+ "--threshold",
+ type=float,
+ default=0.75,
+ help="Minimum similarity score for matching (0.0-1.0)",
+ )
+ parser.add_argument(
+ "--review",
+ action="store_true",
+ help="Review matches before saving",
+ )
+
+ args = parser.parse_args()
+
+ logger.info("=" * 80)
+ logger.info("REBUILDING TEAM NAME MAPPING")
+ logger.info("=" * 80)
+
+ # Load team lists
+    logger.info("\n[1/5] Loading team names from sources...")
+ kenpom_df = load_kenpom_teams(args.kenpom_path)
+ logger.info(f" KenPom: {len(kenpom_df)} teams")
+
+ odds_api_df = load_odds_api_teams(args.odds_db)
+ logger.info(f" Odds API: {len(odds_api_df)} teams")
+
+ espn_df = load_espn_teams(args.espn_logos)
+ logger.info(f" ESPN: {len(espn_df)} teams")
+
+ # Load manual overrides
+ logger.info("\n[2/5] Loading manual overrides...")
+ manual_overrides = load_manual_overrides(args.overrides)
+
+ # Build mapping
+ logger.info(f"\n[3/5] Building mappings (threshold: {args.threshold})...")
+ mapping_df = build_team_mapping(
+ kenpom_df, odds_api_df, espn_df, manual_overrides, args.threshold
+ )
+
+ # Statistics
+ logger.info("\n[4/5] Mapping statistics:")
+ odds_matched = (mapping_df["odds_api_name"] != "").sum()
+ espn_matched = (mapping_df["espn_name"] != "").sum()
+
+ odds_pct = odds_matched / len(mapping_df) * 100
+ espn_pct = espn_matched / len(mapping_df) * 100
+ logger.info(f" Odds API matches: {odds_matched}/{len(mapping_df)} ({odds_pct:.1f}%)")
+ logger.info(f" ESPN matches: {espn_matched}/{len(mapping_df)} ({espn_pct:.1f}%)")
+
+ # Show unmatched teams
+ odds_unmatched = mapping_df[mapping_df["odds_api_name"] == ""]
+ if len(odds_unmatched) > 0:
+ logger.warning(f"\n Unmatched Odds API teams ({len(odds_unmatched)}):")
+ for _, row in odds_unmatched.head(10).iterrows():
+ logger.warning(f" - {row['kenpom_name']}")
+
+ espn_unmatched = mapping_df[mapping_df["espn_name"] == ""]
+ if len(espn_unmatched) > 0:
+ logger.warning(f"\n Unmatched ESPN teams ({len(espn_unmatched)}):")
+ for _, row in espn_unmatched.head(10).iterrows():
+ logger.warning(f" - {row['kenpom_name']}")
+
+ # Review mode
+ if args.review:
+ logger.info("\n[REVIEW MODE] Sample of low-confidence matches:")
+ low_confidence = mapping_df[
+ ((mapping_df["odds_api_score"] > 0) & (mapping_df["odds_api_score"] < 0.9))
+ | ((mapping_df["espn_score"] > 0) & (mapping_df["espn_score"] < 0.9))
+ ]
+
+ for _, row in low_confidence.head(20).iterrows():
+ logger.info(f"\n KenPom: {row['kenpom_name']}")
+ if row["odds_api_name"]:
+ logger.info(
+ f" → Odds API: {row['odds_api_name']} (score: {row['odds_api_score']})"
+ )
+ if row["espn_name"]:
+ logger.info(f" → ESPN: {row['espn_name']} (score: {row['espn_score']})")
+
+ confirm = input("\nProceed with saving? (y/n): ").strip().lower()
+ if confirm != "y":
+ logger.info("Aborted by user")
+ return
+
+ # Save
+ logger.info(f"\n[5/5] Saving mapping to {args.output}...")
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+
+ # Backup old mapping if it exists
+ if args.output.exists():
+ backup_path = args.output.with_suffix(".backup.parquet")
+ import shutil
+
+ shutil.copy(args.output, backup_path)
+ logger.info(f" Backed up old mapping to {backup_path}")
+
+ mapping_df.to_parquet(args.output, index=False)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("[OK] Team mapping rebuilt successfully!")
+ logger.info(f"Saved to: {args.output}")
+ logger.info(f"Total teams: {len(mapping_df)}")
+ odds_coverage_pct = odds_matched / len(mapping_df) * 100
+ espn_coverage_pct = espn_matched / len(mapping_df) * 100
+ logger.info(f"Odds API coverage: {odds_matched}/{len(mapping_df)} ({odds_coverage_pct:.1f}%)")
+ logger.info(f"ESPN coverage: {espn_matched}/{len(mapping_df)} ({espn_coverage_pct:.1f}%)")
+ logger.info("=" * 80)
+
+
+if __name__ == "__main__":
+ main()
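The `build_team_mapping` call above matches KenPom names against Odds API and ESPN names with a score threshold. The matcher itself is defined earlier in the file (not shown here); as a rough sketch of the idea, with a hypothetical helper name and stdlib `difflib` scoring standing in for whatever similarity metric the script actually uses:

```python
from difflib import SequenceMatcher


def best_match(name: str, candidates: list[str], threshold: float = 0.9) -> tuple[str, float]:
    """Return the highest-scoring candidate and its score, or ("", score) below threshold.

    Hypothetical helper; the real script may use a different similarity metric.
    """
    best, best_score = "", 0.0
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best, best_score) if best_score >= threshold else ("", best_score)


# A lowered threshold trades fewer unmatched teams for more low-confidence matches,
# which is why the script's --review mode surfaces scores below 0.9 for inspection.
match, score = best_match(
    "Houston Christian", ["Houston Christian Huskies", "Houston Cougars"], threshold=0.7
)
print(match)  # → Houston Christian Huskies
```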
diff --git a/scripts/archive/2026-02-deprecated/save_overtime_snapshot.py b/scripts/archive/2026-02-deprecated/save_overtime_snapshot.py
new file mode 100644
index 000000000..cd895d83f
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/save_overtime_snapshot.py
@@ -0,0 +1,133 @@
+"""Save overtime.ag snapshot from JSON data.
+
+This script takes JSON data (extracted manually or via browser automation)
+and saves it to parquet files.
+
+Usage:
+    uv run python scripts/save_overtime_snapshot.py <json_file>
+
+Example JSON structure:
+{
+ "account_balance": {
+ "balance": "1821.01",
+ "credit_limit": "10000.00",
+ "pending": "800.00",
+ "available_balance": "11021.01",
+ "casino_balance": "0.00"
+ },
+ "open_bets": [
+ {
+ "ticket_number": "133002387-1",
+ "date": "MON 2/2",
+ "time": "7:17 PM",
+ "bet_type": "Spread",
+ "details": "Basketball - 306513 Houston Christian +9½ -120 for Game",
+ "risk": "120.00",
+ "to_win": "100.00"
+ }
+ ],
+ "daily_figures": {
+ "current_week": {
+ "starting_date": "02/02/2026",
+ "starting_balance": "1941.01",
+ "monday": "-120.00",
+ "tuesday": "0.00",
+ "wednesday": "0.00",
+ "thursday": "0.00",
+ "friday": "0.00",
+ "saturday": "0.00",
+ "sunday": "0.00",
+ "week_total": "-120.00",
+ "payments": "0.00",
+ "balance": "1821.01"
+ },
+ "last_week": { ... }
+ }
+}
+"""
+
+import asyncio
+import json
+import logging
+import sys
+from pathlib import Path
+
+# Add src to path
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+
+from sports_betting_edge.adapters.overtime_scraper import OvertimeScraperAdapter
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.services.overtime_tracker import OvertimeTrackerService
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+async def main():
+ """Main execution function."""
+ if len(sys.argv) < 2:
+ print("Usage: uv run python scripts/save_overtime_snapshot.py ")
+ print("\nExample:")
+ print(" uv run python scripts/save_overtime_snapshot.py data/overtime_snapshot.json")
+ sys.exit(1)
+
+ json_file = Path(sys.argv[1])
+
+ if not json_file.exists():
+ print(f"[ERROR] File not found: {json_file}")
+ sys.exit(1)
+
+ logger.info(f"Loading data from {json_file}")
+
+ with open(json_file) as f:
+ data = json.load(f)
+
+ # Setup service
+ project_root = Path(__file__).parent.parent
+ data_dir = project_root / "data" / "overtime"
+
+ class DummyClient:
+ pass
+
+ scraper = OvertimeScraperAdapter(DummyClient())
+ service = OvertimeTrackerService(scraper, data_dir)
+
+ # Parse the JSON data using the scraper adapter
+ logger.info("Parsing data...")
+
+ balance_data = {
+ "balance": f"${data['account_balance']['balance']}",
+ "credit_limit": f"${data['account_balance']['credit_limit']}",
+ "pending": f"${data['account_balance']['pending']}",
+ "available_balance": f"${data['account_balance']['available_balance']}",
+ "casino_balance": f"${data['account_balance'].get('casino_balance', '0.00')}",
+ }
+
+ open_bets_data = data.get("open_bets", [])
+ daily_figures_data = data.get("daily_figures", {})
+
+ # Create snapshot
+ snapshot = await scraper.scrape_full_snapshot(
+ balance_data,
+ open_bets_data,
+ daily_figures_data,
+ )
+
+ # Save snapshot
+ logger.info("Saving to parquet files...")
+ saved_paths = await service.save_full_snapshot(snapshot)
+
+ print("\n" + "=" * 80)
+ print("[OK] SNAPSHOT SAVED SUCCESSFULLY")
+ print("=" * 80)
+ print(f"\nAccount Balance: ${snapshot.account_balance.balance}")
+ print(f"Open Bets: {len(snapshot.open_bets.bets)} totaling ${snapshot.open_bets.total_risk}")
+ print(f"Current Week P&L: ${snapshot.daily_figures.current_week.week_total}")
+ print("\nFiles saved:")
+ for snap_type, path in saved_paths.items():
+ print(f" - {snap_type}: {path}")
+ print("=" * 80)
+
+
+if __name__ == "__main__":
+ asyncio.run(main())
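The balance fields above get a `$` prefix via f-strings, which would double the prefix if the input JSON already contained one (the example JSON uses bare numbers, while the tracker script's placeholder data uses `$`-prefixed strings). A defensive normalizer, sketched here with a hypothetical helper name not present in the script:

```python
def to_dollar_str(value: str) -> str:
    """Normalize a monetary string to a $-prefixed, comma-grouped form.

    Accepts "1821.01", "$1,821.01", "10000.00", etc. Hypothetical helper.
    """
    amount = float(str(value).replace("$", "").replace(",", ""))
    return f"${amount:,.2f}"


print(to_dollar_str("1821.01"))    # → $1,821.01
print(to_dollar_str("$1,821.01"))  # → $1,821.01
```

With this in place, the script would not need to care whether the JSON source stored raw numbers or display strings.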
diff --git a/scripts/archive/2026-02-deprecated/test_simple.py b/scripts/archive/2026-02-deprecated/test_simple.py
new file mode 100644
index 000000000..67f4ca610
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/test_simple.py
@@ -0,0 +1,99 @@
+"""Simple test of Odds API database without complex views."""
+
+import sqlite3
+
+import pandas as pd
+
+# Connect fresh
+conn = sqlite3.connect("data/odds_api/odds_api.sqlite3")
+
+# Test 1: Query raw observations
+print("=== Test 1: Raw Observations ===")
+result = conn.execute("""
+ SELECT COUNT(*) as total,
+ COUNT(DISTINCT event_id) as games,
+ MIN(as_of) as first,
+ MAX(as_of) as last
+ FROM observations
+""").fetchone()
+
+print(f"Total observations: {result[0]:,}")
+print(f"Unique games: {result[1]:,}")
+print(f"Date range: {result[2]} to {result[3]}")
+
+# Test 2: Manual canonical spreads (no view)
+print("\n=== Test 2: Manual Canonical Spreads ===")
+spreads = pd.read_sql(
+ """
+ SELECT
+ event_id,
+ book_key,
+ as_of,
+ MAX(CASE WHEN point < 0 THEN outcome_name END) as favorite_team,
+ MAX(CASE WHEN point > 0 THEN outcome_name END) as underdog_team,
+ ABS(MAX(point)) as spread_magnitude,
+ MAX(CASE WHEN point < 0 THEN price_american END) as favorite_price,
+ MAX(CASE WHEN point > 0 THEN price_american END) as underdog_price
+ FROM observations
+ WHERE market_key = 'spreads'
+ AND point IS NOT NULL
+ GROUP BY event_id, book_key, as_of, ABS(point)
+ LIMIT 5
+""",
+ conn,
+)
+
+print(f"Sample canonical spreads ({len(spreads)} rows):")
+spread_cols = ["event_id", "favorite_team", "underdog_team", "spread_magnitude"]
+print(spreads[spread_cols].to_string(index=False))
+
+# Test 3: Events with scores
+print("\n=== Test 3: Events with Scores ===")
+events = pd.read_sql(
+ """
+ SELECT e.event_id, e.home_team, e.away_team, s.home_score, s.away_score
+ FROM events e
+ INNER JOIN scores s ON e.event_id = s.event_id
+ WHERE s.home_score IS NOT NULL
+ LIMIT 5
+""",
+ conn,
+)
+
+print(f"Found {len(events)} games with scores:")
+for _, row in events.iterrows():
+ away = row["away_team"]
+ home = row["home_team"]
+ score = f"{row['away_score']}-{row['home_score']}"
+ print(f" {away:30s} @ {home:30s} ({score})")
+
+# Test 4: Line movement (manual)
+print("\n=== Test 4: Line Movement for One Game ===")
+sample_event = events.iloc[0]["event_id"]
+
+movements = pd.read_sql(
+    """
+    SELECT
+        as_of,
+        book_key,
+        MAX(CASE WHEN point < 0 THEN outcome_name END) as favorite,
+        ABS(MAX(point)) as spread
+    FROM observations
+    WHERE event_id = ?
+    AND market_key = 'spreads'
+    AND book_key = 'fanduel'
+    GROUP BY as_of, ABS(point)
+    ORDER BY as_of
+""",
+    conn,
+    params=(sample_event,),
+)
+
+if len(movements) > 0:
+ print(f"FanDuel line movement for {sample_event[:20]}...:")
+ print(f" Opening: {movements.iloc[0]['favorite']} {movements.iloc[0]['spread']}")
+ print(f" Closing: {movements.iloc[-1]['favorite']} {movements.iloc[-1]['spread']}")
+ print(f" Total observations: {len(movements)}")
+
+conn.close()
+
+print("\n[OK] Simple tests passed!")
diff --git a/scripts/archive/2026-02-deprecated/track_overtime_daily.py b/scripts/archive/2026-02-deprecated/track_overtime_daily.py
new file mode 100644
index 000000000..8390fa5a2
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/track_overtime_daily.py
@@ -0,0 +1,238 @@
+"""Script to track overtime.ag betting data daily.
+
+This script uses chrome-devtools MCP to scrape betting data from overtime.ag
+and save it to parquet files for analysis.
+
+Usage:
+ uv run python scripts/track_overtime_daily.py
+"""
+
+import asyncio
+import logging
+import sys
+from pathlib import Path
+
+# Add src to path
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+class ChromeDevToolsClient:
+ """Wrapper for chrome-devtools MCP operations."""
+
+ def __init__(self, mcp_tools: dict):
+ """Initialize with MCP tools dictionary."""
+ self.tools = mcp_tools
+
+ async def navigate(self, url: str) -> dict:
+ """Navigate to a URL."""
+ return await self.tools["navigate_page"](url=url, type="url")
+
+ async def take_snapshot(self) -> dict:
+ """Take a page snapshot."""
+ return await self.tools["take_snapshot"]()
+
+ async def click(self, uid: str) -> dict:
+ """Click an element."""
+ return await self.tools["click"](uid=uid, includeSnapshot=True)
+
+ async def wait_for(self, text: str, timeout: int = 5000) -> dict:
+ """Wait for text to appear."""
+ return await self.tools["wait_for"](text=text, timeout=timeout)
+
+
+async def extract_account_balance_from_menu(snapshot: dict) -> dict[str, str]:
+ """Extract account balance from dropdown menu snapshot.
+
+ Args:
+ snapshot: Page snapshot dict with menu visible
+
+ Returns:
+ Dict with balance, credit_limit, pending, available_balance, casino_balance
+ """
+ # Parse the snapshot text to find balance information
+    # This is a simplified version; in production this would parse the actual snapshot structure
+ balance_data = {}
+
+ # Look for patterns like "BALANCE", "$1,821.01", etc.
+ # The snapshot should have these as static text elements in sequence
+ text_elements = []
+
+ def extract_text(node: dict):
+ if isinstance(node, dict):
+ if "StaticText" in node.get("role", "") or node.get("name"):
+ text_elements.append(node.get("name", ""))
+ for child in node.get("children", []):
+ extract_text(child)
+
+ extract_text(snapshot)
+
+ # Find balance values
+ for i, text in enumerate(text_elements):
+ if text == "BALANCE" and i + 1 < len(text_elements):
+ balance_data["balance"] = text_elements[i + 1]
+ elif text == "CR. LIMIT" and i + 1 < len(text_elements):
+ balance_data["credit_limit"] = text_elements[i + 1]
+ elif text == "PENDING" and i + 1 < len(text_elements):
+ balance_data["pending"] = text_elements[i + 1]
+ elif text == "AVAIL BAL" and i + 1 < len(text_elements):
+ balance_data["available_balance"] = text_elements[i + 1]
+ elif text == "NP CASINO" and i + 1 < len(text_elements):
+ balance_data["casino_balance"] = text_elements[i + 1]
+
+ return balance_data
+
+
+async def extract_open_bets_from_snapshot(snapshot: dict) -> list[dict]:
+ """Extract open bets from open bets page snapshot.
+
+ Args:
+ snapshot: Page snapshot dict from open bets page
+
+ Returns:
+ List of bet dicts with ticket_number, date, time, bet_type, details, risk, to_win
+ """
+    # This is a simplified parser; a full implementation would walk the snapshot tree
+    # For now, it returns an empty list matching the row schema in the docstring
+ bets = []
+
+ # The snapshot should have a table structure with headers:
+ # TIK#, ACCEPTED DATE, TYPE, DETAILS, RISK, WIN
+ # We need to parse rows of data
+
+ # In a real implementation, we'd walk the snapshot tree to find table rows
+ # For now, this is a placeholder that shows the expected structure
+ logger.warning(
+ "Open bets extraction requires manual implementation based on snapshot structure"
+ )
+
+ return bets
+
+
+async def extract_daily_figures_from_snapshot(snapshot: dict) -> dict:
+ """Extract daily figures from daily figures page snapshot.
+
+ Args:
+ snapshot: Page snapshot dict from daily figures page
+
+ Returns:
+ Dict with current_week, last_week, past_weeks data
+ """
+    # This is a simplified parser; a full implementation would parse the snapshot structure
+ figures_data = {
+ "current_week": {},
+ "last_week": {},
+ "past_weeks": [],
+ }
+
+ logger.warning(
+ "Daily figures extraction requires manual implementation based on snapshot structure"
+ )
+
+ return figures_data
+
+
+async def scrape_overtime_data(
+ chrome_client: ChromeDevToolsClient,
+) -> tuple[dict, list[dict], dict]:
+ """Scrape all overtime.ag data.
+
+ Args:
+ chrome_client: Chrome DevTools MCP client
+
+ Returns:
+ Tuple of (balance_data, open_bets_data, daily_figures_data)
+ """
+ logger.info("Starting overtime.ag data scrape")
+
+ # Navigate to main page
+ logger.info("Navigating to overtime.ag")
+ await chrome_client.navigate("https://overtime.ag/sports#/")
+ await asyncio.sleep(2) # Wait for page load
+
+ # Click user menu to get balance
+ logger.info("Extracting account balance")
+ _snapshot = await chrome_client.take_snapshot()
+
+ # Find user menu button (this will vary - need to adapt based on actual snapshot)
+ # For now, using a placeholder UID
+ # await chrome_client.click("user_menu_uid")
+ # balance_snapshot = await chrome_client.take_snapshot()
+ # balance_data = await extract_account_balance_from_menu(balance_snapshot)
+
+ # Placeholder balance data
+ balance_data = {
+ "balance": "$1,821.01",
+ "credit_limit": "$10,000.00",
+ "pending": "$800.00",
+ "available_balance": "$11,021.01",
+ "casino_balance": "$0.00",
+ }
+
+ # Navigate to open bets
+ logger.info("Navigating to open bets")
+ await chrome_client.navigate("https://overtime.ag/sports#/openBets")
+ await chrome_client.wait_for("Open Bets")
+ await asyncio.sleep(2)
+
+ open_bets_snapshot = await chrome_client.take_snapshot()
+ open_bets_data = await extract_open_bets_from_snapshot(open_bets_snapshot)
+
+ # Navigate to daily figures
+ logger.info("Navigating to daily figures")
+ await chrome_client.navigate("https://overtime.ag/sports#/dailyFigures")
+ await chrome_client.wait_for("DAILY FIGURES")
+ await asyncio.sleep(2)
+
+ daily_figures_snapshot = await chrome_client.take_snapshot()
+ daily_figures_data = await extract_daily_figures_from_snapshot(daily_figures_snapshot)
+
+ logger.info("Data scrape completed")
+ return balance_data, open_bets_data, daily_figures_data
+
+
+async def main():
+ """Main execution function."""
+ logger.info("Starting overtime.ag daily tracking script")
+
+ # Setup data directory
+ project_root = Path(__file__).parent.parent
+ data_dir = project_root / "data" / "overtime"
+
+ # Note: This script requires the chrome-devtools MCP server to be running
+ # and a browser session already logged into overtime.ag
+
+ logger.info("=" * 80)
+ logger.info("OVERTIME.AG DAILY TRACKER")
+ logger.info("=" * 80)
+ logger.info("")
+ logger.info("[IMPORTANT] Before running this script:")
+ logger.info("1. Ensure chrome-devtools MCP server is running")
+ logger.info("2. Open Chrome and log into overtime.ag")
+ logger.info("3. Keep the browser window open during execution")
+ logger.info("")
+ logger.info(f"Data will be saved to: {data_dir}")
+ logger.info("")
+ logger.info("=" * 80)
+
+ # For now, print instructions for manual data entry
+ # In production, this would connect to the MCP server
+ print("\n[INFO] This script is currently in development mode.")
+ print("[INFO] To complete the implementation:")
+ print("")
+ print("1. Implement MCP client connection")
+ print("2. Complete snapshot parsing functions")
+ print("3. Test with live browser session")
+ print("")
+ print("For now, you can manually capture data using the browser automation")
+ print("we demonstrated earlier, and I'll help build the parser functions.")
+
+ logger.info("Script setup complete - awaiting full MCP integration")
+
+
+if __name__ == "__main__":
+ asyncio.run(main())
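The `extract_account_balance_from_menu` walker above pairs a label text element (`BALANCE`, `PENDING`, ...) with the text element that follows it. That pattern can be factored into one reusable, testable function (the function name is hypothetical, and the real chrome-devtools snapshot node schema may differ from this mock):

```python
def extract_labeled_values(snapshot: dict, labels: set[str]) -> dict[str, str]:
    """Collect text nodes depth-first, then pair each label with the next text element."""
    texts: list[str] = []

    def walk(node: object) -> None:
        if isinstance(node, dict):
            if node.get("name"):
                texts.append(node["name"])
            for child in node.get("children", []):
                walk(child)

    walk(snapshot)
    return {
        text: texts[i + 1]
        for i, text in enumerate(texts)
        if text in labels and i + 1 < len(texts)
    }


menu = {"name": "", "children": [{"name": "BALANCE"}, {"name": "$1,821.01"}]}
print(extract_labeled_values(menu, {"BALANCE"}))  # → {'BALANCE': '$1,821.01'}
```

One generic walker like this would let the balance, open-bets, and daily-figures extractors share traversal logic and differ only in which labels they look for.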
diff --git a/scripts/archive/2026-02-deprecated/train_spreads_basic_model.py b/scripts/archive/2026-02-deprecated/train_spreads_basic_model.py
new file mode 100644
index 000000000..2db30a723
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/train_spreads_basic_model.py
@@ -0,0 +1,257 @@
+"""Train XGBoost model for spreads (favorite cover) prediction.
+
+Usage:
+ uv run python scripts/train_spreads_basic_model.py \
+ --data data/ml/spreads_2025-12-01_2026-02-03.parquet
+"""
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
+from sklearn.model_selection import train_test_split
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+BASIC_SPREADS_FEATURES = [
+ "opening_spread",
+ "closing_spread",
+ "line_movement",
+ "fav_adj_em",
+ "fav_adj_o",
+ "fav_adj_d",
+ "fav_adj_t",
+ "fav_luck",
+ "fav_sos",
+ "fav_efg_pct",
+ "fav_to_pct",
+ "fav_or_pct",
+ "fav_ft_rate",
+ "fav_defg_pct",
+ "fav_dto_pct",
+ "dog_adj_em",
+ "dog_adj_o",
+ "dog_adj_d",
+ "dog_adj_t",
+ "dog_luck",
+ "dog_sos",
+ "dog_efg_pct",
+ "dog_to_pct",
+ "dog_or_pct",
+ "dog_ft_rate",
+ "dog_defg_pct",
+ "dog_dto_pct",
+ "em_diff",
+ "fav_o_vs_dog_d",
+ "dog_o_vs_fav_d",
+]
+
+FULL_SPREADS_FEATURES = [
+ *BASIC_SPREADS_FEATURES,
+ "pinnacle_closing_spread",
+ "fanduel_closing_spread",
+ "draftkings_closing_spread",
+ "betmgm_closing_spread",
+ "total_steam_moves",
+ "max_steam_move",
+ "steam_move_direction",
+ "movement_velocity",
+ "hours_tracked",
+ "avg_observations_per_hour",
+ "late_movement_points",
+ "late_movement_flag",
+ "late_movement_pct",
+ "sharp_public_split",
+ "pinnacle_movement",
+ "public_movement",
+ "reverse_line_movement",
+ "consensus_spread",
+ "spread_variance",
+ "has_market_disagreement",
+ "outlier_book_count",
+ "near_key_number",
+ "closest_key_number",
+]
+
+
+def load_dataset(data_path: Path, feature_set: str) -> tuple[pd.DataFrame, pd.Series]:
+ """Load training dataset from parquet.
+
+ Args:
+ data_path: Path to parquet file with features and target
+ feature_set: "basic" or "full"
+
+ Returns:
+ (X, y) where X is features and y is target
+ """
+ logger.info("Loading dataset from %s...", data_path)
+ df = pd.read_parquet(data_path)
+
+ if "target" not in df.columns:
+ raise ValueError("Dataset missing required 'target' column")
+
+ features = BASIC_SPREADS_FEATURES if feature_set == "basic" else FULL_SPREADS_FEATURES
+ missing = [col for col in features if col not in df.columns]
+ if missing:
+ missing_str = ", ".join(missing)
+ raise ValueError(f"Dataset missing required spreads features: {missing_str}")
+
+ # Separate features and target
+ y = df["target"]
+ X = df[features]
+
+ # Fill NaN with 0 (missing KenPom data)
+ X = X.fillna(0)
+
+ logger.info("Loaded %d samples, %d features", len(X), len(X.columns))
+ logger.info("Target distribution: %s", y.value_counts().to_dict())
+
+ return X, y
+
+
+def train_model(
+ X: pd.DataFrame,
+ y: pd.Series,
+ test_size: float = 0.2,
+ random_state: int = 42,
+) -> tuple[xgb.XGBClassifier, dict]:
+ """Train XGBoost classifier with train/test split.
+
+ Args:
+ X: Feature matrix
+ y: Target labels (1=favorite covered, 0=favorite did not cover)
+ test_size: Proportion of data for testing
+ random_state: Random seed for reproducibility
+
+ Returns:
+ (trained_model, metrics_dict)
+ """
+ logger.info("Splitting data: %.1f%% test set...", test_size * 100)
+
+ X_train, X_test, y_train, y_test = train_test_split(
+ X, y, test_size=test_size, random_state=random_state, stratify=y
+ )
+
+ logger.info("Train set: %d samples", len(X_train))
+ logger.info("Test set: %d samples", len(X_test))
+
+ # Train XGBoost classifier
+ logger.info("Training XGBoost classifier...")
+ model = xgb.XGBClassifier(
+ n_estimators=100,
+ max_depth=5,
+ learning_rate=0.1,
+ objective="binary:logistic",
+ eval_metric="logloss",
+ random_state=random_state,
+ )
+
+ model.fit(
+ X_train,
+ y_train,
+ eval_set=[(X_test, y_test)],
+ verbose=False,
+ )
+
+ # Evaluate
+ logger.info("Evaluating model...")
+ y_train_pred = model.predict(X_train)
+ y_test_pred = model.predict(X_test)
+ y_train_proba = model.predict_proba(X_train)[:, 1]
+ y_test_proba = model.predict_proba(X_test)[:, 1]
+
+ metrics = {
+ "train_accuracy": accuracy_score(y_train, y_train_pred),
+ "test_accuracy": accuracy_score(y_test, y_test_pred),
+ "train_logloss": log_loss(y_train, y_train_proba),
+ "test_logloss": log_loss(y_test, y_test_proba),
+ "train_auc": roc_auc_score(y_train, y_train_proba),
+ "test_auc": roc_auc_score(y_test, y_test_proba),
+ }
+
+ logger.info("\n=== Model Performance ===")
+ logger.info("Train Accuracy: %.4f", metrics["train_accuracy"])
+ logger.info("Test Accuracy: %.4f", metrics["test_accuracy"])
+ logger.info("Train LogLoss: %.4f", metrics["train_logloss"])
+ logger.info("Test LogLoss: %.4f", metrics["test_logloss"])
+ logger.info("Train AUC: %.4f", metrics["train_auc"])
+ logger.info("Test AUC: %.4f", metrics["test_auc"])
+
+ # Feature importance
+ feature_importance = pd.DataFrame(
+ {"feature": X.columns, "importance": model.feature_importances_}
+ ).sort_values("importance", ascending=False)
+
+ logger.info("\n=== Top 10 Features ===")
+ for _, row in feature_importance.head(10).iterrows():
+ logger.info("%s: %.4f", row["feature"], row["importance"])
+
+ return model, metrics
+
+
+def save_model(model: xgb.XGBClassifier, output_path: Path) -> None:
+ """Save trained model to JSON format.
+
+ Args:
+ model: Trained XGBoost model
+ output_path: Path to save model file
+ """
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ model.save_model(str(output_path))
+ logger.info("Model saved to %s", output_path)
+
+
+def main() -> None:
+ """Train spreads prediction model."""
+ parser = argparse.ArgumentParser(description="Train XGBoost spreads model")
+ parser.add_argument(
+ "--data",
+ type=Path,
+ required=True,
+ help="Path to training data parquet file",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("models/spreads_model.json"),
+ help="Path to save trained model",
+ )
+ parser.add_argument(
+ "--feature-set",
+ choices=["basic", "full"],
+ default="basic",
+ help="Feature set to use (basic or full)",
+ )
+ parser.add_argument(
+ "--test-size",
+ type=float,
+ default=0.2,
+ help="Proportion of data for testing",
+ )
+ parser.add_argument(
+ "--random-state",
+ type=int,
+ default=42,
+ help="Random seed for reproducibility",
+ )
+
+ args = parser.parse_args()
+
+ # Load data
+ X, y = load_dataset(args.data, args.feature_set)
+
+ # Train model
+ model, _metrics = train_model(X, y, test_size=args.test_size, random_state=args.random_state)
+
+ # Save model
+ save_model(model, args.output)
+
+ logger.info("\n[OK] Spreads model training complete!")
+
+
+if __name__ == "__main__":
+ main()
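One caveat about the training script above: `train_test_split` shuffles games randomly, so late-season games can land in the training set while earlier games land in the test set, which tends to be optimistic for a market that drifts over time. A time-ordered holdout is a common alternative; a minimal sketch, assuming rows are already sorted by game date ascending:

```python
def chronological_split(rows: list, test_size: float = 0.2) -> tuple[list, list]:
    """Hold out the most recent fraction of rows instead of a random sample."""
    cutoff = int(len(rows) * (1 - test_size))
    return rows[:cutoff], rows[cutoff:]


train, test = chronological_split(list(range(10)), test_size=0.2)
print(train, test)  # → [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

The same split would work on a DataFrame sorted by `game_datetime` by slicing `df.iloc[:cutoff]` and `df.iloc[cutoff:]`; the stratification the script currently gets from `stratify=y` is lost, which is the usual trade-off.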
diff --git a/scripts/archive/2026-02-deprecated/verify_odds_streaming.py b/scripts/archive/2026-02-deprecated/verify_odds_streaming.py
new file mode 100644
index 000000000..b46a979a4
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/verify_odds_streaming.py
@@ -0,0 +1,250 @@
+"""Verify Odds API streaming service is working correctly.
+
+Checks:
+1. Database exists and schema is valid
+2. Recent data collection (last 5 minutes)
+3. Observations are being stored correctly
+4. Normalized views are working
+5. Quota usage is reasonable
+
+Usage:
+ uv run python scripts/verify_odds_streaming.py
+ uv run python scripts/verify_odds_streaming.py --db data/custom.sqlite3
+"""
+
+import argparse
+import sys
+from datetime import UTC, datetime
+from pathlib import Path
+
+# Ensure parent is in path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+
+def check_database_exists(db_path: Path) -> bool:
+ """Check if database file exists."""
+ if not db_path.exists():
+ print(f"[ERROR] Database not found: {db_path}")
+ print(" Run streaming service to create database:")
+ print(" uv run python scripts/stream_odds_api.py --once")
+ return False
+
+ print(f"[OK] Database exists: {db_path}")
+ return True
+
+
+def check_schema(db: OddsAPIDatabase) -> bool:
+ """Check if required tables exist."""
+ required_tables = ["events", "observations", "scores"]
+
+ for table in required_tables:
+ result = db.conn.execute(
+ "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
+ (table,),
+ ).fetchone()
+
+ if not result:
+ print(f"[ERROR] Missing table: {table}")
+ return False
+
+ print(f"[OK] Schema valid (tables: {', '.join(required_tables)})")
+ return True
+
+
+def check_normalized_views(db: OddsAPIDatabase) -> bool:
+ """Check if normalized views exist."""
+ required_views = ["canonical_spreads", "canonical_totals"]
+
+ for view in required_views:
+ result = db.conn.execute(
+ "SELECT name FROM sqlite_master WHERE type='view' AND name=?",
+ (view,),
+ ).fetchone()
+
+ if not result:
+ print(f"[WARNING] Missing view: {view}")
+ print(" This is normal for new databases - views create automatically")
+ return True # Not fatal
+
+ print("[OK] Normalized views exist")
+ return True
+
+
+def check_recent_collection(db: OddsAPIDatabase) -> bool:
+ """Check if data has been collected recently."""
+ # Get most recent observation
+ result = db.conn.execute(
+ """
+ SELECT MAX(as_of) as latest_collection
+ FROM observations
+ """
+ ).fetchone()
+
+ if not result or not result[0]:
+ print("[WARNING] No observations found in database")
+ print(" Run streaming service to start collection:")
+ print(" uv run python scripts/stream_odds_api.py --once")
+ return True # Not fatal for new databases
+
+ latest_str = result[0]
+ latest = datetime.fromisoformat(latest_str.replace("Z", "+00:00"))
+ now = datetime.now(UTC)
+ age_minutes = (now - latest).total_seconds() / 60
+
+ if age_minutes > 5:
+ print(f"[WARNING] Latest collection was {age_minutes:.1f} minutes ago")
+ print(" Expected: < 1 minute (30-second intervals)")
+ print(" Check if streaming daemon is running:")
+ print(" Get-ScheduledTask -TaskName 'OddsAPIStreaming'")
+ return True # Warning, not error
+
+ print(f"[OK] Recent collection: {age_minutes:.1f} minutes ago")
+ return True
+
+
+def check_data_coverage(db: OddsAPIDatabase) -> bool:
+ """Check data coverage and quality."""
+ stats = db.get_database_stats()
+
+ total_events = stats["total_events"]
+ events_with_scores = stats["events_with_scores"]
+
+ print("[OK] Data coverage:")
+ print(f" Total events: {total_events}")
+ print(f" Events with scores: {events_with_scores}")
+
+ if total_events == 0:
+ print("[WARNING] No events in database yet")
+ print(" This is normal for a new database")
+ return True
+
+ # Check bookmaker coverage
+ bookmaker_coverage = stats.get("bookmaker_coverage", [])
+ if bookmaker_coverage:
+ print(" Bookmaker coverage:")
+ for book in bookmaker_coverage[:5]: # Top 5
+ book_key = book["book_key"]
+ games_covered = book["games_covered"]
+ coverage_pct = book.get("coverage_pct", 0)
+ print(f" {book_key}: {games_covered} games ({coverage_pct:.1f}%)")
+
+ return True
+
+
+def check_observation_counts(db: OddsAPIDatabase) -> bool:
+ """Check observation counts for sanity."""
+ # Get counts by market
+ result = db.conn.execute(
+ """
+ SELECT
+ market_key,
+ COUNT(*) as count,
+ COUNT(DISTINCT event_id) as unique_events
+ FROM observations
+ GROUP BY market_key
+ """
+ ).fetchall()
+
+ if not result:
+ print("[WARNING] No observations found")
+ return True
+
+ print("[OK] Observations by market:")
+ for market_key, count, unique_events in result:
+ print(f" {market_key}: {count:,} observations ({unique_events} events)")
+
+ return True
+
+
+def check_quota_health(db: OddsAPIDatabase) -> bool:
+ """Check if quota usage is reasonable."""
+ # This is informational only - can't check quota without API call
+ print("[INFO] Quota health check skipped (requires API call)")
+ print(" Check logs for quota warnings:")
+ print(" Get-Content data\\logs\\odds_api_streaming.log -Tail 50")
+ return True
+
+
+def main() -> None:
+ """Run all verification checks."""
+ parser = argparse.ArgumentParser(description="Verify Odds API streaming service")
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+
+ args = parser.parse_args()
+
+ print("=" * 80)
+ print("Odds API Streaming Verification")
+ print("=" * 80)
+ print()
+
+ # Check 1: Database exists
+ if not check_database_exists(args.db):
+ print()
+ print("=" * 80)
+ print("[FAILED] Verification failed - database not found")
+ print("=" * 80)
+ sys.exit(1)
+
+ # Open database
+ db = OddsAPIDatabase(args.db)
+
+ try:
+ # Check 2: Schema
+ print()
+ if not check_schema(db):
+ print()
+ print("=" * 80)
+ print("[FAILED] Verification failed - schema invalid")
+ print("=" * 80)
+ sys.exit(1)
+
+ # Check 3: Normalized views
+ print()
+ check_normalized_views(db)
+
+ # Check 4: Recent collection
+ print()
+ check_recent_collection(db)
+
+ # Check 5: Data coverage
+ print()
+ check_data_coverage(db)
+
+ # Check 6: Observation counts
+ print()
+ check_observation_counts(db)
+
+ # Check 7: Quota health
+ print()
+ check_quota_health(db)
+
+ print()
+ print("=" * 80)
+ print("[SUCCESS] All verification checks passed")
+ print("=" * 80)
+ print()
+ print("Next Steps:")
+ print()
+ print("1. Start streaming daemon (if not already running):")
+ print(" Start-ScheduledTask -TaskName 'OddsAPIStreaming'")
+ print()
+ print("2. Monitor logs:")
+ print(" Get-Content data\\logs\\odds_api_streaming.log -Tail 50 -Wait")
+ print()
+ print("3. Query line movements:")
+ print(" python scripts/analyze_line_movement.py")
+ print()
+
+ finally:
+ db.close()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/2026-02-deprecated/view_collected_odds.py b/scripts/archive/2026-02-deprecated/view_collected_odds.py
new file mode 100644
index 000000000..95b04239f
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/view_collected_odds.py
@@ -0,0 +1,128 @@
+"""View collected overtime.ag odds data."""
+
+import sys
+from pathlib import Path
+
+import pandas as pd
+
+pd.set_option("display.max_columns", None)
+pd.set_option("display.width", 120)
+
+
+def main() -> None:
+ """Display collected odds data."""
+ data_dir = Path("data/overtime/basketball")
+
+ if not data_dir.exists():
+ print(f"[ERROR] Directory not found: {data_dir}")
+ print(
+ "Run the collection service first: uv run python scripts/overtime_collector_service.py"
+ )
+ sys.exit(1)
+
+ # Find all Parquet files
+ files = sorted(data_dir.glob("*.parquet"))
+
+ if not files:
+ print(f"[ERROR] No Parquet files found in {data_dir}")
+ sys.exit(1)
+
+ print("=" * 120)
+ print("OVERTIME.AG COLLECTED ODDS")
+ print("=" * 120)
+ print(f"Collections found: {len(files)}")
+ print(f"Directory: {data_dir.absolute()}")
+ print()
+
+ # Show most recent collection
+ latest = files[-1]
+ df = pd.read_parquet(latest)
+
+ print(f"Latest collection: {latest.name}")
+ print(f"Collected at: {df['collected_at'].iloc[0]}")
+ print(f"Games: {len(df)}")
+ print()
+
+ # Group by period (show only full game, not halves/quarters)
+ full_game = df[df["period_number"] == 0].copy()
+
+ print("=" * 120)
+ print("CURRENT ODDS (Full Game)")
+ print("=" * 120)
+ print()
+
+ for _, game in full_game.iterrows():
+ print(f"{game['team1_name']} @ {game['team2_name']}")
+ print(f" Game Time: {game['game_datetime']}")
+ print(f" Rotation: {game['team1_rot_num']} / {game['team2_rot_num']}")
+ print()
+ print(f" Spread: {game['spread_points']} (Favored: {game['spread_favored_team']})")
+ print(f" {game['team1_name']}: {game['spread1_juice']}")
+ print(f" {game['team2_name']}: {game['spread2_juice']}")
+ print()
+ print(f" Total: {game['total_points']}")
+ print(f" Over: {game['over_juice']}")
+ print(f" Under: {game['under_juice']}")
+ print()
+ print(" Moneyline:")
+ print(f" {game['team1_name']}: {game['moneyline1_american']}")
+ print(f" {game['team2_name']}: {game['moneyline2_american']}")
+ print()
+ print(" Team Totals:")
+ print(f" {game['team1_name']}: {game['team1_total_points']}")
+ print(f" {game['team2_name']}: {game['team2_total_points']}")
+ print()
+ print("-" * 120)
+ print()
+
+ # If multiple collections, show line movements
+ if len(files) > 1:
+ print("=" * 120)
+ print("LINE MOVEMENTS")
+ print("=" * 120)
+
+ df_old = pd.read_parquet(files[-2])
+ df_old = df_old[df_old["period_number"] == 0]
+
+ comparison = pd.merge(
+ df_old[["game_num", "team1_name", "spread_points", "total_points"]],
+ full_game[["game_num", "spread_points", "total_points"]],
+ on="game_num",
+ suffixes=("_old", "_new"),
+ )
+
+ comparison["spread_movement"] = (
+ comparison["spread_points_new"] - comparison["spread_points_old"]
+ )
+ comparison["total_movement"] = (
+ comparison["total_points_new"] - comparison["total_points_old"]
+ )
+
+ movers = comparison[
+ (comparison["spread_movement"] != 0) | (comparison["total_movement"] != 0)
+ ]
+
+ if len(movers) > 0:
+ print(f"Games with line movement: {len(movers)}")
+ print()
+ for _, row in movers.iterrows():
+ print(f"{row['team1_name']}")
+ if row["spread_movement"] != 0:
+ spread_old = row["spread_points_old"]
+ spread_new = row["spread_points_new"]
+ spread_move = row["spread_movement"]
+ print(f" Spread: {spread_old} -> {spread_new} ({spread_move:+.1f})")
+ if row["total_movement"] != 0:
+ total_old = row["total_points_old"]
+ total_new = row["total_points_new"]
+ total_move = row["total_movement"]
+ print(f" Total: {total_old} -> {total_new} ({total_move:+.1f})")
+ print()
+ else:
+ print("No line movements detected")
+
+ print("=" * 120)
+
+
+if __name__ == "__main__":
+ main()
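The viewer above prints each spread from the favorite's perspective, in line with the repo rule that a game stores one canonical spread rather than a ±pair of rows. A minimal pure-Python sketch of why that canonicalization matters (the field names are illustrative, not the actual Parquet schema):

```python
# Two rows per game, one per side: averaging the signed spreads yields 0.
rows = [
    {"game_num": 1, "team": "Team A", "spread": -3.5},
    {"game_num": 1, "team": "Team B", "spread": 3.5},
]
naive_mean = sum(r["spread"] for r in rows) / len(rows)
assert naive_mean == 0.0  # meaningless if the ±pair is averaged

# Canonical form: one row per game from the favorite's perspective.
favorite = min(rows, key=lambda r: r["spread"])
canonical = {
    "game_num": favorite["game_num"],
    "favorite_team": favorite["team"],
    "spread_magnitude": abs(favorite["spread"]),
}
```

Aggregations and line-movement diffs then operate on `spread_magnitude` plus `favorite_team`, which is what keeps the movement comparison above well-defined.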
diff --git a/scripts/archive/2026-02-deprecated/walk_forward_training_fast.py b/scripts/archive/2026-02-deprecated/walk_forward_training_fast.py
new file mode 100644
index 000000000..253503f3d
--- /dev/null
+++ b/scripts/archive/2026-02-deprecated/walk_forward_training_fast.py
@@ -0,0 +1,248 @@
+#!/usr/bin/env python3
+"""Walk-Forward Training with date-enhanced datasets."""
+
+from datetime import timedelta
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
+
+# Configuration
+TRAINING_WINDOW_DAYS = 21
+VALIDATION_WINDOW_DAYS = 7
+STEP_SIZE_DAYS = 7
+MIN_GAMES_PER_WINDOW = 50
+
+MODEL_PARAMS = {
+ "n_estimators": 150,
+ "max_depth": 3,
+ "learning_rate": 0.05,
+ "min_child_weight": 5,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "reg_alpha": 0.1,
+ "reg_lambda": 1.0,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "random_state": 42,
+ "use_label_encoder": False,
+}
+
+print("=" * 80)
+print("WALK-FORWARD VALIDATION TRAINING")
+print("=" * 80)
+
+# Load datasets with dates
+spreads_path = Path("data/ml/spreads_with_dates_2026.parquet")
+totals_path = Path("data/ml/totals_with_dates_2026.parquet")
+
+print("\n[LOADING DATASETS]")
+spreads_df = pd.read_parquet(spreads_path)
+totals_df = pd.read_parquet(totals_path)
+
+spreads_df["game_date"] = pd.to_datetime(spreads_df["game_date"])
+totals_df["game_date"] = pd.to_datetime(totals_df["game_date"])
+
+spreads_min = spreads_df["game_date"].min().date()
+spreads_max = spreads_df["game_date"].max().date()
+totals_min = totals_df["game_date"].min().date()
+totals_max = totals_df["game_date"].max().date()
+print(f"Spreads: {len(spreads_df)} games | {spreads_min} to {spreads_max}")
+print(f"Totals: {len(totals_df)} games | {totals_min} to {totals_max}")
+
+
+def create_walk_forward_splits(df):
+ """Create temporal train/validation splits."""
+ df = df.sort_values("game_date")
+ min_date = df["game_date"].min()
+ max_date = df["game_date"].max()
+
+ splits = []
+ val_start = min_date + timedelta(days=TRAINING_WINDOW_DAYS)
+
+ while True:
+ val_end = val_start + timedelta(days=VALIDATION_WINDOW_DAYS)
+ if val_end > max_date:
+ break
+
+ train_start = val_start - timedelta(days=TRAINING_WINDOW_DAYS)
+ train_mask = (df["game_date"] >= train_start) & (df["game_date"] < val_start)
+ val_mask = (df["game_date"] >= val_start) & (df["game_date"] < val_end)
+
+ train_df = df[train_mask]
+ val_df = df[val_mask]
+
+ if len(train_df) >= MIN_GAMES_PER_WINDOW and len(val_df) >= 10:
+ splits.append((train_df, val_df, f"{val_start.date()}_to_{val_end.date()}"))
+
+ val_start += timedelta(days=STEP_SIZE_DAYS)
+
+ return splits
+
+
+def train_and_evaluate(splits, target_col="target"):
+ """Train model on each window and aggregate results."""
+ all_preds = []
+ all_probs = []
+ all_actuals = []
+ results = []
+
+ for i, (train_df, val_df, period) in enumerate(splits, 1):
+ print(f"\n[Window {i}/{len(splits)}] {period}")
+ print(f" Train: {len(train_df)} games")
+ print(f" Val: {len(val_df)} games")
+
+ y_train = train_df[target_col]
+ y_val = val_df[target_col]
+
+ feature_cols = [col for col in train_df.columns if col not in [target_col, "game_date"]]
+
+ X_train = train_df[feature_cols].fillna(0)
+ X_val = val_df[feature_cols].fillna(0)
+
+ print(f" Features: {len(feature_cols)}")
+
+ model = xgb.XGBClassifier(**MODEL_PARAMS)
+ model.fit(X_train, y_train, verbose=False)
+
+ y_pred = model.predict(X_val)
+ y_proba = model.predict_proba(X_val)[:, 1]
+
+ acc = accuracy_score(y_val, y_pred)
+ try:
+ auc = roc_auc_score(y_val, y_proba)
+ ll = log_loss(y_val, y_proba)
+ except Exception:
+ auc, ll = 0.5, np.nan
+
+ print(f" Accuracy: {acc:.4f} | AUC: {auc:.4f} | LogLoss: {ll:.4f}")
+
+ all_preds.extend(y_pred)
+ all_probs.extend(y_proba)
+ all_actuals.extend(y_val)
+
+ results.append(
+ {
+ "period": period,
+ "train_games": len(train_df),
+ "val_games": len(val_df),
+ "accuracy": acc,
+ "auc": auc,
+ "logloss": ll,
+ }
+ )
+
+ overall_acc = accuracy_score(all_actuals, all_preds)
+ overall_auc = roc_auc_score(all_actuals, all_probs)
+ overall_ll = log_loss(all_actuals, all_probs)
+
+ return {
+ "overall_accuracy": overall_acc,
+ "overall_auc": overall_auc,
+ "overall_logloss": overall_ll,
+ "window_results": results,
+ }
+
+
+# Train spreads model
+print("\n" + "=" * 80)
+print("SPREADS MODEL - WALK-FORWARD VALIDATION")
+print("=" * 80)
+
+spreads_splits = create_walk_forward_splits(spreads_df)
+
+if len(spreads_splits) == 0:
+ print("\n[WARN] Not enough data for walk-forward, using 80/20 split")
+ split_idx = int(len(spreads_df) * 0.8)
+ train_df = spreads_df.iloc[:split_idx]
+ val_df = spreads_df.iloc[split_idx:]
+ spreads_splits = [(train_df, val_df, "single_split")]
+
+print(f"\nCreated {len(spreads_splits)} validation windows")
+spreads_results = train_and_evaluate(spreads_splits)
+
+# Train totals model
+print("\n" + "=" * 80)
+print("TOTALS MODEL - WALK-FORWARD VALIDATION")
+print("=" * 80)
+
+totals_splits = create_walk_forward_splits(totals_df)
+
+if len(totals_splits) == 0:
+ print("\n[WARN] Not enough data for walk-forward, using 80/20 split")
+ split_idx = int(len(totals_df) * 0.8)
+ train_df = totals_df.iloc[:split_idx]
+ val_df = totals_df.iloc[split_idx:]
+ totals_splits = [(train_df, val_df, "single_split")]
+
+print(f"\nCreated {len(totals_splits)} validation windows")
+totals_results = train_and_evaluate(totals_splits)
+
+# Final model training on all data
+print("\n" + "=" * 80)
+print("TRAINING FINAL MODELS ON ALL DATA")
+print("=" * 80)
+
+print("\n[Spreads Model]")
+y = spreads_df["target"]
+X = spreads_df[[col for col in spreads_df.columns if col not in ["target", "game_date"]]].fillna(0)
+
+final_spreads = xgb.XGBClassifier(**MODEL_PARAMS)
+final_spreads.fit(X, y, verbose=False)
+
+spreads_model_path = Path("data/models/spreads_model.json")
+final_spreads.save_model(str(spreads_model_path))
+print(f" [SAVED] {spreads_model_path}")
+
+print("\n[Totals Model]")
+y = totals_df["target"]
+X = totals_df[[col for col in totals_df.columns if col not in ["target", "game_date"]]].fillna(0)
+
+final_totals = xgb.XGBClassifier(**MODEL_PARAMS)
+final_totals.fit(X, y, verbose=False)
+
+totals_model_path = Path("data/models/totals_model.json")
+final_totals.save_model(str(totals_model_path))
+print(f" [SAVED] {totals_model_path}")
+
+# Summary
+print("\n" + "=" * 80)
+print("WALK-FORWARD VALIDATION SUMMARY")
+print("=" * 80)
+
+print("\n[SPREADS MODEL]")
+print(f" Overall Accuracy: {spreads_results['overall_accuracy']:.4f}")
+print(f" Overall AUC: {spreads_results['overall_auc']:.4f}")
+print(f" Overall LogLoss: {spreads_results['overall_logloss']:.4f}")
+print(f" Validation windows: {len(spreads_results['window_results'])}")
+
+print("\n[TOTALS MODEL]")
+print(f" Overall Accuracy: {totals_results['overall_accuracy']:.4f}")
+print(f" Overall AUC: {totals_results['overall_auc']:.4f}")
+print(f" Overall LogLoss: {totals_results['overall_logloss']:.4f}")
+print(f" Validation windows: {len(totals_results['window_results'])}")
+
+print("\n" + "=" * 80)
+print("PERFORMANCE ANALYSIS")
+print("=" * 80)
+
+# Break-even win rate at standard -110 juice is 110/210 ≈ 52.4%.
+BREAK_EVEN = 0.524
+
+if spreads_results["overall_accuracy"] > BREAK_EVEN:
+    print("\n✅ SPREADS MODEL: PROFITABLE")
+    print(f"  Accuracy: {spreads_results['overall_accuracy']:.2%} (above {BREAK_EVEN:.1%} break-even)")
+    print(f"  Expected edge: ~{(spreads_results['overall_accuracy'] - BREAK_EVEN) * 100:.1f}% per bet")
+else:
+    print(f"\n⚠️ SPREADS MODEL: {spreads_results['overall_accuracy']:.2%} (below {BREAK_EVEN:.1%} break-even)")
+
+if totals_results["overall_accuracy"] > BREAK_EVEN:
+    print("\n✅ TOTALS MODEL: PROFITABLE")
+    print(f"  Accuracy: {totals_results['overall_accuracy']:.2%} (above {BREAK_EVEN:.1%} break-even)")
+    print(f"  Expected edge: ~{(totals_results['overall_accuracy'] - BREAK_EVEN) * 100:.1f}% per bet")
+else:
+    print(f"\n⚠️ TOTALS MODEL: {totals_results['overall_accuracy']:.2%} (below {BREAK_EVEN:.1%} break-even)")
+
+print("\n[MODELS READY FOR DEPLOYMENT]")
+print("Models trained on full dataset and saved.")
+print("Note: These models expect 30+ features including Four Factors.")
+print("For tonight, recommend using KenPom-based predictions (simpler approach).")
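The profitability summary in this script hinges on the break-even win rate implied by standard -110 pricing, and the project rules call for converting American odds to implied probability before any aggregation. A hedged sketch of that conversion (the function name is illustrative, not an existing module):

```python
def american_to_implied(odds: int) -> float:
    """Convert American odds to the implied win probability (vig included)."""
    if odds < 0:
        # Favorite: risk |odds| to win 100.
        return -odds / (-odds + 100)
    # Underdog: risk 100 to win `odds`.
    return 100 / (odds + 100)

# At standard -110 juice the break-even win rate is about 52.4%,
# which is why a model at 52.0% accuracy is not automatically profitable.
break_even = american_to_implied(-110)
```

The same conversion is what makes moneyline prices from different books comparable before averaging, per the canonical-markets rule.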
diff --git a/scripts/archive/collect_daily.py b/scripts/archive/collect_daily.py
new file mode 100644
index 000000000..7363f1995
--- /dev/null
+++ b/scripts/archive/collect_daily.py
@@ -0,0 +1,389 @@
+"""Daily odds and scores collection for NCAA Basketball.
+
+Collects:
+1. Current odds (spreads, totals, moneylines) from all bookmakers
+2. Scores for recently completed games
+3. Logs collection metrics
+
+Designed to run once daily (or multiple times per day during game days).
+
+Usage:
+ # Collect current odds + scores from last 3 days
+ uv run python scripts/collect_daily.py
+
+ # Custom date range for scores
+ uv run python scripts/collect_daily.py --scores-days 7
+
+Environment:
+ ODDS_API_KEY: Required - Your Odds API key
+"""
+
+import argparse
+import logging
+import os
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+
+import httpx
+
+# Ensure log directory exists
+Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler("data/logs/daily_collection.log"),
+ ],
+)
+logger = logging.getLogger(__name__)
+
+
+def check_api_key() -> str:
+ """Check that ODDS_API_KEY environment variable is set.
+
+ Returns:
+ API key
+
+ Raises:
+ SystemExit: If API key not found
+ """
+ api_key = os.getenv("ODDS_API_KEY")
+ if not api_key:
+ logger.error("ODDS_API_KEY environment variable not set!")
+ logger.error("Set it with: export ODDS_API_KEY='your_key_here'")
+ sys.exit(1)
+ return api_key
+
+
+def collect_current_odds(api_key: str, db_path: Path) -> dict:
+ """Collect current odds for all upcoming games.
+
+ Args:
+ api_key: Odds API key
+ db_path: Path to SQLite database
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+ logger.info("=" * 60)
+ logger.info("COLLECTING CURRENT ODDS")
+ logger.info("=" * 60)
+
+ base_url = "https://api.the-odds-api.com/v4"
+ sport = "basketball_ncaab"
+ regions = "us,us2"
+ markets = "h2h,spreads,totals"
+
+ # Get upcoming games
+ url = f"{base_url}/sports/{sport}/odds/"
+ params = {
+ "apiKey": api_key,
+ "regions": regions,
+ "markets": markets,
+ "oddsFormat": "american",
+ }
+
+ logger.info(f"Fetching odds from {url}")
+ logger.info(f"Regions: {regions}")
+ logger.info(f"Markets: {markets}")
+
+ try:
+ with httpx.Client(timeout=30.0) as client:
+ response = client.get(url, params=params)
+ response.raise_for_status()
+
+ # Check quota
+ remaining = response.headers.get("x-requests-remaining")
+ used = response.headers.get("x-requests-used")
+ logger.info(f"API Quota - Used: {used}, Remaining: {remaining}")
+
+ odds_data = response.json()
+ logger.info(f"Retrieved {len(odds_data)} events with odds")
+
+ # Store in database
+ if len(odds_data) > 0:
+ db = OddsAPIDatabase(db_path)
+ try:
+ # Store events and observations
+ events_stored = 0
+ observations_stored = 0
+ as_of = datetime.now().isoformat()
+
+ for event in odds_data:
+ event_id = event["id"]
+ home_team = event["home_team"]
+ away_team = event["away_team"]
+ commence_time = event["commence_time"]
+
+ # Store event
+ db.conn.execute(
+ """
+ INSERT OR REPLACE INTO events
+ (event_id, sport_key, home_team, away_team, commence_time, created_at)
+ VALUES (?, ?, ?, ?, ?, ?)
+ """,
+ (event_id, sport, home_team, away_team, commence_time, as_of),
+ )
+ events_stored += 1
+
+ # Store odds observations
+ for bookmaker in event.get("bookmakers", []):
+ book_key = bookmaker["key"]
+
+ for market in bookmaker.get("markets", []):
+ market_key = market["key"]
+
+ for outcome in market.get("outcomes", []):
+ outcome_name = outcome["name"]
+ price = outcome.get("price")
+ point = outcome.get("point")
+
+ db.conn.execute(
+ """
+ INSERT INTO observations
+ (event_id, book_key, market_key, outcome_name,
+ price_american, point, as_of, fetched_at, sport_key)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ book_key,
+ market_key,
+ outcome_name,
+ price,
+ point,
+ as_of,
+ as_of,
+ sport,
+ ),
+ )
+ observations_stored += 1
+
+ db.conn.commit()
+ logger.info(f"[OK] Stored {events_stored} events")
+ logger.info(f"[OK] Stored {observations_stored} observations")
+
+ return {
+ "events": events_stored,
+ "observations": observations_stored,
+ "quota_remaining": remaining,
+ }
+ finally:
+ db.close()
+ else:
+ logger.warning("No events retrieved from API")
+ return {"events": 0, "observations": 0, "quota_remaining": remaining}
+
+ except httpx.HTTPStatusError as e:
+ logger.error(f"HTTP error: {e.response.status_code}")
+ logger.error(f"Response: {e.response.text}")
+ raise
+ except httpx.RequestError as e:
+ logger.error(f"Request error: {e}")
+ raise
+
+
+def collect_recent_scores(api_key: str, db_path: Path, days_back: int = 3) -> dict:
+ """Collect scores for recently completed games.
+
+ Args:
+ api_key: Odds API key
+ db_path: Path to SQLite database
+ days_back: How many days back to collect scores
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+ logger.info("=" * 60)
+ logger.info(f"COLLECTING SCORES (LAST {days_back} DAYS)")
+ logger.info("=" * 60)
+
+ base_url = "https://api.the-odds-api.com/v4"
+ sport = "basketball_ncaab"
+
+ # Calculate date range
+ end_date = date.today()
+ start_date = end_date - timedelta(days=days_back)
+
+ logger.info(f"Date range: {start_date} to {end_date}")
+
+    scores_stored = 0
+    db = OddsAPIDatabase(db_path)
+
+    try:
+        # The scores endpoint's daysFrom parameter is relative to the time of
+        # the request (the API caps it at 3), so a single call covers the whole
+        # window; looping per calendar date would re-fetch the same data.
+        effective_days = min(days_back, 3)
+        if days_back > 3:
+            logger.warning("Scores endpoint returns at most 3 days back; capping request")
+
+        url = f"{base_url}/sports/{sport}/scores/"
+        params = {
+            "apiKey": api_key,
+            "daysFrom": effective_days,
+            "dateFormat": "iso",
+        }
+
+        logger.info(f"Fetching scores for the last {effective_days} day(s)")
+
+        try:
+            with httpx.Client(timeout=30.0) as client:
+                response = client.get(url, params=params)
+                response.raise_for_status()
+
+                # Check quota
+                remaining = response.headers.get("x-requests-remaining")
+                logger.info(f"API Quota Remaining: {remaining}")
+
+                scores_data = response.json()
+                completed_games = [g for g in scores_data if g.get("completed") is True]
+
+                logger.info(f"Found {len(completed_games)} completed games")
+
+                # Store scores
+                for game in completed_games:
+                    event_id = game["id"]
+                    home_team = game["home_team"]
+                    away_team = game["away_team"]
+
+                    # Get scores from the 'scores' field
+                    scores = game.get("scores")
+                    if not scores or len(scores) < 2:
+                        continue
+
+                    # Match each score entry to the home/away team by name
+                    home_score = None
+                    away_score = None
+
+                    for score in scores:
+                        if score["name"] == home_team:
+                            home_score = score.get("score")
+                        elif score["name"] == away_team:
+                            away_score = score.get("score")
+
+                    if home_score is not None and away_score is not None:
+                        db.conn.execute(
+                            """
+                            INSERT OR REPLACE INTO scores
+                            (event_id, sport_key, completed, home_score, away_score,
+                             last_update, fetched_at)
+                            VALUES (?, ?, ?, ?, ?, ?, ?)
+                            """,
+                            (
+                                event_id,
+                                sport,
+                                1,
+                                home_score,
+                                away_score,
+                                game.get("last_update", datetime.now().isoformat()),
+                                datetime.now().isoformat(),
+                            ),
+                        )
+                        scores_stored += 1
+
+                db.conn.commit()
+
+        except httpx.HTTPStatusError as e:
+            if e.response.status_code == 404:
+                logger.info("No recent scores available")
+            else:
+                logger.error(f"HTTP error: {e.response.status_code}")
+        except httpx.RequestError as e:
+            logger.error(f"Request error: {e}")
+
+        logger.info(f"[OK] Stored {scores_stored} total scores")
+        return {"scores": scores_stored}
+
+ finally:
+ db.close()
+
+
+def log_collection_summary(odds_metrics: dict, scores_metrics: dict) -> None:
+ """Log summary of collection run.
+
+ Args:
+ odds_metrics: Metrics from odds collection
+ scores_metrics: Metrics from scores collection
+ """
+ logger.info("=" * 60)
+ logger.info("COLLECTION SUMMARY")
+ logger.info("=" * 60)
+ logger.info(f"Timestamp: {datetime.now().isoformat()}")
+ logger.info("")
+ logger.info("Odds Collection:")
+ logger.info(f" Events stored: {odds_metrics.get('events', 0)}")
+ logger.info(f" Observations stored: {odds_metrics.get('observations', 0)}")
+ logger.info(f" API quota remaining: {odds_metrics.get('quota_remaining', 'unknown')}")
+ logger.info("")
+ logger.info("Scores Collection:")
+ logger.info(f" Scores stored: {scores_metrics.get('scores', 0)}")
+ logger.info("")
+ logger.info("[OK] Daily collection complete!")
+ logger.info("=" * 60)
+
+
+def main() -> None:
+ """Run daily collection."""
+ parser = argparse.ArgumentParser(description="Daily odds and scores collection")
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--scores-days",
+ type=int,
+ default=3,
+ help="Days back to collect scores (default: 3)",
+ )
+ parser.add_argument(
+ "--skip-odds",
+ action="store_true",
+ help="Skip odds collection (scores only)",
+ )
+ parser.add_argument(
+ "--skip-scores",
+ action="store_true",
+ help="Skip scores collection (odds only)",
+ )
+
+ args = parser.parse_args()
+
+ # Ensure log directory exists
+ Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+ # Check API key
+ api_key = check_api_key()
+
+ logger.info("Starting daily collection...")
+ logger.info(f"Database: {args.db}")
+
+ try:
+ # Collect current odds
+ odds_metrics = {}
+ if not args.skip_odds:
+ odds_metrics = collect_current_odds(api_key, args.db)
+
+ # Collect recent scores
+ scores_metrics = {}
+ if not args.skip_scores:
+ scores_metrics = collect_recent_scores(api_key, args.db, args.scores_days)
+
+ # Log summary
+ log_collection_summary(odds_metrics, scores_metrics)
+
+ except Exception as e:
+ logger.error(f"Collection failed: {e}", exc_info=True)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/archive/collect_odds_api_sample.py b/scripts/archive/collect_odds_api_sample.py
new file mode 100644
index 000000000..3251ce6da
--- /dev/null
+++ b/scripts/archive/collect_odds_api_sample.py
@@ -0,0 +1,137 @@
+"""Collect sample Odds API data for team name mapping.
+
+This script fetches current NCAA Men's Basketball odds from The Odds API
+and extracts unique team names for mapping to the canonical team table.
+
+Usage:
+ uv run python scripts/collect_odds_api_sample.py
+
+Output:
+ - data/odds_api/sample/ncaab_odds_YYYY-MM-DD.parquet
+ - Prints unique team names for mapping
+"""
+
+import asyncio
+import logging
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api import OddsAPIAdapter
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+async def collect_odds_sample() -> pd.DataFrame:
+ """Collect current NCAAB odds from The Odds API.
+
+ Returns:
+ DataFrame with odds data
+ """
+ logger.info("Initializing Odds API adapter...")
+ adapter = OddsAPIAdapter()
+
+ try:
+ logger.info("Fetching NCAA Men's Basketball odds...")
+ odds_data = await adapter.get_ncaab_odds()
+
+ logger.info(f"Retrieved {len(odds_data)} events from Odds API")
+
+ # Extract team names and event info
+ events = []
+ for event in odds_data:
+ events.append(
+ {
+ "event_id": event["id"],
+ "sport_key": event["sport_key"],
+ "commence_time": event["commence_time"],
+ "home_team": event["home_team"],
+ "away_team": event["away_team"],
+ "bookmaker_count": len(event.get("bookmakers", [])),
+ }
+ )
+
+ df = pd.DataFrame(events)
+ logger.info(f"Processed {len(df)} events")
+
+ # Log quota usage
+ quota_remaining = adapter.get_quota_remaining()
+ quota_used = adapter.get_quota_used()
+ if quota_remaining is not None:
+ logger.info(f"API quota remaining: {quota_remaining:,}")
+ if quota_used is not None:
+ logger.info(f"API quota used: {quota_used:,}")
+
+ return df
+
+ finally:
+ await adapter.close()
+
+
+def extract_unique_teams(df: pd.DataFrame) -> list[str]:
+ """Extract unique team names from odds data.
+
+ Args:
+ df: Odds data DataFrame
+
+ Returns:
+ Sorted list of unique team names
+ """
+ home_teams = df["home_team"].unique()
+ away_teams = df["away_team"].unique()
+ all_teams = sorted(set(list(home_teams) + list(away_teams)))
+ return all_teams
+
+
+def save_sample_data(df: pd.DataFrame, output_dir: Path) -> None:
+ """Save sample odds data to parquet.
+
+ Args:
+ df: Odds data DataFrame
+ output_dir: Directory to save data
+ """
+ output_dir.mkdir(parents=True, exist_ok=True)
+ today = datetime.now().strftime("%Y-%m-%d")
+ output_path = output_dir / f"ncaab_odds_{today}.parquet"
+
+ df.to_parquet(output_path, index=False)
+ logger.info(f"Saved sample data to {output_path}")
+
+
+def main() -> None:
+ """Collect Odds API sample data and extract team names."""
+ logger.info("Starting Odds API sample collection...")
+
+ # Collect data
+ df = asyncio.run(collect_odds_sample())
+
+ if len(df) == 0:
+ logger.warning("No events returned from Odds API")
+ logger.info("This may be normal if no games are scheduled soon")
+ return
+
+ # Extract unique team names
+ teams = extract_unique_teams(df)
+ logger.info(f"\nFound {len(teams)} unique teams in Odds API data:")
+ logger.info("=" * 80)
+ for i, team in enumerate(teams, 1):
+ print(f"{i:3}. {team}")
+ logger.info("=" * 80)
+
+ # Save sample data
+ output_dir = Path("data/odds_api/sample")
+ save_sample_data(df, output_dir)
+
+ # Summary
+ logger.info("\nSummary:")
+ logger.info(f" Events: {len(df)}")
+ logger.info(f" Unique teams: {len(teams)}")
+ logger.info(f" Date range: {df['commence_time'].min()} to {df['commence_time'].max()}")
+
+ logger.info("\nNext step: Run python scripts/map_odds_api_teams.py")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/check-docs.ts b/scripts/check-docs.ts
new file mode 100644
index 000000000..0919fa2a2
--- /dev/null
+++ b/scripts/check-docs.ts
@@ -0,0 +1,104 @@
+/**
+ * Minimal markdown smoke check for docs.
+ *
+ * Goals:
+ * - Ensure all markdown files under docs/ are readable.
+ * - Fail fast on obvious bad states (e.g. unresolved merge markers).
+ *
+ * This is intentionally lightweight and dependency-free so it can run quickly in CI.
+ */
+
+import fs from 'node:fs';
+import path from 'node:path';
+
+const DOCS_ROOT = path.join(process.cwd(), 'docs');
+
+function isMarkdownFile(filePath: string): boolean {
+ return filePath.toLowerCase().endsWith('.md');
+}
+
+function findMarkdownFiles(root: string): string[] {
+ const results: string[] = [];
+
+ function walk(current: string) {
+ const entries = fs.readdirSync(current, {withFileTypes: true});
+ for (const entry of entries) {
+ const fullPath = path.join(current, entry.name);
+ if (entry.isDirectory()) {
+ walk(fullPath);
+ } else if (entry.isFile() && isMarkdownFile(entry.name)) {
+ results.push(fullPath);
+ }
+ }
+ }
+
+ walk(root);
+ return results;
+}
+
+function checkFile(filePath: string): string[] {
+ const errors: string[] = [];
+ const content = fs.readFileSync(filePath, 'utf8');
+
+ if (!content.trim()) {
+ errors.push('file is empty');
+ }
+
+ const lines = content.split(/\r?\n/);
+  let inConflict = false;
+  lines.forEach((line, index) => {
+    const lineNumber = index + 1;
+    const trimmed = line.trim();
+
+    // Detect real git conflict markers:
+    //   <<<<<<< HEAD
+    //   =======
+    //   >>>>>>> branch-name
+    // A bare "=======" is only flagged inside an open conflict, since it is
+    // also a valid setext heading underline in markdown.
+    if (trimmed.startsWith('<<<<<<< ')) {
+      inConflict = true;
+      errors.push(`unresolved merge marker on line ${lineNumber}`);
+    } else if (trimmed.startsWith('>>>>>>> ')) {
+      inConflict = false;
+      errors.push(`unresolved merge marker on line ${lineNumber}`);
+    } else if (inConflict && trimmed === '=======') {
+      errors.push(`unresolved merge marker on line ${lineNumber}`);
+    }
+  });
+
+ return errors;
+}
+
+function main() {
+ if (!fs.existsSync(DOCS_ROOT)) {
+ console.error(`Docs directory not found at ${DOCS_ROOT}`);
+ process.exit(1);
+ }
+
+ const markdownFiles = findMarkdownFiles(DOCS_ROOT);
+ const allErrors: string[] = [];
+
+ for (const file of markdownFiles) {
+ try {
+ const errors = checkFile(file);
+ for (const err of errors) {
+ allErrors.push(`${path.relative(process.cwd(), file)}: ${err}`);
+ }
+ } catch (error) {
+ allErrors.push(
+ `${path.relative(process.cwd(), file)}: failed to read or parse file: ${
+ (error as Error).message
+ }`,
+ );
+ }
+ }
+
+ if (allErrors.length > 0) {
+ console.error('Docs smoke check failed with the following issues:\n');
+ for (const msg of allErrors) {
+ console.error(`- ${msg}`);
+ }
+ process.exit(1);
+ }
+
+ console.log(`Docs smoke check passed for ${markdownFiles.length} markdown file(s).`);
+}
+
+main();
+
diff --git a/scripts/collection/.gitkeep b/scripts/collection/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/scripts/collection/archive_daily_odds.py b/scripts/collection/archive_daily_odds.py
new file mode 100644
index 000000000..b5fd61175
--- /dev/null
+++ b/scripts/collection/archive_daily_odds.py
@@ -0,0 +1,181 @@
+"""Archive daily snapshot of current odds observations.
+
+This script should run daily (e.g., 11 PM Pacific) to preserve odds state
+for historical analysis and walk-forward validation.
+
+Usage:
+ # Archive current odds state
+ python scripts/collection/archive_daily_odds.py
+
+ # Archive with specific date (for manual backfill)
+ python scripts/collection/archive_daily_odds.py --date 2026-02-07
+
+ # Dry run (show what would be archived)
+ python scripts/collection/archive_daily_odds.py --dry-run
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import sys
+from datetime import date, datetime
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+def archive_odds(
+ db_path: Path,
+ snapshot_date: date,
+ *,
+ dry_run: bool = False,
+) -> dict[str, int]:
+ """Archive current odds observations as a snapshot.
+
+ Args:
+ db_path: Path to SQLite database
+ snapshot_date: Date for this snapshot
+ dry_run: If True, don't actually create snapshots
+
+ Returns:
+ Dict with count of observations archived
+ """
+ logger.info(f"[OK] === Archiving Odds Snapshot for {snapshot_date} ===\n")
+
+ db = OddsAPIDatabase(db_path)
+
+ # Get current timestamp in Pacific time
+ snapshot_time = datetime.now(PST).isoformat()
+
+ # Query all current observations from the database
+ query = """
+ SELECT
+ obs_id as observation_id,
+ event_id,
+ book_key,
+ market_key,
+ outcome_name,
+ price_american,
+ price_decimal,
+ point,
+ book_last_update as last_update
+ FROM observations
+ """
+
+ observations_df = pd.read_sql_query(query, db.conn)
+
+ if len(observations_df) == 0:
+ logger.warning("No observations found to archive")
+ return {"total": 0, "spreads": 0, "totals": 0, "moneylines": 0}
+
+ logger.info(f"Found {len(observations_df)} observations to archive")
+
+ # Count by market type
+ market_counts = observations_df["market_key"].value_counts().to_dict()
+ logger.info(f" Spreads: {market_counts.get('spreads', 0)}")
+ logger.info(f" Totals: {market_counts.get('totals', 0)}")
+ logger.info(f" Moneylines: {market_counts.get('h2h', 0)}")
+
+ if dry_run:
+ logger.info("\n[DRY RUN] No snapshots created\n")
+ return {
+ "total": len(observations_df),
+ "spreads": market_counts.get("spreads", 0),
+ "totals": market_counts.get("totals", 0),
+ "moneylines": market_counts.get("h2h", 0),
+ }
+
+ # Convert DataFrame to list of dicts for create_snapshot
+ observations_list = observations_df.to_dict("records")
+
+ # Create snapshot records
+ count = db.create_snapshot(
+ snapshot_date=snapshot_date,
+ snapshot_time=snapshot_time,
+ observations=observations_list,
+ )
+
+ logger.info(f"\n[OK] Created {count} snapshot records")
+
+ # Show snapshot stats
+ stats = db.get_snapshot_stats()
+ logger.info("\n=== Snapshot Database Stats ===")
+ logger.info(f" Total snapshots: {stats['total_snapshots']:,}")
+ logger.info(f" Unique events: {stats['unique_events']}")
+ logger.info(f" Unique dates: {stats['unique_dates']}")
+ logger.info(f" Date range: {stats['earliest_date']} to {stats['latest_date']}")
+ logger.info(f" Unique bookmakers: {stats['unique_books']}")
+
+ return {
+ "total": count,
+ "spreads": market_counts.get("spreads", 0),
+ "totals": market_counts.get("totals", 0),
+ "moneylines": market_counts.get("h2h", 0),
+ }
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="Archive daily odds snapshot",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ )
+ parser.add_argument(
+ "--date",
+ type=str,
+ help="Snapshot date (YYYY-MM-DD, default: today in Pacific time)",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be archived without creating snapshots",
+ )
+
+ args = parser.parse_args()
+
+ # Determine snapshot date
+ if args.date:
+ snapshot_date = datetime.fromisoformat(args.date).date()
+ else:
+ # Default to today in Pacific time
+ snapshot_date = datetime.now(PST).date()
+
+ try:
+ result = archive_odds(
+ db_path=args.db_path,
+ snapshot_date=snapshot_date,
+ dry_run=args.dry_run,
+ )
+
+ logger.info("\n[OK] Archive complete!")
+ logger.info(f" Date: {snapshot_date}")
+ logger.info(f" Total observations: {result['total']}")
+
+ sys.exit(0)
+
+    except Exception as e:
+        logger.error(f"[ERROR] Archive failed: {e}", exc_info=True)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/collection/backfill_espn_events.py b/scripts/collection/backfill_espn_events.py
new file mode 100644
index 000000000..59a278e7d
--- /dev/null
+++ b/scripts/collection/backfill_espn_events.py
@@ -0,0 +1,294 @@
+"""Backfill historical events and scores from ESPN.
+
+Populates the database with comprehensive event coverage from ESPN,
+aiming for complete score coverage of all past games.
+
+Usage:
+    # Backfill from start of season to today
+    uv run python scripts/collection/backfill_espn_events.py --start 2025-12-28
+
+    # Backfill specific date range
+    uv run python scripts/collection/backfill_espn_events.py --start 2025-12-28 --end 2026-01-31
+
+    # Dry run (show what would be added)
+    uv run python scripts/collection/backfill_espn_events.py --start 2025-12-28 --dry-run
+"""
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+
+# Ensure log directory exists
+Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler("data/logs/backfill_espn_events.log"),
+ ],
+)
+logger = logging.getLogger(__name__)
+
+
+async def backfill_espn_events(
+ db_path: Path,
+ start_date: date,
+ end_date: date,
+ dry_run: bool = False,
+) -> None:
+ """Backfill events and scores from ESPN.
+
+ Args:
+ db_path: Path to SQLite database
+ start_date: Start date (inclusive)
+ end_date: End date (inclusive)
+ dry_run: If True, show what would be added without storing
+ """
+ from sports_betting_edge.adapters.espn import fetch_scoreboard
+ from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+ from sports_betting_edge.core.event_id import generate_event_id
+ from sports_betting_edge.core.team_mapper import TeamMapper
+
+ logger.info("=" * 80)
+ logger.info("ESPN EVENTS BACKFILL")
+ logger.info("=" * 80)
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Dry run: {dry_run}")
+ logger.info("=" * 80)
+
+ # Load team mapper
+ try:
+ mapper = TeamMapper()
+ except FileNotFoundError:
+ logger.error("Team mapping not found. Run scripts/create_team_mapping.py first")
+ sys.exit(1)
+
+ db = OddsAPIDatabase(db_path) if not dry_run else None
+
+ events_stored = 0
+ events_updated = 0
+ scores_stored = 0
+ scores_updated = 0
+
+ try:
+ current_date = start_date
+
+ while current_date <= end_date:
+ logger.info(f"Processing {current_date}...")
+
+ try:
+ scoreboard = await fetch_scoreboard(current_date)
+ espn_games = scoreboard.get("events", [])
+
+ logger.info(f" Found {len(espn_games)} games on ESPN")
+
+ for espn_event in espn_games:
+ # Extract game details
+ competitions = espn_event.get("competitions", [])
+ if not competitions:
+ continue
+
+ competition = competitions[0]
+ competitors = competition.get("competitors", [])
+ if len(competitors) != 2:
+ continue
+
+ home_comp = next((c for c in competitors if c.get("homeAway") == "home"), None)
+ away_comp = next((c for c in competitors if c.get("homeAway") == "away"), None)
+
+ if not home_comp or not away_comp:
+ continue
+
+ # Get team names
+ espn_home = home_comp.get("team", {}).get("displayName", "")
+ espn_away = away_comp.get("team", {}).get("displayName", "")
+
+ # Map to Odds API team names
+ kenpom_home = mapper.get_kenpom_name(espn_home, source="espn")
+ kenpom_away = mapper.get_kenpom_name(espn_away, source="espn")
+ odds_home = mapper.get_odds_api_name(kenpom_home)
+ odds_away = mapper.get_odds_api_name(kenpom_away)
+
+ # Get commence time
+ game_date = espn_event.get("date", "")
+ if not game_date:
+ continue
+
+ # Generate deterministic event ID
+ event_id = generate_event_id(odds_home, odds_away, game_date, source="espn")
+
+ if dry_run:
+ logger.info(f" [DRY RUN] Would add: {odds_away} @ {odds_home}")
+ events_stored += 1
+ else:
+ # Check if event exists
+ existing = db.conn.execute(
+ "SELECT event_id, source FROM events WHERE event_id = ?",
+ (event_id,),
+ ).fetchone()
+
+ if existing:
+ # Update existing event
+ db.conn.execute(
+ """
+ UPDATE events
+ SET home_team = ?, away_team = ?, commence_time = ?
+ WHERE event_id = ?
+ """,
+ (odds_home, odds_away, game_date, event_id),
+ )
+ events_updated += 1
+ else:
+ # Insert new event
+ db.conn.execute(
+ """
+ INSERT INTO events
+ (event_id, sport_key, home_team, away_team, commence_time,
+ created_at, source, has_odds)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ odds_home,
+ odds_away,
+ game_date,
+ datetime.now().isoformat(),
+ "espn",
+ 0, # Will be updated if odds added later
+ ),
+ )
+ events_stored += 1
+
+ # Check for scores (completed games)
+ status = espn_event.get("status", {})
+ status_type = status.get("type", {})
+ completed = status_type.get("completed", False)
+
+                    # Skip score handling in dry-run mode (db is None when dry_run=True)
+                    if completed and not dry_run:
+                        home_score = home_comp.get("score")
+                        away_score = away_comp.get("score")
+
+ if home_score is not None and away_score is not None:
+ # Check if score exists
+ existing_score = db.conn.execute(
+ "SELECT event_id FROM scores WHERE event_id = ?",
+ (event_id,),
+ ).fetchone()
+
+ if existing_score:
+ # Update score
+ db.conn.execute(
+ """
+ UPDATE scores
+ SET home_score = ?, away_score = ?,
+ last_update = ?, fetched_at = ?
+ WHERE event_id = ?
+ """,
+ (
+ int(home_score),
+ int(away_score),
+ datetime.now().isoformat(),
+ datetime.now().isoformat(),
+ event_id,
+ ),
+ )
+ scores_updated += 1
+ else:
+ # Insert score
+ db.conn.execute(
+ """
+ INSERT INTO scores
+ (event_id, sport_key, completed, home_score,
+ away_score, last_update, fetched_at)
+ VALUES (?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ 1,
+ int(home_score),
+ int(away_score),
+ datetime.now().isoformat(),
+ datetime.now().isoformat(),
+ ),
+ )
+ scores_stored += 1
+
+ if not dry_run:
+ db.conn.commit()
+
+ except Exception as e:
+ logger.error(f"Error processing {current_date}: {e}")
+
+ # Rate limit
+ await asyncio.sleep(0.5)
+ current_date += timedelta(days=1)
+
+ logger.info("")
+ logger.info("=" * 80)
+ logger.info("BACKFILL SUMMARY")
+ logger.info("=" * 80)
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"New events: {events_stored}")
+ logger.info(f"Updated events: {events_updated}")
+ logger.info(f"New scores: {scores_stored}")
+ logger.info(f"Updated scores: {scores_updated}")
+ logger.info(f"Total events: {events_stored + events_updated}")
+ logger.info(f"Total scores: {scores_stored + scores_updated}")
+
+ if not dry_run:
+ logger.info("")
+ logger.info("[OK] Backfill complete!")
+ else:
+ logger.info("")
+ logger.info("[DRY RUN] Complete (no changes made)")
+
+ finally:
+ if db:
+ db.close()
+
+
+def main() -> None:
+ """Run ESPN events backfill."""
+ parser = argparse.ArgumentParser(description="Backfill events and scores from ESPN")
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--start",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ required=True,
+ help="Start date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ default=date.today(),
+ help="End date (YYYY-MM-DD, default: today)",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be added without storing",
+ )
+
+ args = parser.parse_args()
+
+ try:
+ asyncio.run(backfill_espn_events(args.db, args.start, args.end, args.dry_run))
+ except Exception as e:
+ logger.error(f"Backfill failed: {e}", exc_info=True)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
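The select-then-branch pattern above (SELECT by `event_id`, then INSERT or UPDATE) can be collapsed into a single SQLite upsert. A sketch, assuming `event_id` is the primary key of `events` (the schema DDL is not shown in these scripts):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        event_id TEXT PRIMARY KEY, sport_key TEXT,
        home_team TEXT, away_team TEXT, commence_time TEXT,
        created_at TEXT, source TEXT, has_odds INTEGER
    )
    """
)

def upsert_event(conn, event_id, home, away, commence_time):
    # ON CONFLICT updates only the mutable columns, replacing
    # the separate SELECT / INSERT / UPDATE round trips.
    conn.execute(
        """
        INSERT INTO events
            (event_id, sport_key, home_team, away_team, commence_time,
             created_at, source, has_odds)
        VALUES (?, 'basketball_ncaab', ?, ?, ?, ?, 'espn', 0)
        ON CONFLICT(event_id) DO UPDATE SET
            home_team = excluded.home_team,
            away_team = excluded.away_team,
            commence_time = excluded.commence_time
        """,
        (event_id, home, away, commence_time,
         datetime.now(timezone.utc).isoformat()),
    )

upsert_event(conn, "e1", "Duke", "UNC", "2026-01-15T00:00:00Z")
upsert_event(conn, "e1", "Duke", "UNC", "2026-01-15T02:00:00Z")  # reschedule
row = conn.execute("SELECT commence_time FROM events WHERE event_id = 'e1'").fetchone()
print(row[0])  # 2026-01-15T02:00:00Z
```

The upsert is also atomic per statement, which avoids a race if two backfill runs ever overlap.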
diff --git a/scripts/collection/backfill_espn_scores.py b/scripts/collection/backfill_espn_scores.py
new file mode 100644
index 000000000..ce2780906
--- /dev/null
+++ b/scripts/collection/backfill_espn_scores.py
@@ -0,0 +1,550 @@
+"""Backfill missing scores from ESPN scoreboard API.
+
+Uses ESPN's public scoreboard API to fill gaps in score collection where
+The Odds API historical data is unavailable (beyond 3-day limit).
+
+Usage:
+    # Auto-detect missing scores and backfill from ESPN
+    uv run python scripts/collection/backfill_espn_scores.py --auto-detect
+
+    # Backfill specific date range
+    uv run python scripts/collection/backfill_espn_scores.py \
+        --start 2025-12-28 \
+        --end 2026-01-23
+
+    # Dry run (show what would be fetched without storing)
+    uv run python scripts/collection/backfill_espn_scores.py \
+        --start 2025-12-28 \
+        --end 2026-01-23 \
+        --dry-run
+
+Notes:
+ - Matches ESPN games to Odds API events using team mapper
+ - Only fills scores for events that exist in our database
+ - Stores in same format as Odds API scores
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from sports_betting_edge.adapters.espn import (
+ ESPNClient,
+ parse_espn_score,
+)
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.core.team_mapper import TeamMapper
+
+# Ensure log directory exists before the FileHandler is created
+Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler("data/logs/backfill_espn_scores.log"),
+ ],
+)
+logger = logging.getLogger(__name__)
+
+
+def detect_missing_date_range(db_path: Path) -> tuple[date, date] | None:
+ """Detect date range with missing scores.
+
+ Args:
+ db_path: Path to SQLite database
+
+ Returns:
+ (start_date, end_date) tuple if gaps found, None otherwise
+ """
+ db = OddsAPIDatabase(str(db_path))
+
+ try:
+ query = """
+ WITH event_dates AS (
+ SELECT MIN(DATE(commence_time)) as earliest_event,
+ MAX(DATE(commence_time)) as latest_event
+ FROM events
+ WHERE DATE(commence_time) < DATE('now')
+ ),
+ score_dates AS (
+ SELECT MIN(DATE(e.commence_time)) as earliest_score
+ FROM scores s
+ INNER JOIN events e ON s.event_id = e.event_id
+ WHERE s.completed = 1
+ )
+ SELECT
+ event_dates.earliest_event,
+ score_dates.earliest_score,
+ event_dates.latest_event
+ FROM event_dates, score_dates
+ """
+
+ result = db.conn.execute(query).fetchone()
+
+ if not result:
+ logger.warning("No events found in database")
+ return None
+
+ earliest_event, earliest_score, latest_event = result
+
+ if earliest_score is None:
+ return (
+ datetime.fromisoformat(earliest_event).date(),
+ datetime.fromisoformat(latest_event).date(),
+ )
+
+ earliest_event_date = datetime.fromisoformat(earliest_event).date()
+ earliest_score_date = datetime.fromisoformat(earliest_score).date()
+
+ if earliest_event_date < earliest_score_date:
+            # Clamp the gap end to today so a future-dated first score
+            # cannot push the range past the present
+            gap_end = min(earliest_score_date - timedelta(days=1), date.today())
+ logger.info(
+ f"Gap detected: {earliest_event_date} to {gap_end} "
+ f"(scores start at {earliest_score_date})"
+ )
+ return (earliest_event_date, gap_end)
+
+ logger.info("[OK] No gaps detected in score collection")
+ return None
+
+ finally:
+ db.close()
+
+
+def store_espn_scores(
+ espn_events: list[dict[str, Any]],
+ db: OddsAPIDatabase,
+ team_mapper: TeamMapper,
+ target_date: date,
+ dry_run: bool = False,
+) -> dict[str, Any]:
+ """Parse and store scores from pre-fetched ESPN events.
+
+ Args:
+ espn_events: Raw event dicts from ESPN scoreboard API.
+ db: Database adapter.
+ team_mapper: Team name mapper.
+ target_date: Date these events belong to (for logging/event creation).
+ dry_run: If True, log what would be stored without making changes.
+
+ Returns:
+ Dictionary with collection metrics.
+ """
+ logger.info(
+ "Processing %d ESPN events for %s...",
+ len(espn_events),
+ target_date,
+ )
+
+ if not espn_events:
+ return {"scores_stored": 0, "scores_updated": 0, "no_games": True}
+
+ # Get our events for this date from database
+ our_events = pd.read_sql_query(
+ """
+ SELECT event_id, home_team, away_team, commence_time
+ FROM events
+ WHERE DATE(commence_time) = ?
+ """,
+ db.conn,
+ params=(target_date.strftime("%Y-%m-%d"),),
+ )
+
+ logger.info(
+ " We have %d events in database for %s",
+ len(our_events),
+ target_date,
+ )
+
+ scores_stored = 0
+ scores_updated = 0
+ scores_skipped = 0
+ events_created = 0
+ match_failures: list[str] = []
+
+ for espn_event in espn_events:
+ score = parse_espn_score(espn_event)
+ if score is None:
+ continue
+
+ if not score["completed"]:
+ continue
+
+ espn_home_team = score["espn_home_team"]
+ espn_away_team = score["espn_away_team"]
+ home_score = score["home_score"]
+ away_score = score["away_score"]
+
+ # Convert ESPN team names to Odds API names
+ odds_home_team = team_mapper.get_odds_api_name(
+ team_mapper.get_kenpom_name(espn_home_team, source="espn")
+ )
+ odds_away_team = team_mapper.get_odds_api_name(
+ team_mapper.get_kenpom_name(espn_away_team, source="espn")
+ )
+
+ # Find matching event in our database
+ matching_event = our_events[
+ (
+ (our_events["home_team"] == odds_home_team)
+ & (our_events["away_team"] == odds_away_team)
+ )
+ | (
+ (our_events["home_team"] == espn_home_team)
+ & (our_events["away_team"] == espn_away_team)
+ )
+ ]
+
+ # If no matching event exists, create one
+ if len(matching_event) == 0:
+ import hashlib
+
+ espn_id = score["game_id"]
+ event_id = (
+ espn_id
+ if espn_id
+ else hashlib.md5(
+ f"{odds_away_team}@{odds_home_team}_{target_date}".encode()
+ ).hexdigest()
+ )
+
+ commence_time = score["game_date"] if score["game_date"] else f"{target_date}T12:00:00Z"
+
+ if dry_run:
+ logger.info(
+ " [DRY RUN] Would create event: %s @ %s (event_id: %s)",
+ espn_away_team,
+ espn_home_team,
+ event_id,
+ )
+ else:
+ try:
+ db.conn.execute(
+ """
+ INSERT INTO events
+ (event_id, home_team, away_team,
+ commence_time, sport_key, created_at,
+ has_odds)
+ VALUES (?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ odds_home_team,
+ odds_away_team,
+ commence_time,
+ "basketball_ncaab",
+ datetime.now().isoformat(),
+ 0,
+ ),
+ )
+ events_created += 1
+ logger.info(
+ " Created event: %s @ %s (event_id: %s)",
+ odds_away_team,
+ odds_home_team,
+ event_id,
+ )
+ except Exception as e:
+ logger.debug(" Event creation failed (may exist): %s", e)
+ else:
+ event_id = matching_event.iloc[0]["event_id"]
+
+ if dry_run:
+ logger.info(
+ " [DRY RUN] Would store: %s %s @ %s %s (event_id: %s)",
+ espn_away_team,
+ away_score,
+ espn_home_team,
+ home_score,
+ event_id,
+ )
+ scores_stored += 1
+ continue
+
+ # Check if score already exists
+ existing = db.conn.execute(
+ "SELECT event_id FROM scores WHERE event_id = ?",
+ (event_id,),
+ ).fetchone()
+
+ now_iso = datetime.now().isoformat()
+ if existing:
+ db.conn.execute(
+ """
+ UPDATE scores
+ SET sport_key = ?, completed = ?,
+ home_score = ?, away_score = ?,
+ last_update = ?, fetched_at = ?
+ WHERE event_id = ?
+ """,
+ (
+ "basketball_ncaab",
+ 1,
+ int(home_score),
+ int(away_score),
+ now_iso,
+ now_iso,
+ event_id,
+ ),
+ )
+ scores_updated += 1
+ else:
+ db.conn.execute(
+ """
+ INSERT INTO scores
+ (event_id, sport_key, completed,
+ home_score, away_score,
+ last_update, fetched_at)
+ VALUES (?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ 1,
+ int(home_score),
+ int(away_score),
+ now_iso,
+ now_iso,
+ ),
+ )
+ scores_stored += 1
+
+ if not dry_run:
+ db.conn.commit()
+
+ if events_created > 0:
+ logger.info(" [OK] Created %d new events from ESPN", events_created)
+ if scores_stored > 0:
+ logger.info(" [OK] Stored %d new scores from ESPN", scores_stored)
+ if scores_updated > 0:
+ logger.info(
+ " [OK] Updated %d existing scores from ESPN",
+ scores_updated,
+ )
+ if match_failures:
+ logger.warning(
+ " Could not match %d ESPN games to our events:",
+ len(match_failures),
+ )
+ for failure in match_failures[:10]:
+ logger.warning(" - %s", failure)
+
+ return {
+ "events_created": events_created,
+ "scores_stored": scores_stored,
+ "scores_updated": scores_updated,
+ "scores_skipped": scores_skipped,
+ "match_failures": len(match_failures),
+ }
+
+
+async def backfill_date_range(
+ db_path: Path,
+ team_mapping_path: Path,
+ start_date: date,
+ end_date: date,
+ dry_run: bool = False,
+) -> None:
+ """Backfill scores from ESPN for a date range.
+
+ Args:
+ db_path: Path to SQLite database
+ team_mapping_path: Path to team mapping file
+ start_date: Start date (inclusive)
+ end_date: End date (inclusive)
+ dry_run: If True, show what would be fetched without storing
+ """
+ if start_date > end_date:
+ logger.error(f"Invalid date range: {start_date} > {end_date}")
+ sys.exit(1)
+
+ if end_date > date.today():
+ logger.warning(f"End date {end_date} is in the future, limiting to today")
+ end_date = date.today()
+
+ num_days = (end_date - start_date).days + 1
+
+ logger.info("=" * 80)
+ logger.info("ESPN SCORES BACKFILL")
+ logger.info("=" * 80)
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Total days: {num_days}")
+ logger.info(f"Dry run: {dry_run}")
+ logger.info("=" * 80)
+
+ if dry_run:
+ logger.info("\n[DRY RUN] No changes will be made to database\n")
+
+ # Load team mapper
+ try:
+ mapping_df = read_parquet_df(team_mapping_path)
+ team_mapper = TeamMapper(mapping_df)
+ logger.info(f"Loaded team mapping: {len(mapping_df)} teams")
+ except FileNotFoundError:
+ logger.error(f"Team mapping not found: {team_mapping_path}")
+ logger.error("Run scripts/create_team_mapping.py first")
+ sys.exit(1)
+
+ # Initialize database
+ db = OddsAPIDatabase(db_path)
+
+ try:
+ # Bulk fetch all events using date-range chunks
+ logger.info("Fetching all ESPN events in bulk (7-day chunks)...")
+ async with ESPNClient() as espn:
+ all_events = await espn.fetch_scoreboard_range(start_date, end_date)
+ logger.info("Total ESPN events fetched: %d", len(all_events))
+
+ # Group events by date for per-date database matching
+ from collections import defaultdict
+
+ events_by_date: dict[date, list[dict[str, Any]]] = defaultdict(list)
+ for event in all_events:
+ event_date_str = event.get("date", "")
+ if event_date_str:
+ try:
+ event_date = datetime.fromisoformat(
+ event_date_str.replace("Z", "+00:00")
+ ).date()
+ events_by_date[event_date].append(event)
+ except (ValueError, TypeError):
+ continue
+
+ total_events_created = 0
+ total_stored = 0
+ total_updated = 0
+ total_match_failures = 0
+
+ current_date = start_date
+ while current_date <= end_date:
+ date_events = events_by_date.get(current_date, [])
+ if date_events:
+ metrics = store_espn_scores(
+ date_events,
+ db,
+ team_mapper,
+ current_date,
+ dry_run,
+ )
+ total_events_created += metrics.get("events_created", 0)
+ total_stored += metrics.get("scores_stored", 0)
+ total_updated += metrics.get("scores_updated", 0)
+ total_match_failures += metrics.get("match_failures", 0)
+ current_date += timedelta(days=1)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("ESPN BACKFILL SUMMARY")
+ logger.info("=" * 80)
+ logger.info("Date range: %s to %s", start_date, end_date)
+ logger.info("Days processed: %d", num_days)
+ logger.info("Events created: %d", total_events_created)
+ logger.info("New scores stored: %d", total_stored)
+ logger.info("Existing scores updated: %d", total_updated)
+ logger.info("Total scores: %d", total_stored + total_updated)
+ logger.info("Match failures: %d", total_match_failures)
+
+ if total_match_failures > 0:
+ logger.warning(
+ "\n[WARNING] %d ESPN games could not be matched "
+ "to our events (likely team name mapping issues)",
+ total_match_failures,
+ )
+
+ if not dry_run:
+ logger.info("\n[OK] ESPN backfill complete!")
+ else:
+ logger.info("\n[DRY RUN] Complete (no changes made)")
+
+ finally:
+ db.close()
+
+
+def main() -> None:
+ """Run ESPN scores backfill."""
+ parser = argparse.ArgumentParser(description="Backfill historical scores from ESPN")
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--team-mapping",
+ type=Path,
+ default=Path("data/staging/mappings/team_mapping.parquet"),
+ help="Path to team mapping file",
+ )
+ parser.add_argument(
+ "--start",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ help="Start date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ help="End date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--auto-detect",
+ action="store_true",
+ help="Auto-detect missing date range and backfill",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be fetched without making changes",
+ )
+
+ args = parser.parse_args()
+
+ # Ensure log directory exists
+ Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+ # Determine date range
+ if args.auto_detect:
+ logger.info("Auto-detecting missing date range...")
+ date_range = detect_missing_date_range(args.db)
+
+ if date_range is None:
+ logger.info("[OK] No backfill needed!")
+ return
+
+ start_date, end_date = date_range
+ logger.info(f"Detected gap: {start_date} to {end_date}")
+
+ elif args.start and args.end:
+ start_date = args.start
+ end_date = args.end
+
+ else:
+ logger.error("Must specify either --auto-detect or both --start and --end")
+ parser.print_help()
+ sys.exit(1)
+
+ # Run backfill
+ try:
+ asyncio.run(
+ backfill_date_range(
+ args.db,
+ args.team_mapping,
+ start_date,
+ end_date,
+ args.dry_run,
+ )
+ )
+ except Exception as e:
+ logger.error(f"Backfill failed: {e}", exc_info=True)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
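The bulk-fetch path above groups raw scoreboard events by calendar date before matching them to database rows. The parsing detail worth noting is the `.replace("Z", "+00:00")` shim, since `datetime.fromisoformat` rejects a trailing `Z` before Python 3.11. A self-contained sketch of that grouping step:

```python
from collections import defaultdict
from datetime import date, datetime

def group_events_by_date(events):
    """Bucket raw scoreboard event dicts by calendar date, skipping
    entries whose 'date' field is missing or unparseable."""
    buckets: dict[date, list[dict]] = defaultdict(list)
    for event in events:
        raw = event.get("date", "")
        if not raw:
            continue
        try:
            # fromisoformat rejects a trailing 'Z' before Python 3.11
            when = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        except ValueError:
            continue
        buckets[when.date()].append(event)
    return buckets

sample = [
    {"date": "2026-01-15T00:30:00Z", "id": "a"},
    {"date": "2026-01-15T02:00:00Z", "id": "b"},
    {"date": "not-a-date", "id": "c"},
]
grouped = group_events_by_date(sample)
print(sorted(e["id"] for e in grouped[date(2026, 1, 15)]))  # ['a', 'b']
```

Grouping once up front means the per-date database query runs at most once per day in the range, instead of once per scraped event.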
diff --git a/scripts/collection/backfill_historical_scores.py b/scripts/collection/backfill_historical_scores.py
new file mode 100644
index 000000000..0a4384da2
--- /dev/null
+++ b/scripts/collection/backfill_historical_scores.py
@@ -0,0 +1,477 @@
+"""Backfill historical scores from The Odds API.
+
+Fills gaps in score collection for past games where we have odds but no outcomes.
+
+Usage:
+    # Backfill specific date range
+    uv run python scripts/collection/backfill_historical_scores.py \
+        --start 2025-12-28 \
+        --end 2026-01-23
+
+    # Dry run (show what would be fetched without making API calls)
+    uv run python scripts/collection/backfill_historical_scores.py \
+        --start 2025-12-28 \
+        --end 2026-01-23 \
+        --dry-run
+
+    # Auto-detect gaps and backfill
+    uv run python scripts/collection/backfill_historical_scores.py --auto-detect
+
+Environment:
+ ODDS_API_KEY: Required - Your Odds API key
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import os
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+from typing import Any
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+# Ensure log directory exists before the FileHandler is created
+Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler("data/logs/backfill_scores.log"),
+ ],
+)
+logger = logging.getLogger(__name__)
+
+
+def check_api_key() -> str:
+ """Check that ODDS_API_KEY environment variable is set.
+
+ Returns:
+ API key
+
+ Raises:
+ SystemExit: If API key not found
+ """
+ api_key = os.getenv("ODDS_API_KEY")
+ if not api_key:
+ logger.error("ODDS_API_KEY environment variable not set!")
+ logger.error("Set it with: export ODDS_API_KEY='your_key_here'")
+ sys.exit(1)
+ return api_key
+
+
+def detect_missing_date_range(db_path: Path) -> tuple[date, date] | None:
+ """Detect date range with missing scores.
+
+ Args:
+ db_path: Path to SQLite database
+
+ Returns:
+ (start_date, end_date) tuple if gaps found, None otherwise
+ """
+ db = OddsAPIDatabase(str(db_path))
+
+ try:
+ # Find earliest event and earliest score
+ query = """
+ WITH event_dates AS (
+ SELECT MIN(DATE(commence_time)) as earliest_event,
+ MAX(DATE(commence_time)) as latest_event
+ FROM events
+ WHERE DATE(commence_time) < DATE('now')
+ ),
+ score_dates AS (
+ SELECT MIN(DATE(e.commence_time)) as earliest_score,
+ MAX(DATE(e.commence_time)) as latest_score
+ FROM scores s
+ INNER JOIN events e ON s.event_id = e.event_id
+ WHERE s.completed = 1
+ )
+ SELECT
+ event_dates.earliest_event,
+ score_dates.earliest_score,
+ event_dates.latest_event
+ FROM event_dates, score_dates
+ """
+
+ result = db.conn.execute(query).fetchone()
+
+ if not result:
+ logger.warning("No events found in database")
+ return None
+
+ earliest_event, earliest_score, latest_event = result
+
+ if earliest_score is None:
+ # No scores at all
+ logger.info(f"No scores found. Need to backfill: {earliest_event} to {latest_event}")
+ return (
+ datetime.fromisoformat(earliest_event).date(),
+ datetime.fromisoformat(latest_event).date(),
+ )
+
+ # Check if there's a gap
+ earliest_event_date = datetime.fromisoformat(earliest_event).date()
+ earliest_score_date = datetime.fromisoformat(earliest_score).date()
+
+ if earliest_event_date < earliest_score_date:
+ # Gap detected
+ gap_end = earliest_score_date - timedelta(days=1)
+ logger.info(
+ f"Gap detected: {earliest_event_date} to {gap_end} "
+ f"(scores start at {earliest_score_date})"
+ )
+ return (earliest_event_date, gap_end)
+
+ logger.info("[OK] No gaps detected in score collection")
+ return None
+
+ finally:
+ db.close()
+
+
+def backfill_scores_for_date(
+ api_key: str,
+ db: OddsAPIDatabase,
+ target_date: date,
+ dry_run: bool = False,
+) -> dict[str, Any]:
+ """Fetch and store scores for a specific date.
+
+ Args:
+ api_key: Odds API key
+ db: Database adapter
+ target_date: Date to fetch scores for
+ dry_run: If True, log what would be fetched without making API calls
+
+ Returns:
+ Dictionary with collection metrics
+ """
+ base_url = "https://api.the-odds-api.com/v4"
+ sport = "basketball_ncaab"
+
+ # The Odds API uses daysFrom parameter to fetch historical scores
+ # IMPORTANT: The API has a limit on historical data (typically 3 days)
+ # We need to calculate days from today to target_date
+ days_ago = (date.today() - target_date).days
+
+ if days_ago < 0:
+ logger.warning(f"Cannot fetch future date: {target_date}")
+ return {"scores": 0, "skipped": True}
+
+ # Check if date is beyond API's historical limit
+ MAX_DAYS_BACK = 3 # Odds API typically only keeps last 3 days
+ if days_ago > MAX_DAYS_BACK:
+ logger.warning(
+ f"Cannot fetch {target_date} ({days_ago} days ago): "
+ f"Odds API only provides last {MAX_DAYS_BACK} days of scores"
+ )
+ return {"scores": 0, "too_old": True, "days_ago": days_ago}
+
+ url = f"{base_url}/sports/{sport}/scores/"
+ params: dict[str, str | int] = {
+ "apiKey": api_key,
+ "daysFrom": days_ago,
+ "dateFormat": "iso",
+ }
+
+ logger.info(f"Fetching scores for {target_date} (daysFrom={days_ago})...")
+
+ if dry_run:
+ logger.info(f"[DRY RUN] Would fetch: {url}")
+ logger.info(f"[DRY RUN] Parameters: {params}")
+ return {"scores": 0, "dry_run": True}
+
+ try:
+ import httpx
+
+ with httpx.Client(timeout=30.0) as client:
+ response = client.get(url, params=params)
+ response.raise_for_status()
+
+ # Check quota
+ remaining = response.headers.get("x-requests-remaining")
+ used = response.headers.get("x-requests-used")
+ logger.info(f"API Quota - Used: {used}, Remaining: {remaining}")
+
+ scores_data = response.json()
+
+ # Filter for completed games on target date
+ completed_games = [
+ g
+ for g in scores_data
+ if g.get("completed") is True
+ and datetime.fromisoformat(g["commence_time"].replace("Z", "+00:00")).date()
+ == target_date
+ ]
+
+ logger.info(f"Found {len(completed_games)} completed games on {target_date}")
+
+ # Store scores
+ scores_stored = 0
+ scores_updated = 0
+ scores_skipped = 0
+
+ for game in completed_games:
+ event_id = game["id"]
+ home_team = game["home_team"]
+ away_team = game["away_team"]
+
+ # Get scores from the 'scores' field
+ scores = game.get("scores")
+ if not scores or len(scores) < 2:
+ logger.debug(f"Skipping {event_id}: incomplete scores")
+ scores_skipped += 1
+ continue
+
+ # Find home and away scores
+ home_score = None
+ away_score = None
+
+ for score in scores:
+ if score["name"] == home_team:
+ home_score = score.get("score")
+ elif score["name"] == away_team:
+ away_score = score.get("score")
+
+ if home_score is not None and away_score is not None:
+ # Check if score already exists
+ existing = db.conn.execute(
+ "SELECT event_id FROM scores WHERE event_id = ?",
+ (event_id,),
+ ).fetchone()
+
+ if existing:
+ # Update existing score
+ db.conn.execute(
+ """
+ UPDATE scores
+ SET sport_key = ?,
+ completed = ?,
+ home_score = ?,
+ away_score = ?,
+ last_update = ?,
+ fetched_at = ?
+ WHERE event_id = ?
+ """,
+ (
+ "basketball_ncaab",
+ 1,
+                                int(home_score),  # API returns scores as strings
+                                int(away_score),
+ game.get("last_update", datetime.now().isoformat()),
+ datetime.now().isoformat(),
+ event_id,
+ ),
+ )
+ scores_updated += 1
+ else:
+ # Insert new score
+ db.conn.execute(
+ """
+ INSERT INTO scores
+ (event_id, sport_key, completed, home_score, away_score,
+ last_update, fetched_at)
+ VALUES (?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ 1,
+                                int(home_score),  # API returns scores as strings
+                                int(away_score),
+ game.get("last_update", datetime.now().isoformat()),
+ datetime.now().isoformat(),
+ ),
+ )
+ scores_stored += 1
+
+ db.conn.commit()
+
+ if scores_stored > 0:
+ logger.info(f"[OK] Stored {scores_stored} new scores for {target_date}")
+ if scores_updated > 0:
+ logger.info(f"[OK] Updated {scores_updated} existing scores for {target_date}")
+ if scores_skipped > 0:
+ logger.debug(f"Skipped {scores_skipped} incomplete scores")
+
+ return {
+ "scores_stored": scores_stored,
+ "scores_updated": scores_updated,
+ "scores_skipped": scores_skipped,
+ "quota_remaining": remaining,
+ }
+
+    except httpx.HTTPStatusError as e:
+        if e.response.status_code == 404:
+            logger.info(f"No scores available for {target_date}")
+            return {"scores_stored": 0, "not_found": True}
+        logger.error(f"HTTP error for {target_date}: {e.response.status_code}")
+        logger.error(f"Response: {e.response.text}")
+        raise
+    except httpx.RequestError as e:
+        logger.error(f"Request error for {target_date}: {e}")
+        raise
+
+
+def backfill_date_range(
+ api_key: str,
+ db_path: Path,
+ start_date: date,
+ end_date: date,
+ dry_run: bool = False,
+) -> None:
+ """Backfill scores for a date range.
+
+ Args:
+ api_key: Odds API key
+ db_path: Path to SQLite database
+ start_date: Start date (inclusive)
+ end_date: End date (inclusive)
+ dry_run: If True, show what would be fetched without making API calls
+ """
+ # Validate date range
+ if start_date > end_date:
+ logger.error(f"Invalid date range: {start_date} > {end_date}")
+ sys.exit(1)
+
+ if end_date > date.today():
+ logger.warning(f"End date {end_date} is in the future, limiting to today")
+ end_date = date.today()
+
+ num_days = (end_date - start_date).days + 1
+
+ logger.info("=" * 80)
+ logger.info("HISTORICAL SCORES BACKFILL")
+ logger.info("=" * 80)
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Total days: {num_days}")
+ logger.info(f"Dry run: {dry_run}")
+ logger.info("=" * 80)
+
+ if dry_run:
+ logger.info("\n[DRY RUN] No API calls will be made\n")
+
+ # Initialize database
+ db = OddsAPIDatabase(db_path)
+
+ try:
+ current_date = start_date
+ total_stored = 0
+ total_updated = 0
+ total_skipped = 0
+ total_too_old = 0
+
+ while current_date <= end_date:
+ metrics = backfill_scores_for_date(api_key, db, current_date, dry_run)
+
+ total_stored += metrics.get("scores_stored", 0)
+ total_updated += metrics.get("scores_updated", 0)
+ total_skipped += metrics.get("scores_skipped", 0)
+ if metrics.get("too_old"):
+ total_too_old += 1
+
+ current_date += timedelta(days=1)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("BACKFILL SUMMARY")
+ logger.info("=" * 80)
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Days processed: {num_days}")
+ logger.info(f"New scores stored: {total_stored}")
+ logger.info(f"Existing scores updated: {total_updated}")
+ logger.info(f"Incomplete scores skipped: {total_skipped}")
+ logger.info(f"Days beyond API limit: {total_too_old}")
+ logger.info(f"Total scores: {total_stored + total_updated}")
+
+ if total_too_old > 0:
+ logger.warning(
+ f"\n[WARNING] Could not backfill {total_too_old} days "
+ f"(beyond Odds API's 3-day historical limit)"
+ )
+
+ if not dry_run:
+ logger.info("\n[OK] Backfill complete!")
+ else:
+ logger.info("\n[DRY RUN] Complete (no changes made)")
+
+ finally:
+ db.close()
+
+
+def main() -> None:
+ """Run historical scores backfill."""
+ parser = argparse.ArgumentParser(description="Backfill historical scores from Odds API")
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--start",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ help="Start date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ help="End date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--auto-detect",
+ action="store_true",
+ help="Auto-detect missing date range and backfill",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be fetched without making API calls",
+ )
+
+ args = parser.parse_args()
+
+ # Ensure log directory exists
+ Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+ # Check API key
+ api_key = check_api_key()
+
+ # Determine date range
+ if args.auto_detect:
+ logger.info("Auto-detecting missing date range...")
+ date_range = detect_missing_date_range(args.db)
+
+ if date_range is None:
+ logger.info("[OK] No backfill needed!")
+ return
+
+ start_date, end_date = date_range
+ logger.info(f"Detected gap: {start_date} to {end_date}")
+
+ elif args.start and args.end:
+ start_date = args.start
+ end_date = args.end
+
+ else:
+ logger.error("Must specify either --auto-detect or both --start and --end")
+ parser.print_help()
+ sys.exit(1)
+
+ # Run backfill
+ try:
+ backfill_date_range(api_key, args.db, start_date, end_date, args.dry_run)
+ except Exception as e:
+ logger.error(f"Backfill failed: {e}", exc_info=True)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
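The summary above counts days that fall outside the provider's historical window; a minimal standalone sketch of that partitioning, assuming the 3-day limit named in the warning (the function and its parameters are illustrative, not part of the script):

```python
from datetime import date, timedelta


def split_by_provider_window(start, end, today, window_days=3):
    """Partition an inclusive date range into days the provider can still
    serve (within the last `window_days` days) and days that are too old.
    window_days=3 mirrors the 3-day historical limit warned about above."""
    cutoff = today - timedelta(days=window_days)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return (
        [d for d in days if d >= cutoff],  # backfillable
        [d for d in days if d < cutoff],   # beyond the API's window
    )


in_window, too_old = split_by_provider_window(
    date(2026, 2, 1), date(2026, 2, 7), today=date(2026, 2, 7)
)
```

The `too_old` list corresponds to the "Days beyond API limit" counter in the summary; those days need a different source (e.g. the ESPN recap script) rather than a retry.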
diff --git a/scripts/collection/backfill_odds_api_scores.py b/scripts/collection/backfill_odds_api_scores.py
new file mode 100644
index 000000000..b4d8de49d
--- /dev/null
+++ b/scripts/collection/backfill_odds_api_scores.py
@@ -0,0 +1,245 @@
+"""Backfill scores using The Odds API scores endpoint.
+
+This script uses the Odds API's /scores endpoint instead of ESPN,
+which provides exact event ID matching (no team name matching required).
+
+Usage:
+    python scripts/collection/backfill_odds_api_scores.py --days-from 3
+    python scripts/collection/backfill_odds_api_scores.py --days-from 2 --dry-run
+
+Cost:
+ 2 API credits per request (daysFrom parameter)
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sqlite3
+from pathlib import Path
+from typing import Any
+
+from sports_betting_edge.adapters.odds_api import OddsAPIAdapter
+from sports_betting_edge.config.logging import configure_logging
+
+logger = logging.getLogger(__name__)
+
+
+async def fetch_scores(adapter: OddsAPIAdapter, days_from: int = 3) -> list[dict[str, Any]]:
+ """Fetch scores from Odds API.
+
+ Args:
+ adapter: Odds API adapter
+ days_from: Number of days to look back (1-3)
+
+ Returns:
+ List of events with scores
+ """
+ logger.info(f"Fetching scores from Odds API (last {days_from} days)...")
+ scores = await adapter.get_ncaab_scores(days_from=days_from)
+ logger.info(f" Received {len(scores)} events from Odds API")
+ return scores
+
+
+def update_scores_in_database(
+ db_path: str, scores_data: list[dict[str, Any]], dry_run: bool = False
+) -> dict[str, int]:
+ """Update scores in database.
+
+ Args:
+ db_path: Path to SQLite database
+ scores_data: List of score events from Odds API
+ dry_run: If True, don't actually update database
+
+ Returns:
+ Dictionary with update statistics
+ """
+ stats = {
+ "total_received": len(scores_data),
+ "completed_games": 0,
+ "in_progress_games": 0,
+ "not_started": 0,
+ "new_scores": 0,
+ "updated_scores": 0,
+ "events_not_in_db": 0,
+ }
+
+ conn = sqlite3.connect(db_path)
+ cursor = conn.cursor()
+
+ for event in scores_data:
+ event_id = event.get("id")
+ completed = event.get("completed", False)
+ scores = event.get("scores") or [] # Handle None case
+
+ # Categorize event status
+ if completed:
+ stats["completed_games"] += 1
+ elif len(scores) > 0:
+ stats["in_progress_games"] += 1
+ else:
+ stats["not_started"] += 1
+
+ # Only process completed games with scores
+ if not completed or len(scores) == 0:
+ continue
+
+ # Extract scores
+ home_score = None
+ away_score = None
+ for score in scores:
+ team_name = score.get("name")
+ team_score = score.get("score")
+
+ if team_name == event.get("home_team"):
+ home_score = team_score
+ elif team_name == event.get("away_team"):
+ away_score = team_score
+
+ if home_score is None or away_score is None:
+ logger.warning(
+ f" Event {event_id}: Incomplete scores (home={home_score}, away={away_score})"
+ )
+ continue
+
+ # Check if event exists in our database
+ cursor.execute("SELECT event_id FROM events WHERE event_id = ?", (event_id,))
+ if cursor.fetchone() is None:
+ stats["events_not_in_db"] += 1
+ logger.debug(f" Event {event_id}: Not in database (skipping)")
+ continue
+
+ # Check if score already exists
+ cursor.execute("SELECT event_id FROM scores WHERE event_id = ?", (event_id,))
+ existing = cursor.fetchone()
+
+ if existing:
+ # Update existing score
+ if not dry_run:
+ cursor.execute(
+ """
+ UPDATE scores
+ SET home_score = ?, away_score = ?, completed = 1,
+ last_update = ?, fetched_at = CURRENT_TIMESTAMP
+ WHERE event_id = ?
+ """,
+ (home_score, away_score, event.get("last_update"), event_id),
+ )
+ stats["updated_scores"] += 1
+ home_team = event.get("home_team", "")[:25]
+ away_team = event.get("away_team", "")[:25]
+ logger.info(
+ f" [UPDATE] {event_id}: {home_team} {home_score} - {away_score} {away_team}"
+ )
+ else:
+ # Insert new score
+ if not dry_run:
+ cursor.execute(
+ """
+ INSERT INTO scores (
+ event_id, sport_key, completed,
+ home_score, away_score, last_update, fetched_at
+ ) VALUES (?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
+ """,
+ (
+ event_id,
+ event.get("sport_key", "basketball_ncaab"),
+ 1,
+ home_score,
+ away_score,
+ event.get("last_update"),
+ ),
+ )
+ stats["new_scores"] += 1
+ home_team = event.get("home_team", "")[:25]
+ away_team = event.get("away_team", "")[:25]
+ logger.info(f" [NEW] {event_id}: {home_team} {home_score} - {away_score} {away_team}")
+
+ if not dry_run:
+ conn.commit()
+ logger.info(
+ f" Committed {stats['new_scores'] + stats['updated_scores']} score updates to database"
+ )
+ else:
+ logger.info(
+ f" [DRY RUN] Would have updated {stats['new_scores'] + stats['updated_scores']} scores"
+ )
+
+ conn.close()
+ return stats
+
+
+async def main() -> None:
+ """Main backfill function."""
+ parser = argparse.ArgumentParser(description="Backfill scores from Odds API")
+ parser.add_argument(
+ "--days-from",
+ type=int,
+ default=3,
+ choices=[1, 2, 3],
+ help="Days to look back (1-3, default: 3)",
+ )
+ parser.add_argument(
+ "--db-path",
+ default="data/odds_api/odds_api.sqlite3",
+ help="Path to SQLite database",
+ )
+ parser.add_argument("--dry-run", action="store_true", help="Don't actually update database")
+ args = parser.parse_args()
+
+ logger.info("=" * 80)
+ logger.info("ODDS API SCORES BACKFILL")
+ logger.info("=" * 80)
+ logger.info(f"Days from: {args.days_from}")
+ logger.info(f"Database: {args.db_path}")
+ logger.info(f"Dry run: {args.dry_run}")
+ logger.info("=" * 80)
+
+ # Verify database exists
+ db_path = Path(args.db_path)
+ if not db_path.exists():
+ logger.error(f"Database not found: {db_path}")
+ return
+
+ # Initialize Odds API adapter
+ adapter = OddsAPIAdapter()
+
+ try:
+ # Fetch scores
+ scores_data = await fetch_scores(adapter, days_from=args.days_from)
+
+ # Update database
+ logger.info("\nUpdating scores in database...")
+ stats = update_scores_in_database(str(db_path), scores_data, dry_run=args.dry_run)
+
+ # Print summary
+ logger.info("\n" + "=" * 80)
+ logger.info("BACKFILL SUMMARY")
+ logger.info("=" * 80)
+ logger.info(f"Total events received: {stats['total_received']}")
+ logger.info(f" Completed games: {stats['completed_games']}")
+ logger.info(f" In progress: {stats['in_progress_games']}")
+ logger.info(f" Not started: {stats['not_started']}")
+ logger.info("\nScore updates:")
+ logger.info(f" New scores inserted: {stats['new_scores']}")
+ logger.info(f" Existing scores updated: {stats['updated_scores']}")
+ logger.info(f" Events not in database: {stats['events_not_in_db']}")
+ logger.info(f"\nTotal scores updated: {stats['new_scores'] + stats['updated_scores']}")
+
+ if args.dry_run:
+ logger.info("\n[DRY RUN] No changes were made to the database")
+ else:
+ logger.info("\n[OK] Scores backfill complete!")
+
+        # Check quota (reads a private adapter attribute; a public accessor would be cleaner)
+ if adapter._quota_remaining is not None:
+ logger.info(f"\nAPI Quota: {adapter._quota_remaining} requests remaining")
+
+ finally:
+ await adapter.close()
+
+
+if __name__ == "__main__":
+ configure_logging()
+ asyncio.run(main())
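The per-team extraction loop above maps the API's list-of-team-rows shape onto home/away values by name; a self-contained mirror of that logic (field names copied from the code, the sample payload is hypothetical):

```python
def extract_scores(event):
    """Mirror of the loop above: match each score row to home/away by team name."""
    home = away = None
    for row in event.get("scores") or []:  # "scores" may be None for unstarted games
        if row.get("name") == event.get("home_team"):
            home = row.get("score")
        elif row.get("name") == event.get("away_team"):
            away = row.get("score")
    return home, away


# Hypothetical event shaped like the fields the script reads
sample = {
    "id": "evt-1",
    "completed": True,
    "home_team": "Duke",
    "away_team": "North Carolina",
    "scores": [
        {"name": "North Carolina", "score": "74"},
        {"name": "Duke", "score": "78"},
    ],
}

print(extract_scores(sample))  # ('78', '74')
```

Note the scores arrive as strings here; the script stores them as-is, so any numeric comparison downstream should cast first.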
diff --git a/scripts/collection/collect_action_network.py b/scripts/collection/collect_action_network.py
new file mode 100644
index 000000000..c9b861ff8
--- /dev/null
+++ b/scripts/collection/collect_action_network.py
@@ -0,0 +1,125 @@
+"""Collect Action Network NCAAB scoreboard data to Parquet.
+
+Fetches public betting percentages, multi-book odds, and game data
+from Action Network's public API (no auth required).
+
+Usage:
+ uv run python scripts/collection/collect_action_network.py
+ uv run python scripts/collection/collect_action_network.py --date 2026-02-07
+ uv run python scripts/collection/collect_action_network.py \
+ --start-date 2025-11-04 --end-date 2026-02-08
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import date, datetime
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+from sports_betting_edge.services.action_network_collection import (
+ collect_to_parquet,
+)
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+def _log_setup() -> None:
+ log_dir = Path("data") / "logs"
+ log_dir.mkdir(parents=True, exist_ok=True)
+ log_path = log_dir / "action_network.log"
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler(log_path, encoding="utf-8"),
+ ],
+ )
+
+
+def _parse_date(value: str) -> date:
+ return datetime.fromisoformat(value).date()
+
+
+def _default_date() -> date:
+ return datetime.now(PST).date()
+
+
+async def _run(
+ single_date: date | None,
+ start_date: date | None,
+ end_date: date | None,
+) -> int:
+ return await collect_to_parquet(
+ single_date=single_date,
+ start_date=start_date,
+ end_date=end_date,
+ )
+
+
+def main() -> int:
+ _log_setup()
+ logger = logging.getLogger(__name__)
+
+ parser = argparse.ArgumentParser(description="Collect Action Network NCAAB scoreboard data")
+ parser.add_argument(
+ "--date",
+ type=_parse_date,
+ help="Single date to collect (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--start-date",
+ type=_parse_date,
+ help="Start of date range (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end-date",
+ type=_parse_date,
+ help="End of date range (YYYY-MM-DD)",
+ )
+ args = parser.parse_args()
+
+ # Determine mode
+ if args.start_date and args.end_date:
+ logger.info(
+ "Collecting Action Network data: %s to %s",
+ args.start_date.isoformat(),
+ args.end_date.isoformat(),
+ )
+ single_date = None
+ start_date = args.start_date
+ end_date = args.end_date
+ elif args.date:
+ logger.info(
+ "Collecting Action Network data for %s",
+ args.date.isoformat(),
+ )
+ single_date = args.date
+ start_date = None
+ end_date = None
+ else:
+ target = _default_date()
+ logger.info(
+ "Collecting Action Network data for today: %s",
+ target.isoformat(),
+ )
+ single_date = target
+ start_date = None
+ end_date = None
+
+ try:
+ count = asyncio.run(_run(single_date, start_date, end_date))
+ logger.info("Collected %d game(s)", count)
+ return 0
+ except Exception as exc: # noqa: BLE001
+ logger.exception("Failed to collect Action Network data: %s", exc)
+ return 1
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
diff --git a/scripts/collection/collect_daily_data.py b/scripts/collection/collect_daily_data.py
new file mode 100644
index 000000000..24e0e5866
--- /dev/null
+++ b/scripts/collection/collect_daily_data.py
@@ -0,0 +1,163 @@
+"""Collect daily KenPom and ESPN data.
+
+Usage:
+    uv run python scripts/collection/collect_daily_data.py
+    uv run python scripts/collection/collect_daily_data.py --date 2026-02-01
+"""
+
+import asyncio
+import json
+import logging
+import subprocess
+from datetime import date
+from pathlib import Path
+
+from sports_betting_edge.adapters.filesystem import write_parquet
+from sports_betting_edge.core.models import ESPNGame
+from sports_betting_edge.services.kenpom_collection import (
+ collect_kenpom_four_factors,
+ collect_kenpom_misc_stats,
+ collect_kenpom_ratings,
+)
+from sports_betting_edge.utils.time import utc_now
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+async def collect_kenpom_daily(season: int = 2026) -> None:
+ """Collect fresh KenPom data for current season.
+
+ Args:
+ season: Season year
+ """
+ logger.info(f"Collecting KenPom data for season {season}...")
+
+ # Collect ratings
+ try:
+ count = await collect_kenpom_ratings(season=season)
+ logger.info(f"[OK] Collected {count} team ratings")
+ except Exception as e:
+ logger.error(f"[ERROR] Failed to collect ratings: {e}")
+
+ # Collect four factors
+ try:
+ count = await collect_kenpom_four_factors(season=season)
+ logger.info(f"[OK] Collected {count} four factors records")
+ except Exception as e:
+ logger.error(f"[ERROR] Failed to collect four factors: {e}")
+
+ # Collect misc stats
+ try:
+ count = await collect_kenpom_misc_stats(season=season)
+ logger.info(f"[OK] Collected {count} misc stats records")
+ except Exception as e:
+ logger.error(f"[ERROR] Failed to collect misc stats: {e}")
+
+
+def collect_espn_schedule(target_date: date | None = None) -> None:
+ """Collect ESPN schedule for a specific date using web scraper.
+
+ Uses Puppeteer scraper to get complete schedule (25+ games) instead of
+ limited API (4 games).
+
+ Args:
+ target_date: Date to collect (default: today)
+ """
+ if target_date is None:
+ target_date = date.today()
+
+ logger.info(f"Collecting ESPN schedule for {target_date}...")
+
+ try:
+ # Format date for ESPN (YYYYMMDD)
+ date_str = target_date.strftime("%Y%m%d")
+
+ # Create temp file for scraper output
+ temp_json = Path(f"data/espn/schedule/{target_date}-temp.json")
+ temp_json.parent.mkdir(parents=True, exist_ok=True)
+
+ # Run Puppeteer scraper
+ puppeteer_script = Path("puppeteer/capture_espn_full_schedule.js")
+ cmd = ["node", str(puppeteer_script), "--date", date_str, "--output", str(temp_json)]
+
+ logger.info(f"Running ESPN scraper: {' '.join(cmd)}")
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
+
+ if result.returncode != 0:
+ logger.error(f"Scraper failed: {result.stderr}")
+ return
+
+ # Load scraped data
+ with open(temp_json) as f:
+ scraped = json.load(f)
+
+ games_raw = scraped.get("games", [])
+ if not games_raw:
+ logger.warning(f"No games found for {target_date}")
+ return
+
+ # Convert to ESPNGame models
+ captured_at = utc_now()
+ games = []
+ for g in games_raw:
+ # Construct game_id from team IDs (ESPN doesn't provide it)
+ game_id = (
+ f"{g.get('away_team_id', 'unknown')}-{g.get('home_team_id', 'unknown')}-{date_str}"
+ )
+
+ game = ESPNGame(
+ game_id=game_id,
+ home_team_id=g.get("home_team_id", ""),
+ away_team_id=g.get("away_team_id", ""),
+ home_team=g.get("home_team", ""),
+ away_team=g.get("away_team", ""),
+            game_date=captured_at,  # Scraper output lacks tip time; capture time is a placeholder
+ status="Scheduled",
+ home_score=None,
+ away_score=None,
+ captured_at=captured_at,
+ )
+ games.append(game)
+
+ # Write to Parquet
+ output_dir = Path("data/espn/schedule")
+ output_path = output_dir / f"{target_date}.parquet"
+
+ data = [game.model_dump(mode="json") for game in games]
+ write_parquet(data, output_path)
+
+ # Clean up temp file
+ temp_json.unlink()
+
+ logger.info(f"[OK] Collected {len(games)} games to {output_path}")
+
+ except subprocess.TimeoutExpired:
+ logger.error("ESPN scraper timed out after 60 seconds")
+ except Exception as e:
+ logger.error(f"[ERROR] Failed to collect ESPN schedule: {e}")
+
+
+async def main() -> None:
+ """Collect all daily data."""
+ import sys
+
+    # Parse optional --date argument (any position); reject a flag given without a value
+    target_date = None
+    if "--date" in sys.argv:
+        idx = sys.argv.index("--date")
+        if idx + 1 >= len(sys.argv):
+            logger.error("--date requires a YYYY-MM-DD value")
+            sys.exit(1)
+        target_date = date.fromisoformat(sys.argv[idx + 1])
+
+ logger.info("Starting daily data collection...")
+
+ # Collect KenPom data (async)
+ await collect_kenpom_daily(season=2026)
+
+ # Collect ESPN schedule (sync - uses Puppeteer)
+ collect_espn_schedule(target_date=target_date)
+
+ logger.info("Daily data collection complete!")
+
+
+if __name__ == "__main__":
+ asyncio.run(main())
diff --git a/scripts/collection/collect_daily_figures_scheduled.py b/scripts/collection/collect_daily_figures_scheduled.py
new file mode 100644
index 000000000..3da0f3508
--- /dev/null
+++ b/scripts/collection/collect_daily_figures_scheduled.py
@@ -0,0 +1,69 @@
+"""Collect Overtime.ag daily figures and store to Parquet (scheduled run).
+
+Usage:
+    uv run python scripts/collection/collect_daily_figures_scheduled.py --user-data-dir /path/to/chrome/profile
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import sys
+from pathlib import Path
+
+from sports_betting_edge.adapters.overtime import run_capture_daily_figures
+from sports_betting_edge.services.overtime_daily_figures import collect_daily_figures_to_parquet
+from sports_betting_edge.utils.time import utc_now
+
+
+def _log_setup() -> None:
+ log_dir = Path("data") / "logs"
+ log_dir.mkdir(parents=True, exist_ok=True)
+ log_path = log_dir / "daily_figures.log"
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler(log_path, encoding="utf-8"),
+ ],
+ )
+
+
+def _build_output_path() -> Path:
+ stamp = utc_now().strftime("%Y-%m-%d_%H-%M-%S")
+ out_dir = Path("data") / "overtime" / "raw" / "daily_figures"
+ out_dir.mkdir(parents=True, exist_ok=True)
+ return out_dir / f"daily_figures_{stamp}.json"
+
+
+def main() -> int:
+ _log_setup()
+ logger = logging.getLogger(__name__)
+
+ parser = argparse.ArgumentParser(description="Collect Overtime.ag daily figures")
+ parser.add_argument("--user-data-dir", type=str, required=True)
+ parser.add_argument("--headless", type=str, default="true")
+ args = parser.parse_args()
+
+ headless = args.headless.lower() not in {"false", "0", "no"}
+ output_path = _build_output_path()
+
+ try:
+ logger.info("Starting daily figures capture")
+ run_capture_daily_figures(
+ output_path=output_path,
+ headless=headless,
+ user_data_dir=args.user_data_dir,
+ )
+ weeks, outcomes = collect_daily_figures_to_parquet(output_path)
+ logger.info("Captured %d week summary row(s), %d bet outcome(s)", weeks, outcomes)
+ return 0
+ except Exception as exc: # noqa: BLE001
+ logger.exception("Daily figures collection failed: %s", exc)
+ return 1
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
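The `--headless` flag above is a string parsed permissively ("false"/"0"/"no" disable it). On Python 3.9+, `argparse.BooleanOptionalAction` offers the same toggle as a proper `--headless`/`--no-headless` pair, sketched here as an alternative (not what the script currently does):

```python
import argparse

parser = argparse.ArgumentParser()
# Boolean toggle: accepts --headless or --no-headless, defaulting to headless mode
parser.add_argument("--headless", action=argparse.BooleanOptionalAction, default=True)

print(parser.parse_args([]).headless)                 # True
print(parser.parse_args(["--no-headless"]).headless)  # False
```

The string form has one advantage for schedulers that template arguments as `--headless true`; the boolean action is stricter and self-documenting in `--help`.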
diff --git a/scripts/collection/collect_espn_schedule_daily.py b/scripts/collection/collect_espn_schedule_daily.py
new file mode 100644
index 000000000..4c611f0f9
--- /dev/null
+++ b/scripts/collection/collect_espn_schedule_daily.py
@@ -0,0 +1,71 @@
+"""Collect ESPN schedule for the current day (PST) and write to Parquet.
+
+Usage:
+    uv run python scripts/collection/collect_espn_schedule_daily.py
+    uv run python scripts/collection/collect_espn_schedule_daily.py --date 2026-02-04
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import date, datetime
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+from sports_betting_edge.services.espn_schedule_collection import collect_schedule_to_parquet
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+def _log_setup() -> None:
+ log_dir = Path("data") / "logs"
+ log_dir.mkdir(parents=True, exist_ok=True)
+ log_path = log_dir / "espn_schedule.log"
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler(log_path, encoding="utf-8"),
+ ],
+ )
+
+
+def _parse_date(value: str) -> date:
+ return datetime.fromisoformat(value).date()
+
+
+def _default_date() -> date:
+ return datetime.now(PST).date()
+
+
+async def _run(target_date: date) -> int:
+ return await collect_schedule_to_parquet(single_date=target_date)
+
+
+def main() -> int:
+ _log_setup()
+ logger = logging.getLogger(__name__)
+
+ parser = argparse.ArgumentParser(description="Collect ESPN schedule for a date")
+ parser.add_argument("--date", type=_parse_date, help="Target date (YYYY-MM-DD)")
+ args = parser.parse_args()
+
+ target_date = args.date or _default_date()
+ logger.info("Collecting ESPN schedule for %s", target_date.isoformat())
+
+ try:
+ count = asyncio.run(_run(target_date))
+ logger.info("Collected %d game(s)", count)
+ return 0
+ except Exception as exc: # noqa: BLE001
+ logger.exception("Failed to collect ESPN schedule: %s", exc)
+ return 1
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
diff --git a/scripts/collection/collect_espn_scores_recap.py b/scripts/collection/collect_espn_scores_recap.py
new file mode 100644
index 000000000..202b3d947
--- /dev/null
+++ b/scripts/collection/collect_espn_scores_recap.py
@@ -0,0 +1,247 @@
+"""Collect ESPN scores for a single date and update Odds API SQLite DB.
+
+Defaults to yesterday in PST to capture end-of-day results.
+
+Usage:
+    uv run python scripts/collection/collect_espn_scores_recap.py
+    uv run python scripts/collection/collect_espn_scores_recap.py --date 2026-02-04
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import sys
+from datetime import UTC, date, datetime, timedelta
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+import pandas as pd
+
+from sports_betting_edge.adapters.espn import fetch_scoreboard, parse_espn_score
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.core.team_mapper import TeamMapper
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+def _log_setup() -> None:
+ log_dir = Path("data") / "logs"
+ log_dir.mkdir(parents=True, exist_ok=True)
+ log_path = log_dir / "scores_recap.log"
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler(log_path, encoding="utf-8"),
+ ],
+ )
+
+
+def _parse_date(value: str) -> date:
+ return datetime.fromisoformat(value).date()
+
+
+def _default_date() -> date:
+ today_pst = datetime.now(PST).date()
+ return today_pst - timedelta(days=1)
+
+
+async def _collect_scores_for_date(
+ target_date: date,
+ db_path: Path,
+ team_mapping_path: Path,
+ *,
+ include_in_progress: bool,
+) -> dict[str, int]:
+ logger = logging.getLogger(__name__)
+
+ mapping_df = read_parquet_df(str(team_mapping_path))
+ team_mapper = TeamMapper(mapping_df)
+ db = OddsAPIDatabase(db_path)
+
+ try:
+ scoreboard = await fetch_scoreboard(target_date)
+ events = scoreboard.get("events", [])
+ logger.info("ESPN scoreboard events: %d", len(events))
+
+ if not events:
+            return {"stored": 0, "updated": 0, "unmatched": 0, "in_progress": 0}
+
+ our_events = pd.read_sql_query(
+ """
+ SELECT event_id, home_team, away_team, commence_time
+ FROM events
+ WHERE DATE(commence_time) = ?
+ """,
+ db.conn,
+ params=(target_date.strftime("%Y-%m-%d"),),
+ )
+
+ stored = 0
+ updated = 0
+ unmatched = 0
+ in_progress = 0
+
+ for espn_event in events:
+ score = parse_espn_score(espn_event)
+ if score is None:
+ continue
+
+ completed = score["completed"]
+ if not completed and not include_in_progress:
+ continue
+
+ espn_home_team = score["espn_home_team"]
+ espn_away_team = score["espn_away_team"]
+ home_score = score["home_score"]
+ away_score = score["away_score"]
+
+ odds_home_team = team_mapper.get_odds_api_name(
+ team_mapper.get_kenpom_name(espn_home_team, source="espn")
+ )
+ odds_away_team = team_mapper.get_odds_api_name(
+ team_mapper.get_kenpom_name(espn_away_team, source="espn")
+ )
+
+ matching_event = our_events[
+ (
+ (our_events["home_team"] == odds_home_team)
+ & (our_events["away_team"] == odds_away_team)
+ )
+ | (
+ (our_events["home_team"] == espn_home_team)
+ & (our_events["away_team"] == espn_away_team)
+ )
+ ]
+
+ if len(matching_event) == 0:
+ unmatched += 1
+ continue
+
+ event_id = matching_event.iloc[0]["event_id"]
+ now_iso = datetime.now(UTC).isoformat()
+ completed_flag = 1 if completed else 0
+
+ existing = db.conn.execute(
+ "SELECT event_id FROM scores WHERE event_id = ?",
+ (event_id,),
+ ).fetchone()
+
+ if existing:
+ db.conn.execute(
+ """
+ UPDATE scores
+ SET sport_key = ?,
+ completed = ?,
+ home_score = ?,
+ away_score = ?,
+ last_update = ?,
+ fetched_at = ?
+ WHERE event_id = ?
+ """,
+ (
+ "basketball_ncaab",
+ completed_flag,
+ int(home_score),
+ int(away_score),
+ now_iso,
+ now_iso,
+ event_id,
+ ),
+ )
+ updated += 1
+ else:
+ db.conn.execute(
+ """
+ INSERT INTO scores
+ (event_id, sport_key, completed, home_score, away_score,
+ last_update, fetched_at)
+ VALUES (?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ completed_flag,
+ int(home_score),
+ int(away_score),
+ now_iso,
+ now_iso,
+ ),
+ )
+ stored += 1
+ if not completed:
+ in_progress += 1
+
+ db.conn.commit()
+ return {
+ "stored": stored,
+ "updated": updated,
+ "unmatched": unmatched,
+ "in_progress": in_progress,
+ }
+ finally:
+ db.close()
+
+
+def main() -> int:
+ _log_setup()
+ logger = logging.getLogger(__name__)
+
+ parser = argparse.ArgumentParser(description="Collect ESPN scores for a date")
+ parser.add_argument("--date", type=_parse_date, help="Target date (YYYY-MM-DD)")
+ parser.add_argument(
+ "--use-today",
+ action="store_true",
+ help="Use today's date in PST (overrides --date)",
+ )
+ parser.add_argument(
+ "--include-in-progress",
+ action="store_true",
+ help="Also store in-progress games (live scores) with completed=0",
+ )
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to Odds API SQLite database",
+ )
+ parser.add_argument(
+ "--team-mapping",
+ type=Path,
+ default=Path("data/staging/mappings/team_mapping.parquet"),
+ help="Path to team mapping file",
+ )
+ args = parser.parse_args()
+
+ target_date = datetime.now(PST).date() if args.use_today else args.date or _default_date()
+ logger.info("Collecting ESPN scores for %s", target_date.isoformat())
+
+ try:
+ metrics = asyncio.run(
+ _collect_scores_for_date(
+ target_date=target_date,
+ db_path=args.db,
+ team_mapping_path=args.team_mapping,
+ include_in_progress=args.include_in_progress,
+ )
+ )
+ logger.info(
+ "Scores recap: stored=%d updated=%d unmatched=%d in_progress=%d",
+ metrics["stored"],
+ metrics["updated"],
+ metrics["unmatched"],
+ metrics["in_progress"],
+ )
+ return 0
+ except Exception as exc: # noqa: BLE001
+ logger.exception("Scores recap failed: %s", exc)
+ return 1
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
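The event matching above tries canonical (mapped) names first and falls back to raw ESPN names; a standalone sketch of that two-tier lookup (column names copied from the query above, sample rows and team names hypothetical):

```python
import pandas as pd


def match_event_id(events, espn_home, espn_away, mapped_home, mapped_away):
    """Two-tier match mirroring the logic above: canonical names OR raw ESPN names."""
    hit = events[
        ((events["home_team"] == mapped_home) & (events["away_team"] == mapped_away))
        | ((events["home_team"] == espn_home) & (events["away_team"] == espn_away))
    ]
    return None if hit.empty else hit.iloc[0]["event_id"]


events = pd.DataFrame(
    [{"event_id": "e1", "home_team": "Gonzaga", "away_team": "Saint Mary's"}]
)
```

The fallback matters because the ESPN-to-KenPom-to-Odds-API mapping chain can miss a team; unmatched games are only counted, not stored, so the `unmatched` metric is the signal to extend the mapping file.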
diff --git a/scripts/collection/collect_espn_season_results.py b/scripts/collection/collect_espn_season_results.py
new file mode 100644
index 000000000..6f58004ed
--- /dev/null
+++ b/scripts/collection/collect_espn_season_results.py
@@ -0,0 +1,128 @@
+"""Collect ESPN game results for 2026 NCAA Basketball season.
+
+Fetches completed games from ESPN scoreboard API for the full season
+(November 2025 - March 2026) and saves to parquet for matching with Odds API.
+
+Usage:
+    uv run python scripts/collection/collect_espn_season_results.py \\
+ --start 2025-11-01 \\
+ --end 2026-03-31 \\
+ --output data/espn/season_results_2026.parquet
+"""
+
+import argparse
+import asyncio
+import logging
+from datetime import date, datetime
+from pathlib import Path
+
+from sports_betting_edge.adapters.filesystem import write_parquet
+from sports_betting_edge.services.espn_schedule_collection import collect_schedule_range
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+async def collect_season_results(
+ start_date: date,
+ end_date: date,
+ output_path: Path,
+) -> None:
+ """Collect all game results for a date range from ESPN.
+
+ Args:
+ start_date: First date to collect
+ end_date: Last date to collect
+ output_path: Path to save parquet file
+ """
+ logger.info(f"Collecting ESPN results from {start_date} to {end_date}...")
+
+ # Collect games for date range
+ games = await collect_schedule_range(
+ start_date=start_date,
+ end_date=end_date,
+ use_calendar=False, # Check every day for comprehensive coverage
+ )
+
+ logger.info(f"Collected {len(games)} total games")
+
+ # Filter to completed games only (have final scores)
+ completed_games = [g for g in games if g.home_score is not None and g.away_score is not None]
+
+ logger.info(f"Found {len(completed_games)} completed games with scores")
+
+ if not completed_games:
+ logger.warning("No completed games found in date range")
+ return
+
+ # Convert to dict format for parquet
+ games_data = []
+ for game in completed_games:
+ games_data.append(
+ {
+ "game_id": game.game_id,
+ "game_date": game.game_date.isoformat(),
+ "home_team": game.home_team,
+ "away_team": game.away_team,
+ "home_team_id": game.home_team_id,
+ "away_team_id": game.away_team_id,
+ "home_score": game.home_score,
+ "away_score": game.away_score,
+ "status": game.status,
+ "conference_id": game.conference_id,
+ "conference_name": game.conference_name,
+ "venue": game.venue,
+ "neutral_site": game.neutral_site,
+ "captured_at": game.captured_at.isoformat(),
+ }
+ )
+
+ # Save to parquet
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ write_parquet(games_data, output_path)
+
+ logger.info(f"[OK] Saved {len(games_data)} completed games to {output_path}")
+
+ # Show sample statistics
+ logger.info("\n=== Collection Statistics ===")
+ logger.info(f"Date range: {start_date} to {end_date}")
+ logger.info(f"Total games collected: {len(games)}")
+ logger.info(f"Completed games: {len(completed_games)}")
+ logger.info(f"Scheduled/in-progress: {len(games) - len(completed_games)}")
+
+ # Show date distribution
+ if completed_games:
+ dates = [g.game_date.date() for g in completed_games]
+ logger.info(f"First game: {min(dates)}")
+ logger.info(f"Last game: {max(dates)}")
+
+
+def main() -> None:
+ """Collect ESPN season results."""
+ parser = argparse.ArgumentParser(description="Collect ESPN game results for season")
+ parser.add_argument(
+ "--start",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ default=date(2025, 11, 1),
+ help="Start date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end",
+ type=lambda s: datetime.fromisoformat(s).date(),
+ default=date(2026, 3, 31),
+ help="End date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("data/espn/season_results_2026.parquet"),
+ help="Output parquet file path",
+ )
+
+ args = parser.parse_args()
+
+ asyncio.run(collect_season_results(args.start, args.end, args.output))
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/collection/collect_hybrid.py b/scripts/collection/collect_hybrid.py
new file mode 100644
index 000000000..69d4c787d
--- /dev/null
+++ b/scripts/collection/collect_hybrid.py
@@ -0,0 +1,498 @@
+"""Hybrid ESPN + Odds API collection for comprehensive event coverage.
+
+Collection Strategy:
+1. Fetch ALL games from ESPN (comprehensive, free)
+2. Fetch odds from The Odds API (limited to games with betting lines)
+3. Match odds to ESPN events where available
+4. Store all events with source tracking
+
+Result: 100% event coverage + scores, partial odds coverage
+
+Usage:
+ # Collect all upcoming games and recent scores
+    uv run python scripts/collection/collect_hybrid.py
+
+ # Collect specific date range from ESPN
+    uv run python scripts/collection/collect_hybrid.py --espn-days 7
+
+Environment:
+ ODDS_API_KEY: Required - Your Odds API key
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import os
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+from typing import Any
+
+# Ensure log directory exists
+Path("data/logs").mkdir(parents=True, exist_ok=True)
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler("data/logs/hybrid_collection.log"),
+ ],
+)
+logger = logging.getLogger(__name__)
+
+
+def check_api_key() -> str:
+ """Check that ODDS_API_KEY environment variable is set.
+
+ Returns:
+ API key
+
+ Raises:
+ SystemExit: If API key not found
+ """
+ api_key = os.getenv("ODDS_API_KEY")
+ if not api_key:
+ logger.error("ODDS_API_KEY environment variable not set!")
+ logger.error("Set it with: export ODDS_API_KEY='your_key_here'")
+ sys.exit(1)
+ return api_key
+
+
+async def collect_espn_events(db_path: Path, days_forward: int = 7) -> dict[str, Any]:
+ """Collect comprehensive event list from ESPN.
+
+ Args:
+ db_path: Path to SQLite database
+ days_forward: Days forward to collect (default: 7)
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.espn import fetch_scoreboard
+ from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+ from sports_betting_edge.core.event_id import generate_event_id
+ from sports_betting_edge.core.team_mapper import TeamMapper
+
+ logger.info("=" * 80)
+ logger.info("STEP 1: COLLECTING EVENTS FROM ESPN (COMPREHENSIVE)")
+ logger.info("=" * 80)
+
+ # Load team mapper
+ try:
+ mapper = TeamMapper()
+ except FileNotFoundError:
+ logger.error("Team mapping not found. Run scripts/create_team_mapping.py first")
+ return {"events_stored": 0, "error": "No team mapping"}
+
+ db = OddsAPIDatabase(db_path)
+ events_stored = 0
+ events_updated = 0
+
+ try:
+ # Collect upcoming games
+ today = date.today()
+ end_date = today + timedelta(days=days_forward)
+
+ current_date = today
+ while current_date <= end_date:
+ logger.info(f"Fetching ESPN games for {current_date}...")
+
+ try:
+ scoreboard = await fetch_scoreboard(current_date)
+ espn_games = scoreboard.get("events", [])
+
+ logger.info(f" Found {len(espn_games)} games on ESPN")
+
+ for espn_event in espn_games:
+ # Extract game details
+ competitions = espn_event.get("competitions", [])
+ if not competitions:
+ continue
+
+ competition = competitions[0]
+ competitors = competition.get("competitors", [])
+ if len(competitors) != 2:
+ continue
+
+ home_comp = next((c for c in competitors if c.get("homeAway") == "home"), None)
+ away_comp = next((c for c in competitors if c.get("homeAway") == "away"), None)
+
+ if not home_comp or not away_comp:
+ continue
+
+ # Get team names
+ espn_home = home_comp.get("team", {}).get("displayName", "")
+ espn_away = away_comp.get("team", {}).get("displayName", "")
+
+ # Map to Odds API team names
+ kenpom_home = mapper.get_kenpom_name(espn_home, source="espn")
+ kenpom_away = mapper.get_kenpom_name(espn_away, source="espn")
+ odds_home = mapper.get_odds_api_name(kenpom_home)
+ odds_away = mapper.get_odds_api_name(kenpom_away)
+
+ # Get commence time
+ game_date = espn_event.get("date", "")
+ if not game_date:
+ continue
+
+ # Generate deterministic event ID
+ event_id = generate_event_id(odds_home, odds_away, game_date, source="espn")
+
+ # Check if event exists
+ existing = db.conn.execute(
+ "SELECT event_id FROM events WHERE event_id = ?",
+ (event_id,),
+ ).fetchone()
+
+ if existing:
+ # Update existing event
+ db.conn.execute(
+ """
+ UPDATE events
+ SET home_team = ?, away_team = ?, commence_time = ?
+ WHERE event_id = ?
+ """,
+ (odds_home, odds_away, game_date, event_id),
+ )
+ events_updated += 1
+ else:
+ # Insert new event
+ db.conn.execute(
+ """
+ INSERT INTO events
+ (event_id, sport_key, home_team, away_team, commence_time,
+ created_at, source, has_odds)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ odds_home,
+ odds_away,
+ game_date,
+ datetime.now().isoformat(),
+ "espn",
+ 0, # Will be updated if odds found
+ ),
+ )
+ events_stored += 1
+
+ # Check for scores (completed games)
+ status = espn_event.get("status", {})
+ status_type = status.get("type", {})
+ completed = status_type.get("completed", False)
+
+ if completed:
+ home_score = home_comp.get("score")
+ away_score = away_comp.get("score")
+
+ if home_score is not None and away_score is not None:
+ db.conn.execute(
+ """
+ INSERT OR REPLACE INTO scores
+ (event_id, sport_key, completed, home_score, away_score,
+ last_update, fetched_at)
+ VALUES (?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ event_id,
+ "basketball_ncaab",
+ 1,
+ int(home_score),
+ int(away_score),
+ datetime.now().isoformat(),
+ datetime.now().isoformat(),
+ ),
+ )
+
+ db.conn.commit()
+
+ except Exception as e:
+ logger.error(f"Error fetching ESPN games for {current_date}: {e}")
+
+ # Rate limit
+ await asyncio.sleep(0.5)
+ current_date += timedelta(days=1)
+
+ logger.info("[OK] ESPN collection complete")
+ logger.info(f" New events: {events_stored}")
+ logger.info(f" Updated events: {events_updated}")
+
+ return {
+ "events_stored": events_stored,
+ "events_updated": events_updated,
+ }
+
+ finally:
+ db.close()
+
+
+def collect_odds_api_odds(api_key: str, db_path: Path) -> dict[str, Any]:
+ """Collect odds from The Odds API and match to existing events.
+
+ Args:
+ api_key: Odds API key
+ db_path: Path to SQLite database
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+ from sports_betting_edge.core.event_id import generate_event_id
+
+ logger.info("=" * 80)
+ logger.info("STEP 2: COLLECTING ODDS FROM ODDS API (WHERE AVAILABLE)")
+ logger.info("=" * 80)
+
+ base_url = "https://api.the-odds-api.com/v4"
+ sport = "basketball_ncaab"
+ regions = "us,us2"
+ markets = "h2h,spreads,totals"
+
+ url = f"{base_url}/sports/{sport}/odds/"
+ params = {
+ "apiKey": api_key,
+ "regions": regions,
+ "markets": markets,
+ "oddsFormat": "american",
+ }
+
+ logger.info(f"Fetching odds from {url}")
+
+    import httpx
+
+    try:
+
+ with httpx.Client(timeout=30.0) as client:
+ response = client.get(url, params=params)
+ response.raise_for_status()
+
+ remaining = response.headers.get("x-requests-remaining")
+ used = response.headers.get("x-requests-used")
+ logger.info(f"API Quota - Used: {used}, Remaining: {remaining}")
+
+ odds_data = response.json()
+ logger.info(f"Retrieved {len(odds_data)} events with odds")
+
+ if len(odds_data) == 0:
+ return {
+ "odds_api_events": 0,
+ "matched_events": 0,
+ "new_events": 0,
+ "observations": 0,
+ "quota_remaining": remaining,
+ }
+
+ db = OddsAPIDatabase(db_path)
+ try:
+ odds_api_events = 0
+ matched_events = 0
+ new_events = 0
+ observations_stored = 0
+ as_of = datetime.now().isoformat()
+
+ for event in odds_data:
+ odds_api_event_id = event["id"]
+ home_team = event["home_team"]
+ away_team = event["away_team"]
+ commence_time = event["commence_time"]
+
+ # Generate ESPN-style event ID for matching
+ espn_event_id = generate_event_id(
+ home_team, away_team, commence_time, source="espn"
+ )
+
+ # Check if we have this event from ESPN
+ existing_espn = db.conn.execute(
+ "SELECT event_id FROM events WHERE event_id = ?",
+ (espn_event_id,),
+ ).fetchone()
+
+ # Check if we have this event from Odds API
+ existing_odds_api = db.conn.execute(
+ "SELECT event_id FROM events WHERE event_id = ?",
+ (odds_api_event_id,),
+ ).fetchone()
+
+ # Determine which event ID to use
+ if existing_espn:
+ # Use ESPN event ID (preferred - comprehensive coverage)
+ final_event_id = espn_event_id
+ matched_events += 1
+
+ # Update has_odds flag
+ db.conn.execute(
+ "UPDATE events SET has_odds = 1 WHERE event_id = ?",
+ (final_event_id,),
+ )
+ elif existing_odds_api:
+ # Already have this Odds API event
+ final_event_id = odds_api_event_id
+ matched_events += 1
+ else:
+ # New event not in ESPN data (shouldn't happen often)
+ final_event_id = odds_api_event_id
+ new_events += 1
+
+ # Store new event
+ db.conn.execute(
+ """
+ INSERT INTO events
+ (event_id, sport_key, home_team, away_team, commence_time,
+ created_at, source, has_odds)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ final_event_id,
+ sport,
+ home_team,
+ away_team,
+ commence_time,
+ as_of,
+ "odds_api",
+ 1,
+ ),
+ )
+
+ odds_api_events += 1
+
+ # Store odds observations
+ for bookmaker in event.get("bookmakers", []):
+ book_key = bookmaker["key"]
+
+ for market in bookmaker.get("markets", []):
+ market_key = market["key"]
+
+ for outcome in market.get("outcomes", []):
+ outcome_name = outcome["name"]
+ price = outcome.get("price")
+ point = outcome.get("point")
+
+ db.conn.execute(
+ """
+ INSERT INTO observations
+ (event_id, book_key, market_key, outcome_name,
+ price_american, point, as_of, fetched_at, sport_key)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
+ """,
+ (
+ final_event_id,
+ book_key,
+ market_key,
+ outcome_name,
+ price,
+ point,
+ as_of,
+ as_of,
+ sport,
+ ),
+ )
+ observations_stored += 1
+
+ db.conn.commit()
+
+ logger.info("[OK] Odds API collection complete")
+ logger.info(f" Odds API events: {odds_api_events}")
+ logger.info(f" Matched to ESPN: {matched_events}")
+ logger.info(f" New events: {new_events}")
+ logger.info(f" Observations stored: {observations_stored}")
+
+ return {
+ "odds_api_events": odds_api_events,
+ "matched_events": matched_events,
+ "new_events": new_events,
+ "observations": observations_stored,
+ "quota_remaining": remaining,
+ }
+
+ finally:
+ db.close()
+
+    except httpx.HTTPStatusError as e:
+        logger.error(f"HTTP error: {e.response.status_code}")
+        logger.error(f"Response: {e.response.text}")
+        raise
+    except httpx.RequestError as e:
+        logger.error(f"Request error: {e}")
+        raise
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Run hybrid collection (async main).
+
+ Args:
+ args: Command line arguments
+ """
+ # Step 1: Collect comprehensive event list from ESPN
+ espn_metrics = await collect_espn_events(args.db, args.espn_days)
+
+ # Step 2: Collect odds from The Odds API (where available)
+ if not args.skip_odds:
+ api_key = check_api_key()
+ odds_metrics = collect_odds_api_odds(api_key, args.db)
+ else:
+ odds_metrics = {"odds_api_events": 0, "observations": 0}
+
+ # Summary
+ logger.info("")
+ logger.info("=" * 80)
+ logger.info("HYBRID COLLECTION SUMMARY")
+ logger.info("=" * 80)
+ logger.info(f"Timestamp: {datetime.now().isoformat()}")
+ logger.info("")
+ logger.info("ESPN (Comprehensive Event Coverage):")
+ logger.info(f" New events: {espn_metrics.get('events_stored', 0)}")
+ logger.info(f" Updated events: {espn_metrics.get('events_updated', 0)}")
+ logger.info("")
+ logger.info("Odds API (Betting Lines):")
+ logger.info(f" Events with odds: {odds_metrics.get('odds_api_events', 0)}")
+ logger.info(f" Matched to ESPN events: {odds_metrics.get('matched_events', 0)}")
+ logger.info(f" Observations stored: {odds_metrics.get('observations', 0)}")
+ logger.info(f" API quota remaining: {odds_metrics.get('quota_remaining', 'unknown')}")
+ logger.info("")
+ logger.info("[OK] Hybrid collection complete!")
+ logger.info("=" * 80)
+
+
+def main() -> None:
+ """Run hybrid collection (sync wrapper)."""
+ parser = argparse.ArgumentParser(
+ description="Hybrid ESPN + Odds API collection for comprehensive coverage"
+ )
+ parser.add_argument(
+ "--db",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Path to SQLite database",
+ )
+ parser.add_argument(
+ "--espn-days",
+ type=int,
+ default=7,
+ help="Days forward to collect from ESPN (default: 7)",
+ )
+ parser.add_argument(
+ "--skip-odds",
+ action="store_true",
+ help="Skip odds collection (ESPN events only)",
+ )
+
+ args = parser.parse_args()
+
+ logger.info("Starting hybrid collection...")
+ logger.info(f"Database: {args.db}")
+ logger.info("")
+
+ try:
+ asyncio.run(main_async(args))
+ except Exception as e:
+ logger.error(f"Collection failed: {e}", exc_info=True)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
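`collect_hybrid.py` matches Odds API events to ESPN events by regenerating the same deterministic event ID from the team names and tip time (note that both call sites pass `source="espn"` so the IDs collide on purpose). The real implementation lives in `sports_betting_edge.core.event_id`; the sketch below only illustrates the idea, and its normalization choices (lowercasing, date-only truncation, SHA-1 prefix) are assumptions, not the project's actual scheme:

```python
import hashlib


def generate_event_id(home: str, away: str, commence_iso: str, source: str = "espn") -> str:
    """Deterministic ID: identical (home, away, date) from any collector yields the same ID.

    Sketch only -- the real generate_event_id may normalize differently.
    """
    day = commence_iso[:10]  # keep only the date; intraday times drift between sources
    key = f"{home.strip().lower()}|{away.strip().lower()}|{day}"
    return f"{source}_{hashlib.sha1(key.encode()).hexdigest()[:16]}"
```

Because the ID is a pure function of the matchup and date, an Odds API event hashes to the same key as its ESPN counterpart, which is what lets the hybrid collector attach odds to the existing ESPN row instead of inserting a duplicate event.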
diff --git a/scripts/collection/collect_kenpom_for_pipeline.py b/scripts/collection/collect_kenpom_for_pipeline.py
new file mode 100644
index 000000000..0c732b6d5
--- /dev/null
+++ b/scripts/collection/collect_kenpom_for_pipeline.py
@@ -0,0 +1,293 @@
+"""Collect KenPom ratings for the daily data pipeline.
+
+Lightweight script for GitHub Actions workflow that fetches current season
+KenPom ratings and four factors data. Designed for daily automated collection.
+
+Usage:
+ # Collect current season ratings
+    uv run python scripts/collection/collect_kenpom_for_pipeline.py
+
+ # Collect specific season
+    uv run python scripts/collection/collect_kenpom_for_pipeline.py --season 2025
+
+Environment:
+ KENPOM_API_KEY: Required - Your KenPom API key
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+import os
+import sys
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[logging.StreamHandler(sys.stdout)],
+)
+logger = logging.getLogger(__name__)
+
+
+def check_api_key() -> str | None:
+ """Check that KENPOM_API_KEY environment variable is set.
+
+ Returns:
+ API key or None if not found
+ """
+ api_key = os.getenv("KENPOM_API_KEY")
+ if not api_key:
+ logger.warning(
+ "KENPOM_API_KEY environment variable not set. KenPom data collection will be skipped."
+ )
+ logger.warning("To enable: export KENPOM_API_KEY='your_key_here'")
+ return None
+ return api_key
+
+
+async def collect_ratings(
+ output_dir: Path,
+ season: int,
+ api_key: str,
+) -> dict[str, int]:
+ """Collect KenPom ratings for a season.
+
+ Args:
+ output_dir: Directory to write ratings parquet file
+ season: Season year (e.g., 2026)
+ api_key: KenPom API key
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.kenpom import KenPomAdapter
+
+ logger.info("=" * 80)
+ logger.info(f"COLLECTING KENPOM RATINGS - SEASON {season}")
+ logger.info("=" * 80)
+
+ adapter = KenPomAdapter(api_key=api_key)
+
+ try:
+ # Fetch ratings
+ logger.info("Fetching team ratings...")
+ ratings_data = await adapter.get_ratings(season=season)
+
+ if not ratings_data:
+ logger.warning(f"No ratings data returned for season {season}")
+ return {"teams": 0}
+
+ # Convert to DataFrame
+ ratings_df = pd.DataFrame(ratings_data)
+ logger.info(f" Retrieved {len(ratings_df)} teams")
+
+ # Add fetch metadata
+ ratings_df["fetched_at"] = datetime.now().isoformat()
+
+ # Write to parquet
+ output_file = output_dir / f"ratings_{season}.parquet"
+ output_file.parent.mkdir(parents=True, exist_ok=True)
+ ratings_df.to_parquet(output_file, index=False)
+ logger.info(f" Wrote {output_file}")
+
+ # Log sample
+ if len(ratings_df) > 0:
+ top5 = ratings_df.nsmallest(5, "RankAdjEM")[["TeamName", "AdjEM", "RankAdjEM"]]
+ logger.info("\nTop 5 teams by AdjEM:")
+ for _, row in top5.iterrows():
+ logger.info(
+ f" {row['RankAdjEM']:3.0f}. {row['TeamName']:30s} ({row['AdjEM']:+.2f})"
+ )
+
+ return {"teams": len(ratings_df)}
+
+ finally:
+ await adapter.close()
+
+
+async def collect_four_factors(
+ output_dir: Path,
+ season: int,
+ api_key: str,
+) -> dict[str, int]:
+ """Collect KenPom four factors for a season.
+
+ Args:
+ output_dir: Directory to write four factors parquet file
+ season: Season year (e.g., 2026)
+ api_key: KenPom API key
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.kenpom import KenPomAdapter
+
+ logger.info("\nCollecting four factors...")
+
+ adapter = KenPomAdapter(api_key=api_key)
+
+ try:
+ # Fetch four factors
+ ff_data = await adapter.get_four_factors(season=season)
+
+ if not ff_data:
+ logger.warning(f"No four factors data returned for season {season}")
+ return {"teams": 0}
+
+ # Convert to DataFrame
+ ff_df = pd.DataFrame(ff_data)
+ logger.info(f" Retrieved {len(ff_df)} teams")
+
+ # Add fetch metadata
+ ff_df["fetched_at"] = datetime.now().isoformat()
+
+ # Write to parquet
+ output_file = output_dir / f"four-factors_{season}.parquet"
+ output_file.parent.mkdir(parents=True, exist_ok=True)
+ ff_df.to_parquet(output_file, index=False)
+ logger.info(f" Wrote {output_file}")
+
+ return {"teams": len(ff_df)}
+
+ finally:
+ await adapter.close()
+
+
+async def collect_hca(
+ output_dir: Path,
+ season: int,
+) -> dict[str, int]:
+ """Collect KenPom home court advantage data for a season.
+
+ Args:
+ output_dir: Directory to write HCA parquet file
+ season: Season year (e.g., 2026)
+
+ Returns:
+ Collection metrics
+ """
+ from sports_betting_edge.adapters.kenpom import KenPomAdapter
+
+ logger.info("\nCollecting home court advantage data...")
+
+ adapter = KenPomAdapter()
+
+ try:
+ hca_data = adapter.get_hca()
+
+ if not hca_data:
+ logger.warning(f"No HCA data returned for season {season}")
+ return {"teams": 0}
+
+ hca_df = pd.DataFrame(hca_data)
+ logger.info(f" Retrieved HCA for {len(hca_df)} teams")
+
+ hca_df["fetched_at"] = datetime.now().isoformat()
+
+ output_file = output_dir / f"hca_{season}.parquet"
+ output_file.parent.mkdir(parents=True, exist_ok=True)
+ hca_df.to_parquet(output_file, index=False)
+ logger.info(f" Wrote {output_file}")
+
+ return {"teams": len(hca_df)}
+
+ finally:
+ await adapter.close()
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Run KenPom collection (async main).
+
+ Args:
+ args: Command line arguments
+ """
+ # Check for API key
+ api_key = check_api_key()
+ if not api_key:
+ logger.warning("\n[WARNING] Skipping KenPom collection - no API key found")
+ logger.warning(
+ "The pipeline will continue, but predictions may be degraded without KenPom data."
+ )
+ sys.exit(0) # Exit successfully to not fail the pipeline
+
+ start_time = datetime.now()
+
+ try:
+ # Collect ratings
+ ratings_dir = args.kenpom_dir / "ratings" / "season"
+ ratings_metrics = await collect_ratings(
+ output_dir=ratings_dir,
+ season=args.season,
+ api_key=api_key,
+ )
+
+ # Collect four factors
+ ff_dir = args.kenpom_dir / "four-factors" / "season"
+ ff_metrics = await collect_four_factors(
+ output_dir=ff_dir,
+ season=args.season,
+ api_key=api_key,
+ )
+
+ # Collect HCA (uses web scraping, no API key needed)
+ hca_dir = args.kenpom_dir / "hca" / "season"
+ hca_metrics = await collect_hca(
+ output_dir=hca_dir,
+ season=args.season,
+ )
+
+ # Summary
+ elapsed = (datetime.now() - start_time).total_seconds()
+ logger.info("")
+ logger.info("=" * 80)
+ logger.info("KENPOM COLLECTION SUMMARY")
+ logger.info("=" * 80)
+ logger.info(f"Season: {args.season}")
+ logger.info(f"Ratings: {ratings_metrics['teams']} teams")
+ logger.info(f"Four Factors: {ff_metrics['teams']} teams")
+ logger.info(f"HCA: {hca_metrics['teams']} teams")
+ logger.info(f"Elapsed: {elapsed:.1f}s")
+ logger.info("[OK] KenPom collection complete!")
+ logger.info("=" * 80)
+
+ except Exception as e:
+ logger.error(f"KenPom collection failed: {e}", exc_info=True)
+ logger.warning("\n[WARNING] KenPom collection failed but pipeline will continue")
+ logger.warning("Predictions may be degraded without KenPom data.")
+ sys.exit(0) # Exit successfully to not fail the pipeline
+
+
+def main() -> None:
+ """Run KenPom collection (sync wrapper)."""
+ parser = argparse.ArgumentParser(
+ description="Collect KenPom ratings and four factors for pipeline"
+ )
+ parser.add_argument(
+ "--kenpom-dir",
+ type=Path,
+ default=Path("data/kenpom"),
+ help="Path to KenPom data directory (default: data/kenpom)",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="Season year to collect (default: 2026)",
+ )
+
+ args = parser.parse_args()
+
+ logger.info(f"Starting KenPom collection for season {args.season}...")
+ logger.info(f"Output directory: {args.kenpom_dir}")
+ logger.info("")
+
+ asyncio.run(main_async(args))
+
+
+if __name__ == "__main__":
+ main()
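Every collector above stamps its rows with a `fetched_at` ISO timestamp. Downstream, the repository's rolling 5-day freshness SLO calls for failing fast on stale inputs; a minimal guard along these lines could sit in front of training and prediction (the helper name and wiring are illustrative, not part of this patch):

```python
from datetime import datetime, timedelta

import pandas as pd


def assert_fresh(df: pd.DataFrame, max_age_days: int = 5) -> None:
    """Raise if the newest fetched_at row is older than the freshness window."""
    newest = pd.to_datetime(df["fetched_at"]).max()
    # fetched_at is written with datetime.now().isoformat(), i.e. naive local time
    age = datetime.now() - newest.to_pydatetime()
    if age > timedelta(days=max_age_days):
        raise RuntimeError(
            f"Stale inputs: newest row fetched {age.days}d ago (SLO is {max_age_days}d)"
        )
```

Raising (rather than warning) matches the non-negotiables: stale inputs must never silently pass into a training run, and the failure is the trigger for a backfill.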
diff --git a/scripts/collection/collect_open_bets_scheduled.py b/scripts/collection/collect_open_bets_scheduled.py
new file mode 100644
index 000000000..6cd9fd5d6
--- /dev/null
+++ b/scripts/collection/collect_open_bets_scheduled.py
@@ -0,0 +1,69 @@
+"""Collect Overtime.ag open bets and store to Parquet (scheduled run).
+
+Usage:
+    uv run python scripts/collection/collect_open_bets_scheduled.py --user-data-dir PATH
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import sys
+from pathlib import Path
+
+from sports_betting_edge.adapters.overtime import run_capture_open_bets
+from sports_betting_edge.services.overtime_open_bets import collect_open_bets_to_parquet
+from sports_betting_edge.utils.time import utc_now
+
+
+def _log_setup() -> None:
+ log_dir = Path("data") / "logs"
+ log_dir.mkdir(parents=True, exist_ok=True)
+ log_path = log_dir / "open_bets.log"
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+ handlers=[
+ logging.StreamHandler(sys.stdout),
+ logging.FileHandler(log_path, encoding="utf-8"),
+ ],
+ )
+
+
+def _build_output_path() -> Path:
+ stamp = utc_now().strftime("%Y-%m-%d_%H-%M-%S")
+ out_dir = Path("data") / "overtime" / "raw" / "open_bets"
+ out_dir.mkdir(parents=True, exist_ok=True)
+ return out_dir / f"open_bets_{stamp}.json"
+
+
+def main() -> int:
+ _log_setup()
+ logger = logging.getLogger(__name__)
+
+ parser = argparse.ArgumentParser(description="Collect Overtime.ag open bets")
+ parser.add_argument("--user-data-dir", type=str, required=True)
+ parser.add_argument("--headless", type=str, default="true")
+ args = parser.parse_args()
+
+ headless = args.headless.lower() not in {"false", "0", "no"}
+ output_path = _build_output_path()
+
+ try:
+ logger.info("Starting open bets capture")
+ run_capture_open_bets(
+ output_path=output_path,
+ headless=headless,
+ user_data_dir=args.user_data_dir,
+ )
+ count = collect_open_bets_to_parquet(output_path)
+ logger.info("Captured %d open bet(s)", count)
+ return 0
+ except Exception as exc: # noqa: BLE001
+ logger.exception("Open bets collection failed: %s", exc)
+ return 1
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
diff --git a/scripts/collection/collect_overtime_snapshot.py b/scripts/collection/collect_overtime_snapshot.py
new file mode 100644
index 000000000..1e800ceb4
--- /dev/null
+++ b/scripts/collection/collect_overtime_snapshot.py
@@ -0,0 +1,87 @@
+"""Collect Overtime.ag game lines snapshot for the daily pipeline.
+
+One-shot collection of all target sports (College Basketball, College Extra,
+NFL) via the Overtime REST API. Designed to run as a parallel Tier 2 job
+alongside Odds API, KenPom, and ESPN collection.
+
+Requires OV_CUSTOMER_ID and OV_PASSWORD in .env for Playwright auth.
+
+Usage:
+ uv run python scripts/collection/collect_overtime_snapshot.py
+"""
+
+from __future__ import annotations
+
+import asyncio
+import logging
+import sys
+import time
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(levelname)s - %(message)s",
+ handlers=[logging.StreamHandler(sys.stdout)],
+)
+logger = logging.getLogger(__name__)
+
+
+def check_credentials() -> bool:
+ """Check that Overtime credentials are configured.
+
+ Returns:
+ True if credentials are available, False otherwise.
+ """
+ from sports_betting_edge.config.settings import settings
+
+ if not settings.ov_customer_id or not settings.ov_password:
+ logger.warning("OV_CUSTOMER_ID / OV_PASSWORD not set. Overtime collection will be skipped.")
+ return False
+ return True
+
+
+async def main_async() -> None:
+ """Run Overtime.ag snapshot collection."""
+ from sports_betting_edge.config.settings import settings
+ from sports_betting_edge.services.overtime_api_collection import (
+ _parse_target_sports,
+ collect_all_target_sports,
+ )
+
+ targets = _parse_target_sports(settings.overtime_target_sports)
+ target_labels = [f"{s}/{t}" for s, t in targets]
+ logger.info("Targets: %s", ", ".join(target_labels))
+
+ start = time.monotonic()
+ results = await collect_all_target_sports(targets=targets)
+ elapsed = time.monotonic() - start
+
+ # Log per-sport results
+ for key, count in results.items():
+ logger.info(" %s: %d lines", key, count)
+
+ total = sum(results.values())
+ logger.info(
+ "[OK] Overtime snapshot complete! %d total lines in %.1fs",
+ total,
+ elapsed,
+ )
+
+
+def main() -> None:
+ """Run Overtime snapshot collection (sync wrapper)."""
+ logger.info("Starting Overtime.ag snapshot collection...")
+
+ if not check_credentials():
+ logger.warning("[WARNING] Skipping Overtime collection - no credentials")
+ sys.exit(0)
+
+ try:
+ asyncio.run(main_async())
+ except Exception as e:
+ logger.error("Overtime collection failed: %s", e, exc_info=True)
+ logger.warning("[WARNING] Overtime collection failed but pipeline will continue")
+ sys.exit(0) # Don't fail the pipeline
+
+
+if __name__ == "__main__":
+ main()
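The snapshot above collects lines at American prices, as does the Odds API collector earlier in this patch (`price_american` in the observations table). Per the repository rules, moneylines are aggregated as implied probabilities; the standard conversion, for reference (the helper name is illustrative):

```python
def american_to_implied_prob(price: int) -> float:
    """Implied win probability from an American price (the book's vig is still included)."""
    if price < 0:
        return -price / (-price + 100.0)
    return 100.0 / (price + 100.0)
```

With -110 on both sides each side implies about 0.524, so the two probabilities sum past 1.0; that excess (the overround) is why raw implied probabilities are suitable for aggregation but should be de-vigged before being treated as true win probabilities.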
diff --git a/scripts/collection/collect_today_odds.py b/scripts/collection/collect_today_odds.py
new file mode 100644
index 000000000..d944a3e40
--- /dev/null
+++ b/scripts/collection/collect_today_odds.py
@@ -0,0 +1,328 @@
+"""Collect today's NCAAB odds for spreads and totals.
+
+This script fetches current odds from The Odds API and normalizes them according
+to the betting data normalization standards:
+- One canonical spread per game (not separate favorite/underdog rows)
+- One total per game (not separate over/under rows)
+- Proper favorite/underdog decomposition
+
+Usage:
+    uv run python scripts/collection/collect_today_odds.py
+    uv run python scripts/collection/collect_today_odds.py --output data/odds_api/today_odds.parquet
+"""
+
+import asyncio
+import logging
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api import OddsAPIAdapter
+from sports_betting_edge.config.settings import settings
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def normalize_spread_data(
+ event: dict[str, Any], bookmaker: dict[str, Any], market: dict[str, Any]
+) -> dict[str, Any] | None:
+ """Normalize spread market to one canonical row per game.
+
+ Args:
+ event: Event data from API
+ bookmaker: Bookmaker data
+ market: Market data containing outcomes
+
+ Returns:
+ Normalized spread record with single magnitude value, or None if invalid
+ """
+ outcomes = market.get("outcomes", [])
+ if len(outcomes) != 2:
+ logger.warning(f"Expected 2 outcomes for spread, got {len(outcomes)}")
+ return None
+
+ # Find favorite (negative spread) and underdog (positive spread)
+ favorite = None
+ underdog = None
+
+ for outcome in outcomes:
+ point = outcome.get("point")
+ if point is None:
+ continue
+
+ if point < 0:
+ favorite = outcome
+ elif point > 0:
+ underdog = outcome
+ # Skip point == 0 (pick'em) for now - handle separately if needed
+
+ if not favorite or not underdog:
+ logger.warning(f"Could not identify favorite/underdog for event {event['id']}")
+ return None
+
+ # Verify magnitudes match (should be equal)
+ fav_mag = abs(favorite["point"])
+ dog_mag = abs(underdog["point"])
+ if fav_mag != dog_mag:
+ logger.warning(f"Spread magnitude mismatch: {fav_mag} vs {dog_mag} for event {event['id']}")
+
+ return {
+ "event_id": event["id"],
+ "sport_key": event["sport_key"],
+ "commence_time": event["commence_time"],
+ "home_team": event["home_team"],
+ "away_team": event["away_team"],
+ "bookmaker_key": bookmaker["key"],
+ "bookmaker_title": bookmaker["title"],
+ "market_key": "spreads",
+ "favorite_team": favorite["name"],
+ "underdog_team": underdog["name"],
+ "spread_magnitude": fav_mag,
+ "favorite_price": favorite.get("price"),
+ "underdog_price": underdog.get("price"),
+ "captured_at": datetime.now(UTC).isoformat(),
+ }
+
+
+def normalize_total_data(
+ event: dict[str, Any], bookmaker: dict[str, Any], market: dict[str, Any]
+) -> dict[str, Any] | None:
+ """Normalize total market to one canonical row per game.
+
+ Args:
+ event: Event data from API
+ bookmaker: Bookmaker data
+ market: Market data containing outcomes
+
+ Returns:
+ Normalized total record with single total value, or None if invalid
+ """
+ outcomes = market.get("outcomes", [])
+ if len(outcomes) != 2:
+ logger.warning(f"Expected 2 outcomes for total, got {len(outcomes)}")
+ return None
+
+ # Find over and under
+ over = None
+ under = None
+
+ for outcome in outcomes:
+ name = outcome.get("name", "").lower()
+ if name == "over":
+ over = outcome
+ elif name == "under":
+ under = outcome
+
+ if not over or not under:
+ logger.warning(f"Could not identify over/under for event {event['id']}")
+ return None
+
+ # Verify totals match (should be equal)
+ over_total = over.get("point")
+ under_total = under.get("point")
+
+ if over_total is None or under_total is None:
+ logger.warning(f"Missing total value for event {event['id']}")
+ return None
+
+ if over_total != under_total:
+ logger.warning(f"Total mismatch: {over_total} vs {under_total} for event {event['id']}")
+
+ return {
+ "event_id": event["id"],
+ "sport_key": event["sport_key"],
+ "commence_time": event["commence_time"],
+ "home_team": event["home_team"],
+ "away_team": event["away_team"],
+ "bookmaker_key": bookmaker["key"],
+ "bookmaker_title": bookmaker["title"],
+ "market_key": "totals",
+ "total": over_total,
+ "over_price": over.get("price"),
+ "under_price": under.get("price"),
+ "captured_at": datetime.now(UTC).isoformat(),
+ }
+
+
+async def collect_and_normalize_odds() -> tuple[pd.DataFrame, pd.DataFrame]:
+ """Collect and normalize today's NCAAB odds.
+
+ Returns:
+ Tuple of (spreads_df, totals_df)
+ """
+ logger.info("Initializing Odds API adapter...")
+ adapter = OddsAPIAdapter()
+
+ try:
+ logger.info("Fetching NCAAB odds for spreads and totals...")
+ odds_data = await adapter.get_ncaab_odds(markets="spreads,totals")
+
+ logger.info(f"Retrieved {len(odds_data)} events from Odds API")
+
+ # Process each event and bookmaker
+ spread_records = []
+ total_records = []
+
+ for event in odds_data:
+ bookmakers = event.get("bookmakers", [])
+
+ for bookmaker in bookmakers:
+ markets = bookmaker.get("markets", [])
+
+ for market in markets:
+ market_key = market.get("key")
+
+ if market_key == "spreads":
+ record = normalize_spread_data(event, bookmaker, market)
+ if record:
+ spread_records.append(record)
+
+ elif market_key == "totals":
+ record = normalize_total_data(event, bookmaker, market)
+ if record:
+ total_records.append(record)
+
+ spreads_df = pd.DataFrame(spread_records)
+ totals_df = pd.DataFrame(total_records)
+
+ logger.info(f"Normalized {len(spreads_df)} spread records")
+ logger.info(f"Normalized {len(totals_df)} total records")
+
+ # Log quota usage
+ quota_remaining = adapter.get_quota_remaining()
+ quota_used = adapter.get_quota_used()
+ if quota_remaining is not None:
+ logger.info(f"API quota remaining: {quota_remaining:,}")
+ if quota_used is not None:
+ logger.info(f"API quota used: {quota_used:,}")
+
+ return spreads_df, totals_df
+
+ finally:
+ await adapter.close()
+
+
+def validate_normalized_data(spreads_df: pd.DataFrame, totals_df: pd.DataFrame) -> None:
+ """Validate normalized betting data.
+
+ Args:
+ spreads_df: Normalized spreads DataFrame
+ totals_df: Normalized totals DataFrame
+ """
+ logger.info("Validating normalized data...")
+ errors = []
+
+ # Validate spreads
+ if not spreads_df.empty:
+ # Check for extreme spreads
+ extreme = spreads_df[spreads_df["spread_magnitude"] > 50]
+ if len(extreme) > 0:
+ errors.append(f"WARNING: {len(extreme)} extreme spreads (>50 points)")
+
+ # Check spread magnitude is always positive
+ negative = spreads_df[spreads_df["spread_magnitude"] < 0]
+ if len(negative) > 0:
+ errors.append(f"FATAL: {len(negative)} negative spread magnitudes")
+
+ # Check no duplicate spreads per game per bookmaker
+ dupes = spreads_df.groupby(["event_id", "bookmaker_key"]).size()
+ if (dupes > 1).any():
+ errors.append(f"FATAL: Duplicate spreads detected - {(dupes > 1).sum()} games affected")
+
+ # Validate totals
+ if not totals_df.empty:
+ # Check for extreme totals
+ extreme_low = totals_df[totals_df["total"] < 100]
+ extreme_high = totals_df[totals_df["total"] > 200]
+ if len(extreme_low) > 0:
+ errors.append(f"WARNING: {len(extreme_low)} unusually low totals (<100)")
+ if len(extreme_high) > 0:
+ errors.append(f"WARNING: {len(extreme_high)} unusually high totals (>200)")
+
+ # Check no duplicate totals per game per bookmaker
+ dupes = totals_df.groupby(["event_id", "bookmaker_key"]).size()
+ if (dupes > 1).any():
+ errors.append(f"FATAL: Duplicate totals detected - {(dupes > 1).sum()} games affected")
+
+    if errors:
+        for error in errors:
+            if error.startswith("FATAL"):
+                logger.error(error)
+            else:
+                logger.warning(error)
+        fatal = [e for e in errors if e.startswith("FATAL")]
+        if fatal:
+            # Fail fast: canonical-market invariants must never pass silently.
+            raise ValueError(f"Validation failed with {len(fatal)} fatal error(s)")
+    else:
+        logger.info("[OK] All validation checks passed")
+
+
+def main() -> None:
+ """Collect and save today's NCAAB odds."""
+ import sys
+
+    # Optional `--output <file>`: the parent directory of the given file
+    # path becomes the output directory for the parquet files.
+    output_dir = settings.daily_odds_dir
+    if len(sys.argv) > 2 and sys.argv[1] == "--output":
+        output_dir = Path(sys.argv[2]).parent
+
+ logger.info("Starting NCAAB odds collection for today...")
+
+ # Collect and normalize
+ spreads_df, totals_df = asyncio.run(collect_and_normalize_odds())
+
+ if spreads_df.empty and totals_df.empty:
+ logger.warning("No odds data collected - no games scheduled?")
+ return
+
+ # Validate
+ validate_normalized_data(spreads_df, totals_df)
+
+ # Save to parquet
+ output_dir.mkdir(parents=True, exist_ok=True)
+ today = datetime.now().strftime("%Y-%m-%d")
+
+ if not spreads_df.empty:
+ spreads_path = output_dir / f"{today}_spreads.parquet"
+ spreads_df.to_parquet(spreads_path, index=False)
+ logger.info(f"[OK] Saved {len(spreads_df)} spreads to {spreads_path}")
+
+ # Summary stats
+ unique_games = spreads_df["event_id"].nunique()
+ unique_bookmakers = spreads_df["bookmaker_key"].nunique()
+ logger.info(f" {unique_games} unique games, {unique_bookmakers} bookmakers")
+
+ if not totals_df.empty:
+ totals_path = output_dir / f"{today}_totals.parquet"
+ totals_df.to_parquet(totals_path, index=False)
+ logger.info(f"[OK] Saved {len(totals_df)} totals to {totals_path}")
+
+ # Summary stats
+ unique_games = totals_df["event_id"].nunique()
+ unique_bookmakers = totals_df["bookmaker_key"].nunique()
+ logger.info(f" {unique_games} unique games, {unique_bookmakers} bookmakers")
+
+ # Show sample data
+ if not spreads_df.empty:
+ logger.info("\nSample spread data:")
+ sample = spreads_df.head(3)[
+ [
+ "commence_time",
+ "favorite_team",
+ "underdog_team",
+ "spread_magnitude",
+ "bookmaker_title",
+ ]
+ ]
+ print(sample.to_string(index=False))
+
+ if not totals_df.empty:
+ logger.info("\nSample total data:")
+ sample = totals_df.head(3)[
+ ["commence_time", "home_team", "away_team", "total", "bookmaker_title"]
+ ]
+ print(sample.to_string(index=False))
+
+
+if __name__ == "__main__":
+ main()
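A quick illustration of the canonical-spread invariant this collector's validator enforces: one positive `spread_magnitude` plus a `favorite_team` per game/bookmaker, never averaged ±spread rows. This is a hedged sketch — `canonicalize_spread` and the outcome dicts are hypothetical stand-ins for the Odds API payload shape, not the repo's actual `normalize_spread_data`:

```python
def canonicalize_spread(outcomes: list[dict]) -> dict:
    """Collapse a two-outcome spreads market into one canonical record.

    Hypothetical helper: the negative `point` marks the favorite, and the
    stored magnitude is always positive (matching the validator's checks).
    """
    favorite = min(outcomes, key=lambda o: o["point"])
    underdog = max(outcomes, key=lambda o: o["point"])
    return {
        "favorite_team": favorite["name"],
        "underdog_team": underdog["name"],
        "spread_magnitude": abs(favorite["point"]),
    }


record = canonicalize_spread(
    [
        {"name": "Duke", "point": -6.5, "price": -110},
        {"name": "UNC", "point": 6.5, "price": -110},
    ]
)
print(record)
```

Collapsing to the favorite's perspective is what makes the `spread_magnitude < 0` and per-`(event_id, bookmaker_key)` duplicate checks above meaningful.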
diff --git a/scripts/pipeline/.gitkeep b/scripts/pipeline/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/scripts/pipeline/check_pending_games.py b/scripts/pipeline/check_pending_games.py
new file mode 100644
index 000000000..9f5afb14c
--- /dev/null
+++ b/scripts/pipeline/check_pending_games.py
@@ -0,0 +1,90 @@
+"""
+Quick check for pending games that need results entered.
+
+Shows which games still need scores and a summary of tracked performance.
+"""
+
+import logging
+from datetime import datetime
+from pathlib import Path
+
+from betting_tracker import BettingTracker
+
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Check pending games and show progress."""
+ import argparse
+
+ parser = argparse.ArgumentParser(description="Check pending games and tracking progress")
+ parser.add_argument(
+ "--predictions",
+ type=Path,
+ help="Path to predictions CSV (default: today's file)",
+ )
+ parser.add_argument(
+ "--unit-size",
+ type=float,
+ default=100.0,
+ help="Dollar value of 1 unit",
+ )
+
+ args = parser.parse_args()
+
+ logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+
+ # Find predictions file
+ if args.predictions:
+ predictions_path = args.predictions
+ else:
+ # Look for today's predictions
+ today_str = datetime.now().strftime("%Y-%m-%d")
+ predictions_path = Path(f"data/analysis/combined_predictions_{today_str}.csv")
+
+    if not predictions_path.exists():
+        # Note: reference the resolved path here, not today_str, which is
+        # undefined when --predictions is passed explicitly.
+        print(f"[ERROR] Predictions file not found: {predictions_path}")
+        print("       Run: uv run python scripts/deploy_today_predictions.py")
+        return
+
+ # Initialize tracker
+ tracker = BettingTracker(predictions_path, unit_size=args.unit_size)
+
+ # Get pending games
+ pending = tracker.get_pending_games()
+ total_games = len(tracker.predictions_df)
+ graded_games = total_games - len(pending)
+
+ print("\n" + "=" * 70)
+ print("BETTING TRACKER STATUS")
+ print("=" * 70)
+ print(f"File: {predictions_path.name}")
+ print(f"Total Games: {total_games}")
+ print(f"Graded: {graded_games}")
+ print(f"Pending: {len(pending)}")
+ print()
+
+ if len(pending) > 0:
+ print("-" * 70)
+ print("PENDING GAMES")
+ print("-" * 70)
+ for _, game in pending.iterrows():
+ print(f" {game['Game_Time']}: {game['Away_Team']} @ {game['Home_Team']}")
+ print()
+ print("[INFO] To enter results:")
+ print(" uv run python scripts/enter_results.py --show-summary")
+ else:
+ print("[OK] All games have been graded!")
+
+ # Show summary if we have any results
+ if graded_games > 0:
+ print()
+ tracker.print_summary()
+ else:
+ print("\n[INFO] No games graded yet - enter results to see summary")
+
+ print("=" * 70)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/pipeline/check_pipeline_health.py b/scripts/pipeline/check_pipeline_health.py
new file mode 100644
index 000000000..9d55b7b61
--- /dev/null
+++ b/scripts/pipeline/check_pipeline_health.py
@@ -0,0 +1,432 @@
+"""Unified pipeline health check - monitors all data sources and training readiness.
+
+This script consolidates validation checks from Phases 1-5 into a single dashboard.
+Run daily to ensure data pipeline is functioning correctly.
+
+Usage:
+ python scripts/check_pipeline_health.py
+ python scripts/check_pipeline_health.py --output data/outputs/reports/health_check.txt
+ python scripts/check_pipeline_health.py --alert-on-critical
+"""
+
+from __future__ import annotations
+
+import argparse
+import sqlite3
+import sys
+from datetime import datetime, timedelta
+from pathlib import Path
+from typing import Any
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+
+class PipelineHealthChecker:
+ """Monitors data pipeline health across all sources."""
+
+ def __init__(self) -> None:
+ """Initialize health checker."""
+ self.issues: list[dict[str, Any]] = []
+ self.warnings: list[dict[str, Any]] = []
+ self.stats: dict[str, Any] = {}
+
+ def check_data_recency(self) -> None:
+ """Check if data sources are up-to-date."""
+ print("\n[1/6] Data Recency Check")
+ print("-" * 60)
+
+ # KenPom ratings
+ kenpom_file = Path("data/kenpom/ratings/season/ratings_2026.parquet")
+ if kenpom_file.exists():
+ age_hours = (
+ datetime.now() - datetime.fromtimestamp(kenpom_file.stat().st_mtime)
+ ).total_seconds() / 3600
+ status = "[OK]" if age_hours < 24 else "[WARNING]" if age_hours < 48 else "[ERROR]"
+ print(f" KenPom Ratings: {age_hours:.1f}h ago {status}")
+ if age_hours >= 24:
+ self._add_warning("KenPom ratings", f"{age_hours:.1f}h old (should update daily)")
+ self.stats["kenpom_age_hours"] = age_hours
+ else:
+ print(" KenPom Ratings: MISSING [ERROR]")
+ self._add_issue("KenPom ratings", "File missing")
+
+ # Odds API database
+ odds_db = Path("data/odds_api/odds_api.sqlite3")
+ if odds_db.exists():
+ age_hours = (
+ datetime.now() - datetime.fromtimestamp(odds_db.stat().st_mtime)
+ ).total_seconds() / 3600
+ status = "[OK]" if age_hours < 1 else "[WARNING]"
+ print(f" Odds API Database: {age_hours:.1f}h ago {status}")
+ if age_hours >= 1:
+ self._add_warning(
+ "Odds API", f"{age_hours:.1f}h old (real-time collection may be down)"
+ )
+ self.stats["odds_api_age_hours"] = age_hours
+ else:
+ print(" Odds API Database: MISSING [ERROR]")
+ self._add_issue("Odds API database", "File missing")
+
+ # ESPN schedule
+ espn_dir = Path("data/espn/schedule")
+ if espn_dir.exists():
+ files = sorted(
+ espn_dir.glob("*.parquet"), key=lambda x: x.stat().st_mtime, reverse=True
+ )
+ if files:
+ latest = files[0]
+ age_hours = (
+ datetime.now() - datetime.fromtimestamp(latest.stat().st_mtime)
+ ).total_seconds() / 3600
+ status = "[OK]" if age_hours < 24 else "[WARNING]" if age_hours < 48 else "[ERROR]"
+ print(f" ESPN Schedule: {age_hours:.1f}h ago (latest: {latest.stem}) {status}")
+ if age_hours >= 24:
+ self._add_warning(
+ "ESPN schedule", f"{age_hours:.1f}h old (should update daily)"
+ )
+ self.stats["espn_age_hours"] = age_hours
+ else:
+ print(" ESPN Schedule: NO FILES [ERROR]")
+ self._add_issue("ESPN schedule", "No schedule files found")
+ else:
+ print(" ESPN Schedule: MISSING [ERROR]")
+ self._add_issue("ESPN schedule", "Directory missing")
+
+        # ML datasets (use the most recent spreads_* snapshot rather than a
+        # hardcoded date range)
+        ml_dir = Path("data/ml")
+        if ml_dir.exists():
+            ml_files = sorted(
+                ml_dir.glob("spreads_*.parquet"), key=lambda x: x.stat().st_mtime, reverse=True
+            )
+            if ml_files:
+                spreads = ml_files[0]
+                age_hours = (
+                    datetime.now() - datetime.fromtimestamp(spreads.stat().st_mtime)
+                ).total_seconds() / 3600
+                status = "[OK]" if age_hours < 24 else "[WARNING]"
+                print(f"  ML Datasets: {age_hours:.1f}h ago (latest: {spreads.stem}) {status}")
+                self.stats["ml_age_hours"] = age_hours
+            else:
+                print("  ML Datasets: MISSING [WARNING]")
+                self._add_warning("ML datasets", "No spreads_*.parquet files found")
+
+ def check_team_mappings(self) -> None:
+ """Check team mapping coverage."""
+ print("\n[2/6] Team Mapping Coverage")
+ print("-" * 60)
+
+ mapping_path = Path("data/staging/mappings/team_mapping.parquet")
+ if not mapping_path.exists():
+ print(" [ERROR] Team mapping file missing")
+ self._add_issue("Team mappings", "File missing")
+ return
+
+ try:
+ mapping_df = read_parquet_df(str(mapping_path))
+ total_mappings = len(mapping_df)
+ print(f" Total mappings: {total_mappings}")
+ self.stats["total_team_mappings"] = total_mappings
+
+ # Check for unmapped teams in database
+ odds_db = Path("data/odds_api/odds_api.sqlite3")
+ if odds_db.exists():
+ conn = sqlite3.connect(str(odds_db))
+ cursor = conn.cursor()
+ cursor.execute(
+ """SELECT DISTINCT home_team FROM events
+ UNION SELECT DISTINCT away_team FROM events"""
+ )
+ all_teams = {row[0] for row in cursor.fetchall()}
+ conn.close()
+
+ mapped_teams = set(mapping_df["odds_api_name"].dropna().unique())
+ unmapped = all_teams - mapped_teams
+ unmapped_count = len(unmapped)
+
+ if unmapped_count == 0:
+ print(" [OK] All teams mapped")
+ elif unmapped_count <= 5:
+ print(f" [WARNING] {unmapped_count} unmapped teams (likely non-D1)")
+ for team in list(unmapped)[:5]:
+ print(f" - {team}")
+ self._add_warning("Team mappings", f"{unmapped_count} unmapped teams")
+ else:
+ print(f" [ERROR] {unmapped_count} unmapped teams")
+ self._add_issue("Team mappings", f"{unmapped_count} teams unmapped")
+
+ self.stats["unmapped_teams"] = unmapped_count
+ except Exception as e:
+ print(f" [ERROR] Failed to check mappings: {e}")
+ self._add_issue("Team mappings", str(e))
+
+ def check_score_coverage(self) -> None:
+ """Check if scores are being collected for completed games."""
+ print("\n[3/6] Score Collection Coverage")
+ print("-" * 60)
+
+ odds_db = Path("data/odds_api/odds_api.sqlite3")
+ if not odds_db.exists():
+ print(" [ERROR] Odds API database missing")
+ return
+
+ try:
+ conn = sqlite3.connect(str(odds_db))
+ cursor = conn.cursor()
+
+ # Check last 3 days for missing scores
+ today = datetime.now().date()
+ for days_back in range(1, 4):
+ check_date = today - timedelta(days=days_back)
+ date_str = check_date.strftime("%Y-%m-%d")
+
+ cursor.execute(
+ """
+ SELECT COUNT(*) as total,
+ SUM(CASE WHEN s.event_id IS NOT NULL THEN 1 ELSE 0 END) as with_scores
+ FROM events e
+ LEFT JOIN scores s ON e.event_id = s.event_id
+ WHERE DATE(e.commence_time) = ?
+ """,
+ (date_str,),
+ )
+ row = cursor.fetchone()
+ total, with_scores = row if row else (0, 0)
+ with_scores = with_scores or 0
+
+ if total == 0:
+ continue
+
+ coverage_pct = (with_scores / total * 100) if total > 0 else 0
+ missing = total - with_scores
+
+ if coverage_pct >= 95:
+ status = "[OK]"
+ elif coverage_pct >= 80:
+ status = "[WARNING]"
+ else:
+ status = "[ERROR]"
+
+ print(f" {date_str}: {with_scores}/{total} games ({coverage_pct:.0f}%) {status}")
+
+ if missing > 0 and days_back <= 2 and coverage_pct < 95:
+ self._add_issue(
+ "Score collection", f"{missing} games on {date_str} missing scores"
+ )
+
+ conn.close()
+ except Exception as e:
+ print(f" [ERROR] Failed to check scores: {e}")
+ self._add_issue("Score collection", str(e))
+
+ def check_feature_engineering(self) -> None:
+ """Check if feature engineering is working."""
+ print("\n[4/6] Feature Engineering Health")
+ print("-" * 60)
+
+        # Check the most recent ML dataset (avoid hardcoding one date range)
+        ml_files = sorted(
+            Path("data/ml").glob("spreads_*.parquet"),
+            key=lambda p: p.stat().st_mtime,
+            reverse=True,
+        )
+        if not ml_files:
+            print("  [ERROR] No spreads dataset found in data/ml/")
+            self._add_issue("Feature engineering", "No spreads_*.parquet dataset found")
+            return
+        ml_file = ml_files[0]
+
+ try:
+ df = read_parquet_df(str(ml_file))
+ total_features = len(df.columns)
+ total_rows = len(df)
+ nan_pct = (df.isna().sum().sum() / (total_rows * total_features)) * 100
+
+ print(f" Total games: {total_rows}")
+ print(f" Total features: {total_features}")
+ print(f" NaN percentage: {nan_pct:.2f}%")
+
+ self.stats["ml_total_games"] = total_rows
+ self.stats["ml_total_features"] = total_features
+ self.stats["ml_nan_pct"] = nan_pct
+
+ # Check for sharp book data (lowvig)
+ if "sharp_closing_spread" in df.columns:
+ sharp_coverage = df["sharp_closing_spread"].notna().sum() / total_rows * 100
+ if sharp_coverage == 0:
+ print(" [ERROR] Sharp book data: 0% coverage (lowvig missing)")
+ self._add_issue("Sharp book data", "100% missing (lowvig unavailable)")
+ elif sharp_coverage < 50:
+ print(f" [WARNING] Sharp book data: {sharp_coverage:.0f}% coverage")
+ self._add_warning("Sharp book data", f"Only {sharp_coverage:.0f}% coverage")
+ else:
+ print(f" [OK] Sharp book data (lowvig): {sharp_coverage:.0f}% coverage")
+
+ # Check expected feature count
+ if total_features < 50:
+ print(f" [WARNING] Only {total_features} features (expected ~54)")
+ self._add_warning("Features", f"Only {total_features} features generated")
+ else:
+ print(f" [OK] Feature count: {total_features}")
+
+ except Exception as e:
+ print(f" [ERROR] Failed to check features: {e}")
+ self._add_issue("Feature engineering", str(e))
+
+ def check_model_freshness(self) -> None:
+ """Check if models have been retrained recently."""
+ print("\n[5/6] Model Artifact Freshness")
+ print("-" * 60)
+
+ models_dir = Path("models")
+ if not models_dir.exists():
+ print(" [ERROR] models/ directory missing")
+ self._add_issue("Models", "Directory missing")
+ return
+
+ model_files = ["spreads_model.json", "totals_model.json"]
+ for model_name in model_files:
+ model_path = models_dir / model_name
+ if model_path.exists():
+ age_days = (
+ datetime.now() - datetime.fromtimestamp(model_path.stat().st_mtime)
+ ).days
+ if age_days == 0:
+ age_str = "today"
+ status = "[OK]"
+ elif age_days < 7:
+ age_str = f"{age_days}d ago"
+ status = "[OK]"
+ elif age_days < 30:
+ age_str = f"{age_days}d ago"
+ status = "[WARNING]"
+ else:
+ age_str = f"{age_days}d ago"
+ status = "[ERROR]"
+
+ print(f" {model_name:30} {age_str:15} {status}")
+
+ if age_days >= 7:
+ self._add_warning("Model freshness", f"{model_name} is {age_days}d old")
+ else:
+ print(f" {model_name:30} MISSING [ERROR]")
+ self._add_issue("Models", f"{model_name} missing")
+
+ def check_database_health(self) -> None:
+ """Check Odds API database health."""
+ print("\n[6/6] Database Health")
+ print("-" * 60)
+
+ odds_db = Path("data/odds_api/odds_api.sqlite3")
+ if not odds_db.exists():
+ print(" [ERROR] Database file missing")
+ return
+
+ try:
+ size_mb = odds_db.stat().st_size / (1024 * 1024)
+ print(f" Database size: {size_mb:.1f} MB")
+ self.stats["db_size_mb"] = size_mb
+
+ conn = sqlite3.connect(str(odds_db))
+ cursor = conn.cursor()
+
+ # Check table counts
+ cursor.execute("SELECT COUNT(*) FROM events")
+ event_count = cursor.fetchone()[0]
+ print(f" Events: {event_count:,}")
+ self.stats["total_events"] = event_count
+
+ cursor.execute("SELECT COUNT(*) FROM observations")
+ obs_count = cursor.fetchone()[0]
+ print(f" Observations: {obs_count:,}")
+ self.stats["total_observations"] = obs_count
+
+ cursor.execute("SELECT COUNT(*) FROM scores WHERE completed = 1")
+ score_count = cursor.fetchone()[0]
+ print(f" Completed games: {score_count:,}")
+ self.stats["completed_games"] = score_count
+
+ # Check observation density
+ if event_count > 0:
+ obs_per_event = obs_count / event_count
+ if obs_per_event < 1000:
+ print(f" [WARNING] Only {obs_per_event:.0f} obs/event (low granularity)")
+ self._add_warning("Observations", f"Low density: {obs_per_event:.0f} obs/event")
+ else:
+ print(f" [OK] {obs_per_event:.0f} obs/event (good granularity)")
+
+ conn.close()
+
+ except Exception as e:
+ print(f" [ERROR] Database check failed: {e}")
+ self._add_issue("Database", str(e))
+
+ def _add_issue(self, category: str, message: str) -> None:
+ """Add a critical issue."""
+ self.issues.append({"category": category, "message": message})
+
+ def _add_warning(self, category: str, message: str) -> None:
+ """Add a warning."""
+ self.warnings.append({"category": category, "message": message})
+
+ def print_summary(self) -> None:
+ """Print health check summary."""
+ print("\n" + "=" * 60)
+ print("PIPELINE HEALTH SUMMARY")
+ print("=" * 60)
+
+ if len(self.issues) == 0 and len(self.warnings) == 0:
+ print("\n[OK] All checks passed - pipeline is healthy!")
+ else:
+ if self.issues:
+ print(f"\n[ERROR] {len(self.issues)} critical issue(s):")
+ for issue in self.issues:
+ print(f" - {issue['category']}: {issue['message']}")
+
+ if self.warnings:
+ print(f"\n[WARNING] {len(self.warnings)} warning(s):")
+ for warning in self.warnings:
+ print(f" - {warning['category']}: {warning['message']}")
+
+ print()
+
+ def get_exit_code(self) -> int:
+ """Get exit code based on health status."""
+ if self.issues:
+ return 1 # Critical issues
+ elif self.warnings:
+ return 2 # Warnings only
+ else:
+ return 0 # All good
+
+
+def main() -> None:
+ """Run pipeline health check."""
+ parser = argparse.ArgumentParser(description="Check data pipeline health")
+ parser.add_argument("--output", help="Save report to file")
+    parser.add_argument(
+        "--alert-on-critical",
+        action="store_true",
+        help="Exit non-zero: 1 for critical issues, 2 for warnings only",
+    )
+ args = parser.parse_args()
+
+ print("=" * 60)
+ print("SPORTS BETTING DATA PIPELINE HEALTH CHECK")
+ print("=" * 60)
+ print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+
+    checker = PipelineHealthChecker()
+
+    # Run all checks, teeing stdout into a buffer so --output can save the report
+    import io
+
+    buffer = io.StringIO()
+    original_stdout = sys.stdout
+
+    class _Tee:
+        def write(self, text: str) -> int:
+            buffer.write(text)
+            return original_stdout.write(text)
+
+        def flush(self) -> None:
+            original_stdout.flush()
+
+    sys.stdout = _Tee()  # type: ignore[assignment]
+    try:
+        checker.check_data_recency()
+        checker.check_team_mappings()
+        checker.check_score_coverage()
+        checker.check_feature_engineering()
+        checker.check_model_freshness()
+        checker.check_database_health()
+        checker.print_summary()
+    finally:
+        sys.stdout = original_stdout
+
+    # Save the captured report if requested
+    if args.output:
+        output_path = Path(args.output)
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_path.write_text(buffer.getvalue(), encoding="utf-8")
+        print(f"Report saved to {output_path}")
+
+ # Exit with appropriate code
+ if args.alert_on_critical:
+ sys.exit(checker.get_exit_code())
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/pipeline/convert_overtime_json_to_csv.py b/scripts/pipeline/convert_overtime_json_to_csv.py
new file mode 100644
index 000000000..682991436
--- /dev/null
+++ b/scripts/pipeline/convert_overtime_json_to_csv.py
@@ -0,0 +1,95 @@
+"""Convert Overtime JSON odds data to structured CSV format.
+
+Usage:
+    uv run python scripts/pipeline/convert_overtime_json_to_csv.py
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import write_csv
+
+
+def convert_overtime_json_to_csv(
+ json_path: Path,
+ output_path: Path,
+) -> None:
+ """Convert Overtime JSON odds to CSV format.
+
+ Args:
+ json_path: Path to input JSON file
+ output_path: Path to output CSV file
+ """
+ # Read JSON file
+ with open(json_path) as f:
+ data = json.load(f)
+
+ # Extract games list
+ games = data.get("games", [])
+
+ if not games:
+ raise ValueError("No games found in JSON file")
+
+ # Convert to DataFrame
+ df = pd.DataFrame(games)
+
+ # Add metadata columns
+ df["captured_at"] = data.get("captured_at")
+ df["sport"] = data.get("sport")
+
+ # Reorder columns for better readability
+ column_order = [
+ "game_date_str",
+ "game_time_str",
+ "category",
+ "away_team",
+ "home_team",
+ "favorite_team",
+ "spread_magnitude",
+ "spread_favorite_price",
+ "spread_underdog_price",
+ "total_points",
+ "total_over_price",
+ "total_under_price",
+ "away_rotation",
+ "home_rotation",
+ "away_spread_raw",
+ "home_spread_raw",
+ "total_over_raw",
+ "total_under_raw",
+ "raw_matchup",
+ "captured_at",
+ "sport",
+ ]
+
+ # Use only columns that exist
+ df = df[[col for col in column_order if col in df.columns]]
+
+ # Write to CSV using filesystem adapter
+ write_csv(df, str(output_path), index=False)
+
+ print(f"[OK] Converted {len(df)} games to CSV")
+ print(f" Input: {json_path}")
+ print(f" Output: {output_path}")
+ print(f"\nColumns: {', '.join(df.columns)}")
+ print("Games by category:")
+ print(df["category"].value_counts().to_string())
+
+
+def main() -> None:
+ """Main entry point."""
+ json_path = Path("data/overtime/temp_odds.json")
+ output_path = Path("data/overtime/temp_odds.csv")
+
+ if not json_path.exists():
+ raise FileNotFoundError(f"JSON file not found: {json_path}")
+
+ convert_overtime_json_to_csv(json_path, output_path)
+
+
+if __name__ == "__main__":
+ main()
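The converter assumes a top-level object with `captured_at`, `sport`, and a `games` list whose dicts carry the fields in `column_order`. A minimal sketch of that shape and the flattening step (the field values here are made up for illustration):

```python
import pandas as pd

# Hypothetical Overtime payload matching the fields the converter reorders.
payload = {
    "captured_at": "2026-02-06T15:00:00+00:00",
    "sport": "ncaab",
    "games": [
        {
            "game_date_str": "2026-02-06",
            "category": "college-basketball",
            "away_team": "UNC",
            "home_team": "Duke",
            "favorite_team": "Duke",
            "spread_magnitude": 6.5,
            "total_points": 148.5,
        }
    ],
}

df = pd.DataFrame(payload["games"])
# Broadcast the capture metadata onto every game row, as the script does.
df["captured_at"] = payload["captured_at"]
df["sport"] = payload["sport"]
print(df[["away_team", "home_team", "spread_magnitude", "sport"]])
```

Missing per-game fields simply drop out of the CSV, since the script keeps only the `column_order` entries that exist in the DataFrame.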
diff --git a/scripts/pipeline/deploy_predictions.py b/scripts/pipeline/deploy_predictions.py
new file mode 100644
index 000000000..8ae1fe5af
--- /dev/null
+++ b/scripts/pipeline/deploy_predictions.py
@@ -0,0 +1,175 @@
+"""Deploy predictions: generate summary markdown and copy to latest.csv.
+
+Actions:
+- Read the predictions CSV (calibrated if available, else raw)
+- Generate predictions/{date}_summary.md with game count, top edges, model info
+- Copy final CSV to predictions/latest.csv for stable reference
+- Log deployment status
+
+Exit code 0 = success, 1 = failure.
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import shutil
+import sys
+from datetime import date
+from pathlib import Path
+
+import pandas as pd
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(message)s",
+)
+logger = logging.getLogger(__name__)
+
+
+def _parse_prob(value: object) -> float:
+ """Convert probability value to float, handling '74%' style strings."""
+ s = str(value)
+ if s.endswith("%"):
+ return float(s.rstrip("%")) / 100.0
+ return float(s)
+
+
+def _normalize_prob_columns(df: pd.DataFrame) -> pd.DataFrame:
+ """Convert percentage-string probability columns to float [0, 1]."""
+ prob_cols = [
+ "favorite_cover_prob",
+ "underdog_cover_prob",
+ "over_prob",
+ "under_prob",
+ ]
+ df = df.copy()
+ for col in prob_cols:
+ if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
+ df[col] = df[col].apply(_parse_prob)
+ return df
+
+
+def generate_summary(df: pd.DataFrame, target_date: str) -> str:
+ """Generate a markdown summary of the day's predictions."""
+ df = _normalize_prob_columns(df)
+ lines: list[str] = []
+ lines.append(f"# Predictions Summary - {target_date}")
+ lines.append("")
+ lines.append(f"**Games**: {len(df)}")
+ lines.append(f"**Generated**: {target_date}")
+ lines.append("")
+
+ # Spread edges - biggest underdog cover probabilities
+ if "underdog_cover_prob" in df.columns:
+ lines.append("## Top Spread Edges (Underdog Cover)")
+ top_spreads = df.nlargest(5, "underdog_cover_prob")
+ for _, row in top_spreads.iterrows():
+ prob = row["underdog_cover_prob"]
+ home = row.get("home_team", "?")
+ away = row.get("away_team", "?")
+ fav = row.get("favorite_team", "?")
+ mag = row.get("spread_magnitude", 0)
+ lines.append(f"- {away} @ {home} | Fav: {fav} -{mag} | Underdog cover: {prob:.1%}")
+ lines.append("")
+
+ # Totals edges - biggest over/under probabilities
+ if "over_prob" in df.columns:
+ lines.append("## Top Over Edges")
+ top_over = df.nlargest(3, "over_prob")
+ for _, row in top_over.iterrows():
+ prob = row["over_prob"]
+ home = row.get("home_team", "?")
+ away = row.get("away_team", "?")
+ total = row.get("predicted_total", row.get("total_points", 0))
+ lines.append(
+ f"- {away} @ {home} | Predicted total: {total:.1f} | Over prob: {prob:.1%}"
+ )
+ lines.append("")
+
+ lines.append("## Top Under Edges")
+ top_under = df.nlargest(3, "under_prob")
+ for _, row in top_under.iterrows():
+ prob = row["under_prob"]
+ home = row.get("home_team", "?")
+ away = row.get("away_team", "?")
+ total = row.get("predicted_total", row.get("total_points", 0))
+ lines.append(
+ f"- {away} @ {home} | Predicted total: {total:.1f} | Under prob: {prob:.1%}"
+ )
+ lines.append("")
+
+ # Model info
+ lines.append("## Model Info")
+ if "totals_method" in df.columns:
+ methods = df["totals_method"].value_counts()
+ for method, count in methods.items():
+ lines.append(f"- Totals method `{method}`: {count} games")
+ if "total_disconnect" in df.columns:
+ high_disconnect = df[df["total_disconnect"].abs() > 10]
+ if not high_disconnect.empty:
+ lines.append(f"- **{len(high_disconnect)} game(s) with total disconnect > 10 pts**")
+ lines.append("")
+
+ return "\n".join(lines)
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(description="Deploy predictions: summary and latest copy")
+ parser.add_argument(
+ "--date",
+ default=date.today().isoformat(),
+ help="Date in YYYY-MM-DD format (default: today)",
+ )
+ parser.add_argument(
+ "--predictions-dir",
+ default="predictions",
+ help="Predictions directory (default: predictions/)",
+ )
+ args = parser.parse_args()
+
+ predictions_dir = Path(args.predictions_dir)
+
+ # Find the best available CSV
+ calibrated = predictions_dir / f"{args.date}_calibrated.csv"
+ raw = predictions_dir / f"{args.date}.csv"
+
+ if calibrated.exists():
+ csv_path = calibrated
+ logger.info("Using calibrated predictions: %s", csv_path)
+ elif raw.exists():
+ csv_path = raw
+ logger.info("Using raw predictions: %s", csv_path)
+ else:
+ logger.error(
+ "No predictions found for %s in %s",
+ args.date,
+ predictions_dir,
+ )
+ sys.exit(1)
+
+ df = pd.read_csv(csv_path)
+ if df.empty:
+ logger.error("Predictions file is empty")
+ sys.exit(1)
+
+ # Generate summary markdown
+ summary = generate_summary(df, args.date)
+ summary_path = predictions_dir / f"{args.date}_summary.md"
+ summary_path.write_text(summary, encoding="utf-8")
+ logger.info("Summary written to %s", summary_path)
+
+ # Copy to latest.csv
+ latest_path = predictions_dir / "latest.csv"
+ shutil.copy2(csv_path, latest_path)
+ logger.info("Copied %s -> %s", csv_path.name, latest_path)
+
+ logger.info(
+ "Deployment complete: %d games, source=%s",
+ len(df),
+ csv_path.name,
+ )
+
+
+if __name__ == "__main__":
+ main()
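`_parse_prob` accepts both percentage strings and plain numerics, which matters because upstream CSVs format probabilities either way. A standalone copy of the logic for illustration (renamed `parse_prob` here, outside the module):

```python
def parse_prob(value: object) -> float:
    """Accept '74%'-style strings or plain numerics; return a float probability."""
    s = str(value)
    if s.endswith("%"):
        return float(s.rstrip("%")) / 100.0
    return float(s)


assert parse_prob("74%") == 0.74
assert parse_prob("0.74") == 0.74
assert parse_prob(0.74) == 0.74
```

Normalizing before `nlargest` is what keeps the "top edges" ranking correct: string probabilities would otherwise sort lexicographically or raise.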
diff --git a/scripts/pipeline/grade_predictions.py b/scripts/pipeline/grade_predictions.py
new file mode 100644
index 000000000..6e30ccdd8
--- /dev/null
+++ b/scripts/pipeline/grade_predictions.py
@@ -0,0 +1,277 @@
+"""Grade predictions and generate daily performance report.
+
+Runs daily after score collection completes. Grades all pending
+predictions, generates performance metrics, and creates GitHub
+issues if quality thresholds are breached.
+
+Usage:
+ uv run python scripts/pipeline/grade_predictions.py
+ uv run python scripts/pipeline/grade_predictions.py --date 2026-02-06
+ uv run python scripts/pipeline/grade_predictions.py --report-only
+ uv run python scripts/pipeline/grade_predictions.py --create-issue
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import subprocess
+from datetime import date, timedelta
+from pathlib import Path
+from typing import Any
+
+from sports_betting_edge.adapters.odds_api_db import (
+ OddsAPIDatabase,
+)
+from sports_betting_edge.services.prediction_grading import (
+ PredictionGrader,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def create_github_issue(
+ title: str,
+ body: str,
+ labels: list[str] | None = None,
+) -> bool:
+ """Create a GitHub issue via gh CLI.
+
+ Args:
+ title: Issue title.
+ body: Issue body (markdown).
+ labels: Optional list of labels.
+
+ Returns:
+ True if issue created successfully.
+ """
+ cmd = ["gh", "issue", "create", "--title", title, "--body", body]
+ if labels:
+ cmd.extend(["--label", ",".join(labels)])
+
+ try:
+ result = subprocess.run(
+ cmd,
+ capture_output=True,
+ text=True,
+ timeout=30,
+ )
+ if result.returncode == 0:
+ logger.info(
+ "[OK] Created GitHub issue: %s",
+ result.stdout.strip(),
+ )
+ return True
+ logger.warning(
+ "[WARNING] gh issue create failed: %s",
+ result.stderr.strip(),
+ )
+ return False
+ except FileNotFoundError:
+ logger.warning("[WARNING] gh CLI not found - skipping issue")
+ return False
+ except subprocess.TimeoutExpired:
+ logger.warning("[WARNING] gh issue create timed out")
+ return False
+
+
+def main() -> None:
+ """Main entry point for daily grading pipeline."""
+ parser = argparse.ArgumentParser(description="Grade predictions against actual results")
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=None,
+ help="Grade specific date (default: yesterday)",
+ )
+ parser.add_argument(
+ "--report-only",
+ action="store_true",
+ help="Skip grading, just generate report",
+ )
+ parser.add_argument(
+ "--create-issue",
+ action="store_true",
+ help="Create GitHub issue on threshold breach",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+ parser.add_argument(
+ "--predictions-dir",
+ type=Path,
+ default=Path("predictions"),
+ help="Predictions directory",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=Path,
+ default=Path("data/reports"),
+ help="Report output directory",
+ )
+ args = parser.parse_args()
+
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s %(levelname)s %(message)s",
+ )
+
+ # Determine target date
+ if args.date:
+ target_date = args.date
+ else:
+ yesterday = date.today() - timedelta(days=1)
+ target_date = yesterday.isoformat()
+
+ logger.info("=" * 60)
+ logger.info("Prediction Grading Pipeline")
+ logger.info("=" * 60)
+ logger.info("Target date: %s", target_date)
+
+ # Initialize
+ db = OddsAPIDatabase(str(args.db_path))
+ grader = PredictionGrader(db, args.predictions_dir)
+
+ if not args.report_only:
+ # Step 1: Store predictions if CSV exists
+ logger.info("\n--- Step 1: Store Predictions ---")
+ count = grader.store_predictions_from_csv(target_date)
+ if count > 0:
+ logger.info(
+ "[OK] Stored %d predictions for %s",
+ count,
+ target_date,
+ )
+ else:
+ logger.info(
+ "No new predictions to store for %s",
+ target_date,
+ )
+
+ # Step 2: Grade all pending predictions
+ logger.info("\n--- Step 2: Grade Pending ---")
+ result = grader.grade_pending()
+ logger.info(
+ "[OK] Graded %d/%d predictions (%d still pending)",
+ result["graded"],
+ result["total_pending"],
+ result["still_pending"],
+ )
+
+ if result["graded"] > 0:
+ spread_acc = (
+ result["spread_correct"] / result["spread_total"]
+ if result["spread_total"] > 0
+ else 0
+ )
+        total_acc = (
+            result["total_correct"] / result["total_total"]
+            if result["total_total"] > 0
+            else 0
+        )
+ logger.info(
+ " Spreads: %d/%d (%.1f%%)",
+ result["spread_correct"],
+ result["spread_total"],
+ spread_acc * 100,
+ )
+ logger.info(
+ " Totals: %d/%d (%.1f%%)",
+ result["total_correct"],
+ result["total_total"],
+ total_acc * 100,
+ )
+
+ # Step 3: Generate performance reports
+ logger.info("\n--- Step 3: Performance Reports ---")
+
+    # Rolling windows: 7-day, 30-day, season-to-date
+    end = date.fromisoformat(target_date)
+    start_7d = (end - timedelta(days=7)).isoformat()
+    start_30d = (end - timedelta(days=30)).isoformat()
+    start_season = "2025-11-01"  # Season start (hardcoded; update each season)
+
+ reports: dict[str, dict[str, Any]] = {}
+ for label, start in [
+ ("7d", start_7d),
+ ("30d", start_30d),
+ ("season", start_season),
+ ]:
+ report = grader.get_performance_report(start, target_date)
+ reports[label] = report
+ graded = report.get("graded", 0)
+ if graded > 0:
+ logger.info(
+ " %s: %d graded, spread %.1f%%, total %.1f%%",
+ label,
+ graded,
+ report.get("spread_accuracy", 0) * 100,
+ report.get("total_accuracy", 0) * 100,
+ )
+
+ # Step 4: Check quality thresholds
+ logger.info("\n--- Step 4: Quality Thresholds ---")
+ latest_report = reports.get("7d", {})
+ alerts = grader.check_quality_thresholds(latest_report)
+
+ if alerts:
+ logger.warning(
+ "[WARNING] %d quality threshold(s) breached:",
+ len(alerts),
+ )
+ for alert in alerts:
+ logger.warning(" -> %s", alert)
+
+ if args.create_issue:
+ body_lines = [
+ "## Quality Threshold Breach",
+ f"\n**Date:** {target_date}",
+ f"**7-day graded:** {latest_report.get('graded', 0)}",
+ "\n### Alerts",
+ ]
+ for alert in alerts:
+ body_lines.append(f"- {alert}")
+
+ body_lines.append("\n### 7-Day Metrics")
+ for key in [
+ "spread_accuracy",
+ "total_accuracy",
+ "spread_roi",
+ "total_roi",
+ ]:
+ if key in latest_report:
+ val = latest_report[key]
+ if "accuracy" in key:
+ body_lines.append(f"- {key}: {val:.1%}")
+ else:
+ body_lines.append(f"- {key}: {val:.1f}%")
+
+ create_github_issue(
+ title=(f"Model Quality Alert - {target_date}"),
+ body="\n".join(body_lines),
+ labels=["automation", "alert"],
+ )
+ else:
+ logger.info("[OK] All quality thresholds within bounds")
+
+ # Step 5: Save report
+ args.output_dir.mkdir(parents=True, exist_ok=True)
+ report_path = args.output_dir / f"grades_{target_date}.json"
+ output = {
+ "date": target_date,
+ "windows": reports,
+ "alerts": alerts,
+ }
+ with open(report_path, "w") as f:
+ json.dump(output, f, indent=2, default=str)
+ logger.info("\n[OK] Report saved to %s", report_path)
+
+ logger.info("\n" + "=" * 60)
+ logger.info("Grading pipeline complete")
+ logger.info("=" * 60)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/pipeline/model_performance_report.py b/scripts/pipeline/model_performance_report.py
new file mode 100644
index 000000000..750c87057
--- /dev/null
+++ b/scripts/pipeline/model_performance_report.py
@@ -0,0 +1,355 @@
+"""Generate model performance report with drift detection.
+
+Aggregates graded predictions over rolling windows and checks
+for model drift that would trigger retraining.
+
+Usage:
+ uv run python scripts/pipeline/model_performance_report.py
+    uv run python scripts/pipeline/model_performance_report.py --as-of 2026-02-07
+ uv run python scripts/pipeline/model_performance_report.py --alert
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import subprocess
+import sys
+from datetime import date, timedelta
+from pathlib import Path
+from typing import Any
+
+from sports_betting_edge.adapters.odds_api_db import (
+ OddsAPIDatabase,
+)
+from sports_betting_edge.core.prediction_metrics import (
+ detect_model_drift,
+)
+from sports_betting_edge.services.prediction_grading import (
+ PredictionGrader,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def get_model_age_days(models_dir: Path) -> float:
+ """Get age of newest model file in days.
+
+ Args:
+ models_dir: Directory containing model .pkl files.
+
+ Returns:
+ Age in days of the most recently modified model.
+ """
+ model_files = list(models_dir.glob("*_score_2026.pkl"))
+ if not model_files:
+ return 999.0 # No models found
+
+    import time  # stdlib; local import keeps module-level imports light
+
+    newest = max(f.stat().st_mtime for f in model_files)
+    age_seconds = time.time() - newest
+ return age_seconds / 86400.0
+
+
+def load_retraining_triggers(
+ config_path: Path,
+) -> dict[str, float]:
+ """Load retraining triggers from team_config.json.
+
+ Args:
+ config_path: Path to team config file.
+
+ Returns:
+ Dict of threshold names to values.
+ """
+ if not config_path.exists():
+ return {
+ "model_age_days": 7,
+ "auc_drop_threshold": 0.05,
+ "ece_drift_threshold": 0.01,
+ "consecutive_losses_threshold": 7,
+ "win_rate_threshold": 0.52,
+ }
+
+ with open(config_path) as f:
+ config = json.load(f)
+
+ triggers: dict[str, float] = config.get("performance_monitoring", {}).get(
+ "retraining_triggers", {}
+ )
+ triggers["win_rate_threshold"] = config.get("quality_standards", {}).get(
+ "win_rate_threshold", 0.52
+ )
+ return triggers
+
+
+def compute_trend(
+ recent: dict[str, Any],
+ baseline: dict[str, Any],
+) -> dict[str, str]:
+ """Compare recent vs baseline metrics to detect trends.
+
+ Args:
+ recent: 7-day metrics.
+ baseline: 30-day metrics.
+
+ Returns:
+ Dict mapping metric name to trend direction.
+ """
+ trends: dict[str, str] = {}
+
+ for key in [
+ "spread_accuracy",
+ "total_accuracy",
+ ]:
+ r_val = recent.get(key, 0)
+ b_val = baseline.get(key, 0)
+ if b_val == 0:
+ trends[key] = "insufficient_data"
+ continue
+ diff = r_val - b_val
+ if diff > 0.03:
+ trends[key] = "improving"
+ elif diff < -0.03:
+ trends[key] = "degrading"
+ else:
+ trends[key] = "stable"
+
+ for key in ["brier_spread", "brier_total"]:
+ r_val = recent.get(key, 0)
+ b_val = baseline.get(key, 0)
+ if b_val == 0:
+ trends[key] = "insufficient_data"
+ continue
+ # Lower Brier is better
+ diff = r_val - b_val
+ if diff < -0.02:
+ trends[key] = "improving"
+ elif diff > 0.02:
+ trends[key] = "degrading"
+ else:
+ trends[key] = "stable"
+
+ return trends
+
+
+def main() -> None:
+ """Main entry point for performance report."""
+ parser = argparse.ArgumentParser(description="Generate model performance report")
+ parser.add_argument(
+ "--as-of",
+ type=str,
+ default=None,
+ help="Report as-of date (default: today)",
+ )
+ parser.add_argument(
+ "--alert",
+ action="store_true",
+ help="Create GitHub issue on drift detection",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+ parser.add_argument(
+ "--models-dir",
+ type=Path,
+ default=Path("models"),
+ help="Models directory",
+ )
+ parser.add_argument(
+ "--config-path",
+ type=Path,
+ default=Path(".claude/team_config.json"),
+ help="Team config path",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=Path,
+ default=Path("data/reports"),
+ help="Report output directory",
+ )
+ args = parser.parse_args()
+
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s %(levelname)s %(message)s",
+ )
+
+ as_of = date.fromisoformat(args.as_of) if args.as_of else date.today()
+
+ logger.info("=" * 60)
+ logger.info("Model Performance Report")
+ logger.info("=" * 60)
+ logger.info("As of: %s", as_of)
+
+ # Initialize
+ db = OddsAPIDatabase(str(args.db_path))
+ grader = PredictionGrader(db)
+ triggers = load_retraining_triggers(args.config_path)
+
+ # Generate rolling window reports
+ windows: dict[str, dict[str, Any]] = {}
+ for label, days in [
+ ("last_7d", 7),
+ ("last_30d", 30),
+ ("season", 120),
+ ]:
+ start = (as_of - timedelta(days=days)).isoformat()
+ end = as_of.isoformat()
+ report = grader.get_performance_report(start, end)
+ windows[label] = report
+
+ graded = report.get("graded", 0)
+ if graded > 0:
+ logger.info("\n%s (%d games graded):", label, graded)
+ logger.info(
+ " Spread: %.1f%% accuracy, %.1f%% ROI",
+ report.get("spread_accuracy", 0) * 100,
+ report.get("spread_roi", 0),
+ )
+ logger.info(
+ " Total: %.1f%% accuracy, %.1f%% ROI",
+ report.get("total_accuracy", 0) * 100,
+ report.get("total_roi", 0),
+ )
+ if "brier_spread" in report:
+ logger.info(
+ " Brier: spread=%.4f, total=%.4f",
+ report.get("brier_spread", 0),
+ report.get("brier_total", 0),
+ )
+ if "ece_spread" in report:
+ logger.info(
+ " ECE: spread=%.4f, total=%.4f",
+ report.get("ece_spread", 0),
+ report.get("ece_total", 0),
+ )
+ else:
+ logger.info("\n%s: No graded predictions", label)
+
+ # Trend detection
+ recent = windows.get("last_7d", {})
+ baseline = windows.get("last_30d", {})
+ trends = compute_trend(recent, baseline)
+
+ logger.info("\n--- Trends (7d vs 30d) ---")
+ for metric, direction in trends.items():
+ indicator = {
+ "improving": "[OK]",
+ "stable": "[OK]",
+ "degrading": "[WARNING]",
+ "insufficient_data": "[-]",
+ }.get(direction, "[-]")
+ logger.info(" %s %s: %s", indicator, metric, direction)
+
+ # Model drift detection
+ model_age = get_model_age_days(args.models_dir)
+ recent_with_age = {**recent, "model_age_days": model_age}
+
+ drift_alerts = detect_model_drift(
+ recent_metrics=recent_with_age,
+ baseline_metrics=baseline,
+ thresholds=triggers,
+ )
+
+ retraining_recommended = len(drift_alerts) > 0
+
+ logger.info("\n--- Drift Detection ---")
+ if drift_alerts:
+ logger.warning("[WARNING] %d drift alert(s):", len(drift_alerts))
+ for alert in drift_alerts:
+ logger.warning(" -> %s", alert)
+ logger.warning("[WARNING] Retraining recommended!")
+ else:
+ logger.info("[OK] No model drift detected")
+
+ logger.info(" Model age: %.1f days", model_age)
+
+ # Accuracy by confidence
+ if "accuracy_by_confidence" in recent:
+ logger.info("\n--- Accuracy by Confidence ---")
+ for bucket, stats in recent["accuracy_by_confidence"].items():
+ if stats["count"] > 0:
+ logger.info(
+ " %s: %d games, %.1f%% accuracy",
+ bucket,
+ stats["count"],
+ stats["accuracy"] * 100,
+ )
+
+ # Save report
+ args.output_dir.mkdir(parents=True, exist_ok=True)
+ report_path = args.output_dir / f"performance_{as_of.isoformat()}.json"
+
+ output = {
+ "report_date": as_of.isoformat(),
+ "model_version": "score_v1",
+ "model_age_days": round(model_age, 1),
+ "windows": windows,
+ "trends": trends,
+ "drift_alerts": drift_alerts,
+ "retraining_recommended": retraining_recommended,
+ }
+
+ with open(report_path, "w") as f:
+ json.dump(output, f, indent=2, default=str)
+ logger.info("\n[OK] Report saved to %s", report_path)
+
+ # Create GitHub issue if drift detected
+ if drift_alerts and args.alert:
+ body_lines = [
+ "## Model Drift Detected",
+ f"\n**Date:** {as_of.isoformat()}",
+ f"**Model age:** {model_age:.1f} days",
+ "\n### Drift Alerts",
+ ]
+ for alert in drift_alerts:
+ body_lines.append(f"- {alert}")
+
+ body_lines.append("\n### Trends (7d vs 30d)")
+ for metric, direction in trends.items():
+ body_lines.append(f"- {metric}: {direction}")
+
+ body_lines.append("\n### Recommendation")
+ body_lines.append("Retrain models using:")
+ body_lines.append("```bash")
+ body_lines.append("uv run python scripts/training/train_score_models.py")
+ body_lines.append("```")
+
+        try:
+            result = subprocess.run(
+                [
+                    "gh",
+                    "issue",
+                    "create",
+                    "--title",
+                    f"Model Drift Alert - {as_of.isoformat()}",
+                    "--body",
+                    "\n".join(body_lines),
+                    "--label",
+                    "automation,alert",
+                ],
+                capture_output=True,
+                text=True,
+                timeout=30,
+            )
+            if result.returncode == 0:
+                logger.info("[OK] Created GitHub drift alert")
+            else:
+                logger.warning("[WARNING] gh issue create failed: %s", result.stderr.strip())
+        except (FileNotFoundError, subprocess.TimeoutExpired):
+            logger.warning("[WARNING] Could not create GitHub issue")
+
+ logger.info("\n" + "=" * 60)
+ logger.info("Performance report complete")
+ logger.info("=" * 60)
+
+ # Exit with non-zero if retraining recommended
+ if retraining_recommended:
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
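`compute_trend` above applies a symmetric ±0.03 band to accuracy metrics and a flipped ±0.02 band to Brier scores (where lower is better). A minimal pure-Python sketch of that classification, for reference when tuning the bands (the `trend` helper here is illustrative, not part of the module):

```python
def trend(recent: float, baseline: float, tol: float = 0.03,
          higher_is_better: bool = True) -> str:
    """Classify a 7d-vs-30d metric move; a zero baseline means no data."""
    if baseline == 0:
        return "insufficient_data"
    diff = recent - baseline
    if not higher_is_better:
        diff = -diff  # for Brier, a decrease is an improvement
    if diff > tol:
        return "improving"
    if diff < -tol:
        return "degrading"
    return "stable"

print(trend(0.58, 0.54))                                    # improving
print(trend(0.50, 0.54))                                    # degrading
print(trend(0.21, 0.24, tol=0.02, higher_is_better=False))  # improving (Brier fell)
```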
diff --git a/scripts/pipeline/schedule_overtime_collection.py b/scripts/pipeline/schedule_overtime_collection.py
new file mode 100644
index 000000000..412104117
--- /dev/null
+++ b/scripts/pipeline/schedule_overtime_collection.py
@@ -0,0 +1,325 @@
+"""Scheduled Overtime.ag odds collection with timezone-aware game time checks.
+
+Collects odds at 10-minute intervals starting at 4 AM PST, continuing until
+each game starts. Handles timezone conversions between:
+- Overtime.ag: US Eastern (America/New_York)
+- ESPN: US Pacific (America/Los_Angeles)
+- System: UTC
+
+Usage:
+    # Run once (check if it's time to collect)
+    uv run python scripts/pipeline/schedule_overtime_collection.py --once
+
+    # Run continuously (daemon mode)
+    uv run python scripts/pipeline/schedule_overtime_collection.py --daemon
+
+    # Dry run (show what would happen)
+    uv run python scripts/pipeline/schedule_overtime_collection.py --dry-run
+"""
+
+import argparse
+import json
+import logging
+import subprocess
+import time
+from datetime import UTC, datetime, timedelta
+from datetime import time as dt_time
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+# Timezone definitions
+PST = ZoneInfo("America/Los_Angeles")
+EST = ZoneInfo("America/New_York")
+
+# Schedule configuration
+COLLECTION_START_TIME = dt_time(4, 0) # 4:00 AM PST
+COLLECTION_INTERVAL_MINUTES = 10
+STOP_BEFORE_GAME_MINUTES = 5 # Stop collecting 5 min before tipoff
+
+
+def get_current_time_pst() -> datetime:
+ """Get current time in PST."""
+ return datetime.now(PST)
+
+
+def parse_overtime_game_time(game_time_str: str, game_date_str: str) -> datetime | None:
+ """Parse Overtime.ag game time strings to datetime in PST.
+
+ Overtime.ag shows times like "7:00 PM" (EST) with dates like "Sat Feb 1".
+
+ Args:
+ game_time_str: Time string (e.g., "7:00 PM")
+ game_date_str: Date string (e.g., "Sat Feb 1")
+
+ Returns:
+ Game datetime in PST, or None if parsing fails
+ """
+ try:
+        # Parse the date and time in EST (assumes the game falls in the
+        # current calendar year; dates near the new year may mis-parse)
+        current_year = datetime.now().year
+ date_with_year = f"{game_date_str} {current_year}"
+
+ # Parse datetime string
+ datetime_str = f"{date_with_year} {game_time_str}"
+
+ # Try common formats
+ for fmt in [
+ "%a %b %d %Y %I:%M %p", # "Sat Feb 1 2026 7:00 PM"
+ "%a %b %d %Y %I %p", # "Sat Feb 1 2026 7 PM"
+ ]:
+ try:
+ # Parse as EST, then convert to PST
+ dt_est = datetime.strptime(datetime_str, fmt).replace(tzinfo=EST)
+ dt_pst = dt_est.astimezone(PST)
+ return dt_pst
+ except ValueError:
+ continue
+
+ logger.warning(f"Could not parse game time: {game_time_str} {game_date_str}")
+ return None
+
+ except Exception as e:
+ logger.warning(f"Error parsing game time: {e}")
+ return None
+
+
+def should_collect_now() -> bool:
+ """Check if current time is within collection window (4 AM - 11:59 PM PST)."""
+ now = get_current_time_pst()
+ current_time = now.time()
+
+ # Collect between 4 AM and midnight
+    return current_time >= COLLECTION_START_TIME
+
+
+def get_upcoming_games(overtime_data_path: Path) -> list[dict]:
+ """Load most recent Overtime data and filter for upcoming games.
+
+ Args:
+ overtime_data_path: Path to most recent Overtime parquet file
+
+ Returns:
+ List of games that haven't started yet
+ """
+ try:
+ df = read_parquet_df(str(overtime_data_path))
+ now_pst = get_current_time_pst()
+
+ upcoming = []
+ for _, row in df.iterrows():
+ game_time = parse_overtime_game_time(
+ row.get("game_time_str", ""), row.get("game_date_str", "")
+ )
+
+ if game_time and game_time > now_pst:
+ # Game hasn't started yet
+ time_until_game = (game_time - now_pst).total_seconds() / 60
+ upcoming.append(
+ {
+ "home_team": row["home_team"],
+ "away_team": row["away_team"],
+ "game_time_pst": game_time,
+ "minutes_until_game": time_until_game,
+ }
+ )
+
+ return upcoming
+
+ except Exception as e:
+ logger.error(f"Error loading upcoming games: {e}")
+ return []
+
+
+def run_overtime_collection() -> bool:
+ """Run Puppeteer script to collect Overtime.ag odds.
+
+ Returns:
+ True if collection succeeded, False otherwise
+ """
+ try:
+ # Generate output filename with timestamp
+ now = get_current_time_pst()
+ date_str = now.strftime("%Y-%m-%d")
+ time_str = now.strftime("%H%M")
+
+ output_dir = Path("data/overtime/snapshots")
+ output_dir.mkdir(parents=True, exist_ok=True)
+ output_file = output_dir / f"{date_str}_{time_str}.json"
+
+ # Run Puppeteer scraper
+ puppeteer_script = Path("puppeteer/capture_overtime_college_basketball_odds.js")
+ cmd = ["node", str(puppeteer_script), "--output", str(output_file)]
+
+ logger.info(f"Collecting Overtime.ag odds at {now.strftime('%I:%M %p PST')}...")
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
+
+ if result.returncode != 0:
+ logger.error(f"Collection failed: {result.stderr}")
+ return False
+
+ # Check if any games were collected
+ with open(output_file) as f:
+ data = json.load(f)
+ game_count = data.get("game_count", 0)
+
+ if game_count == 0:
+ logger.warning("No games found (may be too early or all games finished)")
+ output_file.unlink() # Remove empty file
+ return False
+
+ logger.info(f"[OK] Collected {game_count} games -> {output_file.name}")
+ return True
+
+ except subprocess.TimeoutExpired:
+ logger.error("Collection timed out after 60 seconds")
+ return False
+ except Exception as e:
+ logger.error(f"Collection error: {e}")
+ return False
+
+
+def wait_until_next_interval(interval_minutes: int = COLLECTION_INTERVAL_MINUTES) -> None:
+ """Wait until the next collection interval.
+
+ Args:
+ interval_minutes: Minutes between collections
+ """
+ now = get_current_time_pst()
+
+    # Round up to the next interval boundary; the result is always in
+    # 1..interval_minutes, so landing exactly on a boundary waits a full interval.
+    minutes_since_hour = now.minute
+    minutes_until_next = interval_minutes - (minutes_since_hour % interval_minutes)
+
+ next_collection = now + timedelta(minutes=minutes_until_next)
+ next_collection = next_collection.replace(second=0, microsecond=0)
+
+ wait_seconds = (next_collection - now).total_seconds()
+
+ logger.info(
+ f"Next collection at {next_collection.strftime('%I:%M %p PST')} "
+ f"({int(wait_seconds / 60)} minutes)"
+ )
+
+ time.sleep(wait_seconds)
+
+
+def run_once(dry_run: bool = False) -> None:
+ """Run single collection check.
+
+ Args:
+ dry_run: If True, only show what would happen without collecting
+ """
+ if not should_collect_now():
+ now = get_current_time_pst()
+ logger.info(
+ f"Outside collection window (current: {now.strftime('%I:%M %p PST')}, "
+ f"starts: {COLLECTION_START_TIME.strftime('%I:%M %p')})"
+ )
+ return
+
+ if dry_run:
+ now = get_current_time_pst()
+ logger.info(f"[DRY RUN] Would collect at {now.strftime('%I:%M %p PST')}")
+
+ # Check for upcoming games
+ overtime_dir = Path("data/overtime")
+ parquet_files = list(overtime_dir.glob("*.parquet"))
+ if parquet_files:
+ latest = max(parquet_files, key=lambda p: p.stat().st_mtime)
+ upcoming = get_upcoming_games(latest)
+ logger.info(f"Found {len(upcoming)} upcoming games")
+ for game in upcoming[:5]:
+ logger.info(
+ f" {game['away_team']} @ {game['home_team']} "
+ f"(in {int(game['minutes_until_game'])} min)"
+ )
+ return
+
+ # Run actual collection
+ run_overtime_collection()
+
+
+def run_daemon() -> None:
+ """Run continuous collection daemon.
+
+    Collects every 10 minutes from 4 AM PST until midnight. The per-game
+    cutoff (STOP_BEFORE_GAME_MINUTES) is not yet enforced by this loop.
+    """
+    logger.info("Starting Overtime.ag collection daemon")
+    logger.info(f"Schedule: Every {COLLECTION_INTERVAL_MINUTES} minutes from 4:00 AM PST")
+
+ while True:
+ try:
+ if should_collect_now():
+ run_overtime_collection()
+ wait_until_next_interval()
+ else:
+ # Wait until 4 AM
+ now = get_current_time_pst()
+ next_start = now.replace(
+ hour=COLLECTION_START_TIME.hour,
+ minute=COLLECTION_START_TIME.minute,
+ second=0,
+ microsecond=0,
+ )
+
+ if next_start <= now:
+ # Already past 4 AM today, wait until tomorrow
+ next_start += timedelta(days=1)
+
+ wait_seconds = (next_start - now).total_seconds()
+ logger.info(
+ f"Waiting until {next_start.strftime('%I:%M %p PST')} "
+ f"({int(wait_seconds / 3600)} hours)"
+ )
+ time.sleep(wait_seconds)
+
+ except KeyboardInterrupt:
+ logger.info("Daemon stopped by user")
+ break
+ except Exception as e:
+ logger.error(f"Daemon error: {e}")
+ time.sleep(60) # Wait 1 minute before retrying
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(description="Scheduled Overtime.ag odds collection")
+ parser.add_argument(
+ "--once",
+ action="store_true",
+ help="Run single collection check (for cron)",
+ )
+ parser.add_argument(
+ "--daemon",
+ action="store_true",
+ help="Run continuous daemon mode",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would happen without collecting",
+ )
+
+ args = parser.parse_args()
+
+ if args.daemon:
+ run_daemon()
+ elif args.once or args.dry_run:
+ run_once(dry_run=args.dry_run)
+ else:
+ parser.print_help()
+
+
+if __name__ == "__main__":
+ main()
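The interval rounding inside `wait_until_next_interval` always yields a value in the range 1..interval, so a run that finishes exactly on a boundary waits a full interval before collecting again. A standalone sketch of that arithmetic (the `minutes_until_next` helper is illustrative only):

```python
def minutes_until_next(minute: int, interval: int = 10) -> int:
    # Round up to the next interval boundary; result is always in 1..interval,
    # so finishing exactly on a boundary waits one full interval.
    return interval - (minute % interval)

print(minutes_until_next(7))   # 3  (:07 -> :10)
print(minutes_until_next(10))  # 10 (:10 -> :20, we just collected)
print(minutes_until_next(59))  # 1  (:59 -> :00)
```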
diff --git a/scripts/pipeline/validate_prediction_quality.py b/scripts/pipeline/validate_prediction_quality.py
new file mode 100644
index 000000000..38e1b6890
--- /dev/null
+++ b/scripts/pipeline/validate_prediction_quality.py
@@ -0,0 +1,167 @@
+"""Validate that daily predictions CSV is reasonable before deployment.
+
+Checks:
+- File exists and is non-empty
+- Expected columns present
+- Game count is reasonable (warn if < 3 or > 30)
+- Score ranges are plausible (30-120 per team)
+- Probabilities in [0, 1]
+- No NaN values in critical columns
+- Total points between 80 and 200
+
+Exit code 0 = pass, 1 = fail.
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import sys
+from datetime import date
+from pathlib import Path
+
+import pandas as pd
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s [%(levelname)s] %(message)s",
+)
+logger = logging.getLogger(__name__)
+
+REQUIRED_COLUMNS = [
+ "home_team",
+ "away_team",
+ "predicted_home_score",
+ "predicted_away_score",
+ "predicted_margin",
+]
+
+SCORE_COLUMNS = ["predicted_home_score", "predicted_away_score"]
+PROB_COLUMNS = [
+ "favorite_cover_prob",
+ "underdog_cover_prob",
+ "over_prob",
+ "under_prob",
+]
+TOTAL_COLUMNS = ["predicted_total", "score_derived_total"]
+
+MIN_SCORE = 30
+MAX_SCORE = 120
+MIN_TOTAL = 80
+MAX_TOTAL = 200
+
+
+def _parse_prob_column(series: pd.Series[str]) -> pd.Series[float]:
+ """Convert probability column to float, handling '74%' style strings."""
+ if len(series) > 0 and "%" in str(series.iloc[0]):
+ return series.str.rstrip("%").astype(float) / 100.0
+ return pd.to_numeric(series, errors="coerce")
+
+
+def validate(csv_path: Path) -> list[str]:
+ """Run all validation checks and return list of failure messages."""
+ failures: list[str] = []
+
+ if not csv_path.exists():
+ failures.append(f"File not found: {csv_path}")
+ return failures
+
+ df = pd.read_csv(csv_path)
+
+ if df.empty:
+ failures.append("Predictions file is empty (0 rows)")
+ return failures
+
+ # Column check
+ missing = set(REQUIRED_COLUMNS) - set(df.columns)
+ if missing:
+ failures.append(f"Missing required columns: {missing}")
+ return failures
+
+ game_count = len(df)
+ logger.info("Game count: %d", game_count)
+ if game_count < 3:
+ logger.warning("Low game count: %d (expected >= 3)", game_count)
+ if game_count > 30:
+ logger.warning("High game count: %d (expected <= 30)", game_count)
+
+ # NaN check on critical columns (before type conversion)
+ critical = REQUIRED_COLUMNS + [c for c in PROB_COLUMNS if c in df.columns]
+ nan_counts = df[critical].isna().sum()
+ nan_cols = nan_counts[nan_counts > 0]
+ if not nan_cols.empty:
+ failures.append(f"NaN values in critical columns: {nan_cols.to_dict()}")
+
+ # Score range
+ for col in SCORE_COLUMNS:
+ if col not in df.columns:
+ continue
+ lo = df[col].min()
+ hi = df[col].max()
+ if lo < MIN_SCORE:
+ failures.append(f"{col} has value {lo:.1f} below minimum {MIN_SCORE}")
+ if hi > MAX_SCORE:
+ failures.append(f"{col} has value {hi:.1f} above maximum {MAX_SCORE}")
+
+ # Probability range [0, 1] - handles both float and "74%" string formats
+ for col in PROB_COLUMNS:
+ if col not in df.columns:
+ continue
+ parsed = _parse_prob_column(df[col].dropna())
+ if parsed.empty:
+ continue
+ lo = parsed.min()
+ hi = parsed.max()
+ if lo < 0.0 or hi > 1.0:
+ failures.append(f"{col} out of [0, 1] range: [{lo:.4f}, {hi:.4f}]")
+
+ # Total range
+ for col in TOTAL_COLUMNS:
+ if col not in df.columns:
+ continue
+ lo = df[col].min()
+ hi = df[col].max()
+ if lo < MIN_TOTAL:
+ failures.append(f"{col} has value {lo:.1f} below minimum {MIN_TOTAL}")
+ if hi > MAX_TOTAL:
+ failures.append(f"{col} has value {hi:.1f} above maximum {MAX_TOTAL}")
+
+ return failures
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(description="Validate daily prediction quality")
+ parser.add_argument(
+ "--date",
+ default=date.today().isoformat(),
+ help="Date in YYYY-MM-DD format (default: today)",
+ )
+ parser.add_argument(
+ "--predictions-dir",
+ default="predictions",
+ help="Predictions directory (default: predictions/)",
+ )
+ args = parser.parse_args()
+
+ predictions_dir = Path(args.predictions_dir)
+
+ # Try calibrated first, then raw
+ calibrated = predictions_dir / f"{args.date}_calibrated.csv"
+ raw = predictions_dir / f"{args.date}.csv"
+ csv_path = calibrated if calibrated.exists() else raw
+
+ logger.info("Validating: %s", csv_path)
+ failures = validate(csv_path)
+
+ if failures:
+ for f in failures:
+ logger.error("FAIL: %s", f)
+ logger.error("Validation FAILED with %d issue(s)", len(failures))
+ sys.exit(1)
+ else:
+ logger.info("Validation PASSED")
+ sys.exit(0)
+
+
+if __name__ == "__main__":
+ main()
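`_parse_prob_column` accepts either numeric columns or "74%"-style strings. The same normalization, sketched without pandas (a hypothetical `parse_probs` helper, assuming a column is uniformly one format or the other, as the script does):

```python
def parse_probs(values: list) -> list[float]:
    # Mirror _parse_prob_column: '%'-suffixed strings are scaled to [0, 1],
    # anything else is coerced to float as-is.
    if values and "%" in str(values[0]):
        return [float(str(v).rstrip("%")) / 100.0 for v in values]
    return [float(v) for v in values]

print(parse_probs(["74%", "26%"]))  # [0.74, 0.26]
print(parse_probs([0.61, 0.39]))    # [0.61, 0.39]
```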
diff --git a/scripts/prediction/.gitkeep b/scripts/prediction/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/scripts/prediction/apply_calibration_fix.py b/scripts/prediction/apply_calibration_fix.py
new file mode 100644
index 000000000..50cf0a9a5
--- /dev/null
+++ b/scripts/prediction/apply_calibration_fix.py
@@ -0,0 +1,230 @@
+"""
+Apply calibration bias correction to score model predictions.
+
+This script adds a +4.5 point adjustment to total predictions to correct
+for systematic underprediction identified on 2026-02-07.
+
+Usage:
+ uv run python scripts/prediction/apply_calibration_fix.py \\
+ --input predictions/2026-02-07_raw.csv \\
+ --output predictions/2026-02-07_calibrated.csv
+
+See: docs/MODEL_CALIBRATION_FINDINGS.md for analysis
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+logger = logging.getLogger(__name__)
+
+# Calibration constants (validated on 2026-02-07)
+TOTAL_BIAS_CORRECTION = 4.5 # Model underpredicts by ~4.5 points
+SPREAD_BIAS_CORRECTION = 0.0 # Spread predictions appear unbiased
+
+# Validation thresholds
+MAX_KENPOM_DIFF = 15.0 # Flag if >15 pts from KenPom
+MAX_MARKET_DIFF = 10.0 # Flag if >10 pts from market
+MAX_RECENT_DIFF = 8.0 # Flag if >8 pts from recent avg
+
+
+def apply_calibration(
+ df: pd.DataFrame,
+ total_bias: float = TOTAL_BIAS_CORRECTION,
+) -> pd.DataFrame:
+ """
+ Apply bias correction to predictions.
+
+ Args:
+ df: Predictions DataFrame with predicted_total column
+ total_bias: Points to add to total predictions
+
+ Returns:
+ DataFrame with calibrated predictions and warnings
+ """
+ df = df.copy()
+
+ # Store original predictions
+ df["predicted_total_raw"] = df["predicted_total"]
+ df["predicted_home_score_raw"] = df["predicted_home_score"]
+ df["predicted_away_score_raw"] = df["predicted_away_score"]
+
+ # Apply calibration to total
+ df["predicted_total"] = df["predicted_total_raw"] + total_bias
+
+ # Distribute correction to home/away scores proportionally
+ total_raw = df["predicted_home_score_raw"] + df["predicted_away_score_raw"]
+ home_ratio = df["predicted_home_score_raw"] / total_raw
+ away_ratio = df["predicted_away_score_raw"] / total_raw
+
+ df["predicted_home_score"] = df["predicted_home_score_raw"] + (total_bias * home_ratio)
+ df["predicted_away_score"] = df["predicted_away_score_raw"] + (total_bias * away_ratio)
+
+    # Recalculate margin (note: proportional distribution scales the margin by 1 + bias/total)
+ df["predicted_margin"] = df["predicted_home_score"] - df["predicted_away_score"]
+
+ # Add calibration metadata
+ df["calibration_applied"] = True
+ df["total_bias_correction"] = total_bias
+
+ logger.info(f"Applied calibration: +{total_bias:.1f} points to {len(df)} predictions")
+
+ return df
+
+
+def add_validation_warnings(
+ df: pd.DataFrame,
+ kenpom_col: str | None = "kenpom_total",
+ market_col: str | None = "market_total",
+ recent_col: str | None = "recent_avg_total",
+) -> pd.DataFrame:
+ """
+ Add warning flags for predictions that deviate significantly from benchmarks.
+
+ Args:
+ df: Calibrated predictions
+ kenpom_col: Column with KenPom formula totals (optional)
+ market_col: Column with market totals (optional)
+ recent_col: Column with recent game averages (optional)
+
+ Returns:
+ DataFrame with warning columns added
+ """
+ df = df.copy()
+ warnings = []
+
+ # Check KenPom divergence
+ if kenpom_col and kenpom_col in df.columns:
+ kenpom_diff = abs(df["predicted_total"] - df[kenpom_col])
+ df["kenpom_diff"] = kenpom_diff
+ df["kenpom_warning"] = kenpom_diff > MAX_KENPOM_DIFF
+
+ kenpom_flags = df["kenpom_warning"].sum()
+ if kenpom_flags > 0:
+ warnings.append(f"{kenpom_flags} games >15pts from KenPom")
+
+ # Check market divergence
+ if market_col and market_col in df.columns:
+ market_diff = abs(df["predicted_total"] - df[market_col])
+ df["market_diff"] = market_diff
+ df["market_warning"] = market_diff > MAX_MARKET_DIFF
+
+ market_flags = df["market_warning"].sum()
+ if market_flags > 0:
+ warnings.append(f"{market_flags} games >10pts from market")
+
+ # Check recent average divergence
+ if recent_col and recent_col in df.columns:
+ recent_diff = abs(df["predicted_total"] - df[recent_col])
+ df["recent_diff"] = recent_diff
+ df["recent_warning"] = recent_diff > MAX_RECENT_DIFF
+
+ recent_flags = df["recent_warning"].sum()
+ if recent_flags > 0:
+ warnings.append(f"{recent_flags} games >8pts from recent avg")
+
+ # Overall warning flag (any benchmark triggered)
+ warning_cols = [c for c in df.columns if c.endswith("_warning")]
+ if warning_cols:
+ df["any_warning"] = df[warning_cols].any(axis=1)
+ total_warnings = df["any_warning"].sum()
+
+ if total_warnings > 0:
+ logger.warning(f"Validation warnings: {total_warnings}/{len(df)} games flagged")
+ for w in warnings:
+ logger.warning(f" - {w}")
+ else:
+ df["any_warning"] = False
+ logger.info("No validation columns available, skipping warning checks")
+
+ return df
+
+
+def main() -> None:
+ """Main execution."""
+ parser = argparse.ArgumentParser(description="Apply calibration correction to predictions")
+ parser.add_argument(
+ "--input",
+ type=Path,
+ required=True,
+ help="Input predictions CSV (raw model output)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ required=True,
+ help="Output predictions CSV (calibrated)",
+ )
+ parser.add_argument(
+ "--bias",
+ type=float,
+ default=TOTAL_BIAS_CORRECTION,
+ help=f"Total bias correction (default: {TOTAL_BIAS_CORRECTION})",
+ )
+ parser.add_argument(
+ "--validate",
+ action="store_true",
+ help="Add validation warnings (requires benchmark columns)",
+ )
+
+ args = parser.parse_args()
+
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+
+ # Load predictions
+ logger.info(f"Loading predictions from {args.input}")
+ df = pd.read_csv(args.input)
+
+ logger.info(f"Loaded {len(df)} predictions (avg total: {df['predicted_total'].mean():.1f})")
+
+ # Apply calibration
+ df_calibrated = apply_calibration(df, total_bias=args.bias)
+
+ logger.info(f"After calibration: avg total = {df_calibrated['predicted_total'].mean():.1f}")
+
+ # Add validation warnings if requested
+ if args.validate:
+ df_calibrated = add_validation_warnings(df_calibrated)
+
+ # Save output
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+ df_calibrated.to_csv(args.output, index=False)
+
+ logger.info(f"Saved calibrated predictions to {args.output}")
+
+ # Summary statistics
+ print("\n=== CALIBRATION SUMMARY ===")
+ print(f"Input file: {args.input}")
+ print(f"Output file: {args.output}")
+ print(f"Games processed: {len(df_calibrated)}")
+ print(f"Bias correction applied: +{args.bias:.1f} points")
+ print(f"\nOriginal avg total: {df['predicted_total'].mean():.1f}")
+ print(f"Calibrated avg total: {df_calibrated['predicted_total'].mean():.1f}")
+
+ if args.validate and "any_warning" in df_calibrated.columns:
+ warnings = df_calibrated["any_warning"].sum()
+ print(f"\nValidation warnings: {warnings}/{len(df_calibrated)} games flagged")
+
+ if warnings > 0:
+ print("\nGames with warnings:")
+ flagged = df_calibrated[df_calibrated["any_warning"]]
+ for _, game in flagged.iterrows():
+ matchup = f"{game['away_team']} @ {game['home_team']}"
+ pred = game["predicted_total"]
+ reasons = []
+ if game.get("kenpom_warning", False):
+ reasons.append(f"KenPom diff: {game['kenpom_diff']:.1f}")
+ if game.get("market_warning", False):
+ reasons.append(f"Market diff: {game['market_diff']:.1f}")
+ if game.get("recent_warning", False):
+ reasons.append(f"Recent diff: {game['recent_diff']:.1f}")
+
+ print(f" {matchup[:50]:50s} | Total: {pred:5.1f} | {', '.join(reasons)}")
+
+
+if __name__ == "__main__":
+ main()
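The benchmark checks in `add_validation_warnings` all follow one pattern: compute an absolute divergence column, threshold it into a boolean `*_warning` column, then log the flagged count. A minimal standalone sketch of that pattern (the `market_total` column name and the 10-point threshold mirror the script above; the data here is made up):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

MAX_MARKET_DIFF = 10.0  # flag games whose prediction is >10 pts from the market total


def flag_market_divergence(df: pd.DataFrame) -> pd.DataFrame:
    """Add market_diff / market_warning columns and log how many games were flagged."""
    df = df.copy()
    df["market_diff"] = (df["predicted_total"] - df["market_total"]).abs()
    df["market_warning"] = df["market_diff"] > MAX_MARKET_DIFF
    flagged = int(df["market_warning"].sum())
    if flagged:
        logger.warning("%d games >%.0f pts from market", flagged, MAX_MARKET_DIFF)
    return df


df = pd.DataFrame(
    {
        "predicted_total": [142.0, 168.5, 151.0],
        "market_total": [140.5, 155.0, 150.5],
    }
)
out = flag_market_divergence(df)
print(out["market_warning"].tolist())  # → [False, True, False]
```

Because every check writes a `*_warning` column, the script can later aggregate them with a single `df[warning_cols].any(axis=1)` without knowing which benchmarks were available.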
diff --git a/scripts/prediction/apply_context_aware_calibration.py b/scripts/prediction/apply_context_aware_calibration.py
new file mode 100644
index 000000000..2ccee5206
--- /dev/null
+++ b/scripts/prediction/apply_context_aware_calibration.py
@@ -0,0 +1,284 @@
+"""
+Apply context-aware calibration to predictions based on game characteristics.
+
+Instead of a blanket adjustment, this script calibrates based on:
+- Scoring range (low/mid/high)
+- Pace (slow/moderate/fast)
+- Team quality (elite defense, mismatches)
+
+Usage:
+ uv run python scripts/prediction/apply_context_aware_calibration.py \\
+ --input predictions/2026-02-08_fresh.csv \\
+ --output predictions/2026-02-08_context_calibrated.csv
+
+See: docs/CALIBRATION_EXPERT_GUIDE.md for methodology
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+logger = logging.getLogger(__name__)
+
+# Calibration constants (from validation data)
+# These will be updated monthly based on empirical results
+
+# Scoring range adjustments
+BIAS_LOW_SCORING = -10.0 # Model overpredicts low-scoring games (<135)
+BIAS_MID_SCORING = -1.5 # Slight overall overprediction (135-155)
+BIAS_HIGH_SCORING = +5.0 # Model underpredicts high-scoring games (>160)
+
+# Pace adjustments
+BIAS_FAST_PACE = +1.0 # Fast games (>70 tempo) run higher
+BIAS_SLOW_PACE = -2.0 # Slow games (<65 tempo) grind lower
+
+# Team quality adjustments
+BIAS_ELITE_DEFENSE = -3.0 # Both teams AdjD < 95 (elite defense suppresses)
+BIAS_MISMATCH = +2.0 # AdjEM diff > 15 (blowouts run higher)
+
+# Thresholds
+LOW_SCORING_THRESHOLD = 135
+HIGH_SCORING_THRESHOLD = 160
+SLOW_PACE_THRESHOLD = 65
+FAST_PACE_THRESHOLD = 70
+ELITE_DEFENSE_THRESHOLD = 95
+MISMATCH_THRESHOLD = 15
+
+
+def calculate_context_adjustment(
+    row: pd.Series, calibration_config: dict | None = None
+) -> tuple[float, list[str]]:
+ """
+ Calculate calibration adjustment based on game context.
+
+ Args:
+ row: Prediction row with game features
+ calibration_config: Optional dict to override default constants
+
+    Returns:
+        Tuple of (adjustment to add to predicted_total, list of reason labels).
+        The adjustment can be negative.
+ """
+ if calibration_config is None:
+ calibration_config = {}
+
+ adjustment = 0.0
+ reasons = []
+
+ predicted_total = row["predicted_total"]
+
+ # 1. Scoring range adjustment (most important)
+ if predicted_total < LOW_SCORING_THRESHOLD:
+ bias = calibration_config.get("bias_low_scoring", BIAS_LOW_SCORING)
+ adjustment += bias
+ reasons.append(f"low_scoring:{bias:+.1f}")
+ elif predicted_total > HIGH_SCORING_THRESHOLD:
+ bias = calibration_config.get("bias_high_scoring", BIAS_HIGH_SCORING)
+ adjustment += bias
+ reasons.append(f"high_scoring:{bias:+.1f}")
+ else:
+ bias = calibration_config.get("bias_mid_scoring", BIAS_MID_SCORING)
+ adjustment += bias
+ reasons.append(f"mid_scoring:{bias:+.1f}")
+
+ # 2. Pace adjustment (if tempo features available)
+ if "avg_tempo" in row.index and pd.notna(row["avg_tempo"]):
+ avg_tempo = row["avg_tempo"]
+ if avg_tempo > FAST_PACE_THRESHOLD:
+ bias = calibration_config.get("bias_fast_pace", BIAS_FAST_PACE)
+ adjustment += bias
+ reasons.append(f"fast_pace:{bias:+.1f}")
+ elif avg_tempo < SLOW_PACE_THRESHOLD:
+ bias = calibration_config.get("bias_slow_pace", BIAS_SLOW_PACE)
+ adjustment += bias
+ reasons.append(f"slow_pace:{bias:+.1f}")
+
+ # 3. Elite defense adjustment (if defense features available)
+ if (
+ "home_adj_d" in row.index
+ and "away_adj_d" in row.index
+ and pd.notna(row["home_adj_d"])
+ and pd.notna(row["away_adj_d"])
+ ):
+ both_elite_defense = (
+ row["home_adj_d"] < ELITE_DEFENSE_THRESHOLD
+ and row["away_adj_d"] < ELITE_DEFENSE_THRESHOLD
+ )
+ if both_elite_defense:
+ bias = calibration_config.get("bias_elite_defense", BIAS_ELITE_DEFENSE)
+ adjustment += bias
+ reasons.append(f"elite_defense:{bias:+.1f}")
+
+ # 4. Mismatch adjustment (if EM diff available)
+ if (
+ "home_adj_em" in row.index
+ and "away_adj_em" in row.index
+ and pd.notna(row["home_adj_em"])
+ and pd.notna(row["away_adj_em"])
+ ):
+ em_diff = abs(row["home_adj_em"] - row["away_adj_em"])
+ if em_diff > MISMATCH_THRESHOLD:
+ bias = calibration_config.get("bias_mismatch", BIAS_MISMATCH)
+ adjustment += bias
+ reasons.append(f"mismatch:{bias:+.1f}")
+
+ return adjustment, reasons
+
+
+def apply_context_aware_calibration(
+ df: pd.DataFrame, calibration_config: dict | None = None
+) -> pd.DataFrame:
+ """
+ Apply context-aware calibration to all predictions.
+
+ Args:
+ df: Predictions DataFrame
+ calibration_config: Optional calibration constants override
+
+ Returns:
+ DataFrame with calibrated predictions and adjustment details
+ """
+ df = df.copy()
+
+ # Store original predictions
+ df["predicted_total_raw"] = df["predicted_total"]
+ df["predicted_home_score_raw"] = df["predicted_home_score"]
+ df["predicted_away_score_raw"] = df["predicted_away_score"]
+
+ # Calculate adjustments
+ adjustments = []
+ adjustment_reasons = []
+
+ for _idx, row in df.iterrows():
+ adj, reasons = calculate_context_adjustment(row, calibration_config)
+ adjustments.append(adj)
+ adjustment_reasons.append(" | ".join(reasons) if reasons else "baseline")
+
+ df["calibration_adjustment"] = adjustments
+ df["calibration_reasons"] = adjustment_reasons
+
+ # Apply calibration to total
+ df["predicted_total"] = df["predicted_total_raw"] + df["calibration_adjustment"]
+
+ # Distribute adjustment to home/away scores proportionally
+ total_raw = df["predicted_home_score_raw"] + df["predicted_away_score_raw"]
+ home_ratio = df["predicted_home_score_raw"] / total_raw
+ away_ratio = df["predicted_away_score_raw"] / total_raw
+
+ df["predicted_home_score"] = df["predicted_home_score_raw"] + (
+ df["calibration_adjustment"] * home_ratio
+ )
+ df["predicted_away_score"] = df["predicted_away_score_raw"] + (
+ df["calibration_adjustment"] * away_ratio
+ )
+
+    # Recalculate margin (it shifts slightly: the adjustment is split by each team's share of the total)
+ df["predicted_margin"] = df["predicted_home_score"] - df["predicted_away_score"]
+
+ # Summary statistics
+ logger.info(f"Applied context-aware calibration to {len(df)} predictions")
+ logger.info(f"Average adjustment: {df['calibration_adjustment'].mean():+.2f} points")
+ logger.info(
+ f"Adjustment range: {df['calibration_adjustment'].min():+.1f} to "
+ f"{df['calibration_adjustment'].max():+.1f}"
+ )
+
+ # Breakdown by adjustment type
+ low_scoring_count = (df["predicted_total_raw"] < LOW_SCORING_THRESHOLD).sum()
+ high_scoring_count = (df["predicted_total_raw"] > HIGH_SCORING_THRESHOLD).sum()
+ mid_scoring_count = len(df) - low_scoring_count - high_scoring_count
+
+ logger.info(
+ f"Scoring breakdown: {low_scoring_count} low, "
+ f"{mid_scoring_count} mid, {high_scoring_count} high"
+ )
+
+ return df
+
+
+def main() -> None:
+ """Main execution."""
+ parser = argparse.ArgumentParser(description="Apply context-aware calibration to predictions")
+ parser.add_argument("--input", type=Path, required=True, help="Input predictions CSV (raw)")
+ parser.add_argument(
+ "--output",
+ type=Path,
+ required=True,
+ help="Output predictions CSV (context-calibrated)",
+ )
+ parser.add_argument(
+ "--config",
+ type=Path,
+ help="Optional JSON config file with calibration constants",
+ )
+ parser.add_argument("--verbose", action="store_true", help="Show detailed adjustment breakdown")
+
+ args = parser.parse_args()
+
+ logging.basicConfig(
+ level=logging.INFO if not args.verbose else logging.DEBUG,
+ format="%(levelname)s - %(message)s",
+ )
+
+ # Load predictions
+ logger.info(f"Loading predictions from {args.input}")
+ df = pd.read_csv(args.input)
+ logger.info(f"Loaded {len(df)} predictions (avg total: {df['predicted_total'].mean():.1f})")
+
+ # Load calibration config if provided
+ calibration_config = None
+ if args.config:
+ import json
+
+ with open(args.config) as f:
+ calibration_config = json.load(f)
+ logger.info(f"Loaded calibration config from {args.config}")
+
+ # Apply calibration
+ df_calibrated = apply_context_aware_calibration(df, calibration_config)
+
+ logger.info(f"After calibration: avg total = {df_calibrated['predicted_total'].mean():.1f}")
+
+ # Save output
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+ df_calibrated.to_csv(args.output, index=False)
+ logger.info(f"Saved calibrated predictions to {args.output}")
+
+ # Print summary
+ print("\n=== CONTEXT-AWARE CALIBRATION SUMMARY ===")
+ print(f"Input file: {args.input}")
+ print(f"Output file: {args.output}")
+ print(f"Games processed: {len(df_calibrated)}")
+ print(f"\nOriginal avg total: {df['predicted_total'].mean():.1f}")
+ print(f"Calibrated avg total: {df_calibrated['predicted_total'].mean():.1f}")
+ print(f"Average adjustment: {df_calibrated['calibration_adjustment'].mean():+.2f}")
+
+ if args.verbose:
+ print("\n=== SAMPLE ADJUSTMENTS ===")
+ # Show a few examples from each category
+ for category, threshold_low, threshold_high in [
+ ("Low Scoring", 0, LOW_SCORING_THRESHOLD),
+ ("Mid Scoring", LOW_SCORING_THRESHOLD, HIGH_SCORING_THRESHOLD),
+ ("High Scoring", HIGH_SCORING_THRESHOLD, 300),
+ ]:
+ subset = df_calibrated[
+ (df_calibrated["predicted_total_raw"] >= threshold_low)
+ & (df_calibrated["predicted_total_raw"] < threshold_high)
+ ]
+ if len(subset) > 0:
+ print(f"\n{category} Games (N={len(subset)}):")
+ sample = subset.head(3)
+ for _, game in sample.iterrows():
+ matchup = f"{game['away_team']} @ {game['home_team']}"
+ print(f" {matchup[:50]:50s}")
+ print(
+ f" Raw: {game['predicted_total_raw']:.1f} | "
+ f"Calibrated: {game['predicted_total']:.1f} | "
+ f"Adjustment: {game['calibration_adjustment']:+.1f}"
+ )
+ print(f" Reasons: {game['calibration_reasons']}")
+
+
+if __name__ == "__main__":
+ main()
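The proportional-split step in `apply_context_aware_calibration` is worth seeing in isolation: the total moves by exactly the adjustment, but since each team absorbs the adjustment in proportion to its share of scoring, the margin scales slightly rather than staying fixed. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "predicted_home_score_raw": [80.0, 70.0],
        "predicted_away_score_raw": [60.0, 70.0],
        "calibration_adjustment": [-10.0, 5.0],
    }
)

# Split the total adjustment by each team's share of the raw total
total_raw = df["predicted_home_score_raw"] + df["predicted_away_score_raw"]
home_ratio = df["predicted_home_score_raw"] / total_raw
away_ratio = df["predicted_away_score_raw"] / total_raw

df["predicted_home_score"] = df["predicted_home_score_raw"] + df["calibration_adjustment"] * home_ratio
df["predicted_away_score"] = df["predicted_away_score_raw"] + df["calibration_adjustment"] * away_ratio

# The calibrated total shifts by exactly the adjustment (140-10, 140+5)...
new_total = df["predicted_home_score"] + df["predicted_away_score"]
print(new_total.round(1).tolist())  # → [130.0, 145.0]

# ...while the margin scales with the adjustment instead of staying fixed
# (game 1: raw margin 20.0 shrinks, since the home side absorbs more of the -10)
margin = (df["predicted_home_score"] - df["predicted_away_score"]).round(2)
print(margin.tolist())
```

This is the design tradeoff of a ratio split: totals calibration stays exact, at the cost of a small, systematic nudge to margins on lopsided scores.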
diff --git a/scripts/prediction/deploy_simple_predictions.py b/scripts/prediction/deploy_simple_predictions.py
new file mode 100644
index 000000000..435206f87
--- /dev/null
+++ b/scripts/prediction/deploy_simple_predictions.py
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+"""Simplified deployment using only KenPom baseline features.
+
+Uses KenPom efficiency margins as the primary predictor, which has
+shown strong performance in our edge analysis.
+"""
+
+import sys
+from datetime import date
+from pathlib import Path
+
+import pandas as pd
+
+ANALYSIS_DIR = Path("data/analysis")
+MIN_KENPOM_EDGE = 3.5 # Minimum KenPom vs market discrepancy
+
+
+def american_to_prob(odds: int) -> float:
+ """Convert American odds to implied probability."""
+ if odds < 0:
+ return abs(odds) / (abs(odds) + 100)
+ return 100 / (odds + 100)
+
+
+def prob_to_american(prob: float) -> int:
+ """Convert probability to American odds."""
+ if prob >= 0.5:
+ return int(-prob / (1 - prob) * 100)
+ return int((1 - prob) / prob * 100)
+
+
+print("=" * 80)
+print("KENPOM-BASED DEPLOYMENT - TODAY'S GAMES")
+print("=" * 80)
+
+# Load today's edge analysis (use CORRECTED file)
+today = date.today().isoformat()
+edges_path = ANALYSIS_DIR / f"edge_analysis_{today}_CORRECTED.csv"
+
+if not edges_path.exists():
+ print(f"\n[ERROR] Edge analysis not found: {edges_path}")
+ print("Run edge finding first")
+ sys.exit(1)
+
+df = pd.read_csv(edges_path)
+print(f"\n[OK] Loaded {len(df)} games with KenPom analysis")
+
+# Calculate market margin (convert spread to same convention as KenPom)
+df["market_margin"] = -1 * df["home_spread_num"]
+
+# Recalculate discrepancy (KenPom - Market)
+df["discrepancy"] = df["kenpom_margin"] - df["market_margin"]
+df["abs_discrepancy"] = df["discrepancy"].abs()
+
+# Filter to actionable edges (KenPom disagrees with market by 3.5+ points)
+edges = df[df["abs_discrepancy"] >= MIN_KENPOM_EDGE].copy()
+
+print(f"[OK] Found {len(edges)} games with {MIN_KENPOM_EDGE}+ point KenPom edges")
+
+if len(edges) == 0:
+ print("\n[INFO] No strong edges found today")
+ sys.exit(0)
+
+# Sort by edge magnitude
+edges = edges.sort_values("abs_discrepancy", ascending=False)
+
+print("\n" + "=" * 80)
+print("RECOMMENDED PLAYS (Based on KenPom Analysis)")
+print("=" * 80)
+
+for _idx, game in edges.iterrows():
+ print(f"\n{game['game_time']}")
+ print(f"{game['away_team']} @ {game['home_team']}")
+
+ disc = game["discrepancy"]
+ abs_disc = game["abs_discrepancy"]
+
+    if disc > 0:
+        # KenPom favors the home side more than the market does
+        pick = game["home_team"]
+        line = game["home_spread_num"]
+        odds = game["home_spread_juice"]
+    else:
+        # KenPom favors the away side more than the market does
+        pick = game["away_team"]
+        line = -game["home_spread_num"]
+        odds = game["away_spread_juice"]
+
+    # Compare like with like: both margins use the home-team perspective
+    reason = (
+        f"KenPom margin {game['kenpom_margin']:+.1f} vs "
+        f"market margin {game['market_margin']:+.1f}"
+    )
+
+ # Calculate implied edge
+ # Assume KenPom is "true" probability
+ # Edge = difference in point spread / typical point value (~2.5 pts = 10% prob)
+ prob_edge = abs_disc / 2.5 * 0.10 # Rough conversion
+
+ print(f" PLAY: {pick} {line:+.1f} ({odds:+.0f})")
+ print(f" Reason: {reason}")
+ print(f" KenPom Edge: {abs_disc:.1f} points")
+ print(f" Est. Probability Edge: ~{prob_edge:.1%}")
+ print(f" Strength: {game['edge_strength']}")
+
+# Summary
+print("\n" + "=" * 80)
+print("SUMMARY")
+print("=" * 80)
+print(f"Total recommended plays: {len(edges)}")
+print(f"Average KenPom edge: {edges['abs_discrepancy'].mean():.1f} points")
+print(f"Largest edge: {edges['abs_discrepancy'].max():.1f} points")
+
+print("\n[APPROACH]")
+print("These plays are based on KenPom efficiency margins vs market spreads.")
+print("KenPom has historically been more accurate than opening lines.")
+print("Edges of 3.5+ points represent significant market inefficiencies.")
+
+print("\n[RISK MANAGEMENT]")
+print("- Start with small units (0.5-1% bankroll)")
+print("- Track results to validate approach")
+print("- Focus on games with 5+ point edges for highest confidence")
+print("- Monitor line movements (sharp money confirmation)")
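The `american_to_prob` helper above is the same conversion the project rules require for moneyline aggregation. One property worth remembering: the two sides of a quote convert to probabilities summing past 100%, and the excess is the book's vig. That is also why 52.4% (the implied probability of -110) serves as the break-even threshold elsewhere in these scripts. A quick standalone check:

```python
def american_to_prob(odds: int) -> float:
    """Convert American odds to implied probability (vig included)."""
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)


# Each side of a standard -110/-110 market implies ~52.38%...
print(round(american_to_prob(-110), 4))  # → 0.5238

# ...so the two sides sum past 100%; the ~4.8% excess is the vig
print(round(american_to_prob(-110) * 2, 4))  # → 1.0476
```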
diff --git a/scripts/prediction/deploy_today_predictions.py b/scripts/prediction/deploy_today_predictions.py
new file mode 100644
index 000000000..dca167d83
--- /dev/null
+++ b/scripts/prediction/deploy_today_predictions.py
@@ -0,0 +1,429 @@
+#!/usr/bin/env python3
+"""Deploy models to generate predictions for today's games.
+
+Loads trained models, applies to today's games with KenPom features,
+calculates expected value (EV) vs market odds, and recommends plays.
+
+Usage:
+    uv run python scripts/prediction/deploy_today_predictions.py
+    uv run python scripts/prediction/deploy_today_predictions.py --min-ev 0.02  # 2% edge minimum
+"""
+
+import argparse
+import sys
+from datetime import date
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+
+# Configuration
+MODELS_DIR = Path("data/models")
+ANALYSIS_DIR = Path("data/analysis")
+MIN_EV_DEFAULT = 0.015 # 1.5% minimum expected value (ROI)
+MIN_PROBABILITY = 0.45 # Don't recommend extreme longshots
+MAX_PROBABILITY = 0.70 # Don't recommend heavy favorites (juice too high)
+
+print("=" * 80)
+print("MODEL DEPLOYMENT - TODAY'S GAMES")
+print("=" * 80)
+
+
+def american_to_implied_prob(odds: int) -> float:
+ """Convert American odds to implied probability."""
+ if odds < 0:
+ return abs(odds) / (abs(odds) + 100)
+ else:
+ return 100 / (odds + 100)
+
+
+def prob_to_american_odds(prob: float) -> int:
+ """Convert probability to fair American odds."""
+ if prob >= 0.5:
+ return int(-prob / (1 - prob) * 100)
+ else:
+ return int((1 - prob) / prob * 100)
+
+
+def calculate_ev(model_prob: float, market_odds: int) -> float:
+ """Calculate expected value (ROI) for a bet.
+
+ Args:
+ model_prob: Model's predicted probability of winning
+ market_odds: American odds offered by bookmaker
+
+ Returns:
+ Expected value as decimal (0.05 = 5% ROI)
+ """
+ if market_odds < 0:
+ # Favorite - risk abs(odds) to win 100
+ win_amount = 100 / abs(market_odds)
+ return (model_prob * win_amount) - ((1 - model_prob) * 1)
+ else:
+ # Underdog - risk 100 to win odds
+ win_amount = market_odds / 100
+ return (model_prob * win_amount) - ((1 - model_prob) * 1)
+
+
+def load_models():
+ """Load trained models."""
+ print("\n[1/5] Loading trained models...")
+
+ spreads_model_path = MODELS_DIR / "spreads_model.json"
+ totals_model_path = MODELS_DIR / "totals_model.json"
+
+ if not spreads_model_path.exists():
+ print(f" [ERROR] Spreads model not found: {spreads_model_path}")
+ print(" Run training first: uv run python scripts/walk_forward_training.py")
+ return None, None
+
+ spreads_model = xgb.XGBClassifier()
+ spreads_model.load_model(str(spreads_model_path))
+ print(" [OK] Loaded spreads model")
+
+ totals_model = None
+ if totals_model_path.exists():
+ totals_model = xgb.XGBClassifier()
+ totals_model.load_model(str(totals_model_path))
+ print(" [OK] Loaded totals model")
+ else:
+ print(" [WARN] Totals model not found, spreads only")
+
+ return spreads_model, totals_model
+
+
+def load_todays_games():
+ """Load today's games with odds and KenPom data."""
+ print("\n[2/5] Loading today's games...")
+
+ today = date.today().isoformat()
+ games_path = ANALYSIS_DIR / f"complete_analysis_{today}_main_lines.csv"
+
+ if not games_path.exists():
+ print(f" [ERROR] Today's games not found: {games_path}")
+ print(" Run data merge first")
+ return None
+
+ df = pd.read_csv(games_path)
+ print(f" [OK] Loaded {len(df)} games")
+
+ # Filter to games with KenPom data
+ df_kp = df[df["kenpom_margin"].notna()].copy()
+ print(f" [OK] {len(df_kp)} games with KenPom data")
+
+ return df_kp
+
+
+def prepare_spreads_features(df):
+ """Prepare features for spreads model (favorite/underdog perspective)."""
+ print("\n[3/5] Preparing spreads features...")
+
+ features_list = []
+
+ for idx, row in df.iterrows():
+ home_spread = row["home_spread"]
+
+ # Determine favorite/underdog
+ is_home_fav = home_spread < 0
+
+ if is_home_fav:
+ # Home is favorite
+ fav_adjoe = row["home_adjoe"]
+ fav_adjde = row["home_adjde"]
+ fav_adjem = row["home_adjem"]
+ fav_tempo = row["home_tempo"]
+ dog_adjoe = row["away_adjoe"]
+ dog_adjde = row["away_adjde"]
+ dog_adjem = row["away_adjem"]
+ dog_tempo = row["away_tempo"]
+ else:
+ # Away is favorite
+ fav_adjoe = row["away_adjoe"]
+ fav_adjde = row["away_adjde"]
+ fav_adjem = row["away_adjem"]
+ fav_tempo = row["away_tempo"]
+ dog_adjoe = row["home_adjoe"]
+ dog_adjde = row["home_adjde"]
+ dog_adjem = row["home_adjem"]
+ dog_tempo = row["home_tempo"]
+
+ # Build feature dict matching training data format
+ features = {
+ "game_idx": idx,
+ "fav_adj_em": fav_adjem,
+ "fav_adj_o": fav_adjoe,
+ "fav_adj_d": fav_adjde,
+ "fav_adj_t": fav_tempo,
+ "dog_adj_em": dog_adjem,
+ "dog_adj_o": dog_adjoe,
+ "dog_adj_d": dog_adjde,
+ "dog_adj_t": dog_tempo,
+ "em_diff": fav_adjem - dog_adjem,
+ "closing_spread": abs(home_spread),
+ }
+
+ features_list.append(features)
+
+ features_df = pd.DataFrame(features_list)
+ print(f" [OK] Prepared {len(features_df)} games with {len(features_df.columns) - 1} features")
+
+ return features_df
+
+
+def prepare_totals_features(df):
+ """Prepare features for totals model (home/away perspective)."""
+ print("\n[3/5] Preparing totals features...")
+
+ features_list = []
+
+ for idx, row in df.iterrows():
+ # Build feature dict matching training data format
+ features = {
+ "game_idx": idx,
+ "away_adj_em": row["away_adjem"],
+ "away_adj_o": row["away_adjoe"],
+ "away_adj_d": row["away_adjde"],
+ "away_adj_t": row["away_tempo"],
+ "home_adj_em": row["home_adjem"],
+ "home_adj_o": row["home_adjoe"],
+ "home_adj_d": row["home_adjde"],
+ "home_adj_t": row["home_tempo"],
+ "tempo_avg": (row["away_tempo"] + row["home_tempo"]) / 2,
+ "closing_total": row["total"],
+ }
+
+ features_list.append(features)
+
+ features_df = pd.DataFrame(features_list)
+ print(f" [OK] Prepared {len(features_df)} games with {len(features_df.columns) - 1} features")
+
+ return features_df
+
+
+def generate_predictions(spreads_model, totals_model, games_df):
+ """Generate predictions for all games."""
+ print("\n[4/5] Generating predictions...")
+
+ predictions = []
+
+ # Spreads predictions
+ if spreads_model is not None and len(games_df) > 0:
+ spreads_features = prepare_spreads_features(games_df)
+
+ # Get feature columns (exclude game_idx)
+ feature_cols = [col for col in spreads_features.columns if col != "game_idx"]
+ X = spreads_features[feature_cols].fillna(0)
+
+ # Predict
+ probs = spreads_model.predict_proba(X)[:, 1] # Probability favorite covers
+
+ for i, (_idx, row) in enumerate(games_df.iterrows()):
+ fav_cover_prob = probs[i]
+ home_spread = row["home_spread"]
+ is_home_fav = home_spread < 0
+
+ # Determine recommended side
+ if is_home_fav:
+ # Home is favorite
+ home_win_prob = fav_cover_prob
+ away_win_prob = 1 - fav_cover_prob
+ home_odds = row["home_spread_juice"]
+ away_odds = row["away_spread_juice"]
+ else:
+ # Away is favorite
+ away_win_prob = fav_cover_prob
+ home_win_prob = 1 - fav_cover_prob
+ away_odds = row["away_spread_juice"]
+ home_odds = row["home_spread_juice"]
+
+ # Calculate EV for both sides
+            home_ev = calculate_ev(home_win_prob, home_odds) if not np.isnan(home_odds) else float("-inf")
+            away_ev = calculate_ev(away_win_prob, away_odds) if not np.isnan(away_odds) else float("-inf")
+
+ # Recommend side with positive EV
+ if home_ev > away_ev and home_ev > 0:
+ predictions.append(
+ {
+ "game_time": row["game_time"],
+ "away_team": row["away_team"],
+ "home_team": row["home_team"],
+ "bet_type": "SPREAD",
+ "pick": row["home_team"],
+ "line": f"{row['home_spread']:+.1f}",
+ "odds": int(home_odds),
+ "model_prob": home_win_prob,
+ "fair_odds": prob_to_american_odds(home_win_prob),
+ "ev": home_ev,
+ "roi_pct": home_ev * 100,
+ }
+ )
+ elif away_ev > 0:
+ predictions.append(
+ {
+ "game_time": row["game_time"],
+ "away_team": row["away_team"],
+ "home_team": row["home_team"],
+ "bet_type": "SPREAD",
+ "pick": row["away_team"],
+ "line": f"{-row['home_spread']:+.1f}",
+ "odds": int(away_odds),
+ "model_prob": away_win_prob,
+ "fair_odds": prob_to_american_odds(away_win_prob),
+ "ev": away_ev,
+ "roi_pct": away_ev * 100,
+ }
+ )
+
+ # Totals predictions
+ if totals_model is not None and len(games_df) > 0:
+ totals_features = prepare_totals_features(games_df)
+
+ feature_cols = [col for col in totals_features.columns if col != "game_idx"]
+ X = totals_features[feature_cols].fillna(0)
+
+ probs = totals_model.predict_proba(X)[:, 1] # Probability over hits
+
+ for i, (_idx, row) in enumerate(games_df.iterrows()):
+ over_prob = probs[i]
+ under_prob = 1 - over_prob
+
+ over_odds = row["over_juice"]
+ under_odds = row["under_juice"]
+
+ if pd.isna(over_odds) or pd.isna(under_odds):
+ continue
+
+ over_ev = calculate_ev(over_prob, over_odds)
+ under_ev = calculate_ev(under_prob, under_odds)
+
+ if over_ev > under_ev and over_ev > 0:
+ predictions.append(
+ {
+ "game_time": row["game_time"],
+ "away_team": row["away_team"],
+ "home_team": row["home_team"],
+ "bet_type": "TOTAL",
+ "pick": "OVER",
+ "line": f"O{row['total']:.1f}",
+ "odds": int(over_odds),
+ "model_prob": over_prob,
+ "fair_odds": prob_to_american_odds(over_prob),
+ "ev": over_ev,
+ "roi_pct": over_ev * 100,
+ }
+ )
+ elif under_ev > 0:
+ predictions.append(
+ {
+ "game_time": row["game_time"],
+ "away_team": row["away_team"],
+ "home_team": row["home_team"],
+ "bet_type": "TOTAL",
+ "pick": "UNDER",
+ "line": f"U{row['total']:.1f}",
+ "odds": int(under_odds),
+ "model_prob": under_prob,
+ "fair_odds": prob_to_american_odds(under_prob),
+ "ev": under_ev,
+ "roi_pct": under_ev * 100,
+ }
+ )
+
+ print(f" [OK] Generated {len(predictions)} predictions")
+ return pd.DataFrame(predictions)
+
+
+def display_recommendations(predictions_df, min_ev):
+ """Display ranked recommendations."""
+ print("\n[5/5] Recommendations (Ranked by EV)...")
+ print("=" * 80)
+
+ # Filter by minimum EV and probability bounds
+ filtered = predictions_df[
+ (predictions_df["ev"] >= min_ev)
+ & (predictions_df["model_prob"] >= MIN_PROBABILITY)
+ & (predictions_df["model_prob"] <= MAX_PROBABILITY)
+ ]
+
+ if len(filtered) == 0:
+ print("\n[INFO] No plays meet EV threshold")
+ print(f" Minimum EV: {min_ev * 100:.1f}%")
+ print(" Try lowering threshold: --min-ev 0.01")
+ return
+
+ # Sort by EV descending
+ filtered = filtered.sort_values("ev", ascending=False)
+
+ print(f"\n{len(filtered)} RECOMMENDED PLAYS (min EV: {min_ev * 100:.1f}%)")
+ print("=" * 80)
+
+ for _i, row in filtered.iterrows():
+ print(f"\n{row['game_time']}")
+ print(f"{row['away_team']} @ {row['home_team']}")
+ print(f" BET: {row['pick']} {row['line']} ({row['odds']:+d})")
+ print(f" Model probability: {row['model_prob']:.1%}")
+ print(f" Fair odds: {row['fair_odds']:+d}")
+ print(f" Market odds: {row['odds']:+d}")
+ print(f" Expected Value: {row['roi_pct']:+.2f}% ROI")
+ print(f" Type: {row['bet_type']}")
+
+ # Save to file
+ output_path = ANALYSIS_DIR / f"predictions_{date.today().isoformat()}.csv"
+ predictions_df.to_csv(output_path, index=False)
+ print(f"\n[SAVED] All predictions -> {output_path}")
+
+ # Summary
+ print("\n" + "=" * 80)
+ print("SUMMARY")
+ print("=" * 80)
+ print(f"Total plays recommended: {len(filtered)}")
+ print(f"Average EV: {filtered['ev'].mean() * 100:+.2f}% ROI")
+ best_play = filtered.iloc[0]
+ best_desc = f"{best_play['pick']} {best_play['line']}"
+ best_ev = f"{best_play['roi_pct']:+.1f}% EV"
+ print(f"Best play: {best_desc} ({best_ev})")
+
+ spreads_count = len(filtered[filtered["bet_type"] == "SPREAD"])
+ totals_count = len(filtered[filtered["bet_type"] == "TOTAL"])
+ print("\nBy type:")
+ print(f" Spreads: {spreads_count}")
+ print(f" Totals: {totals_count}")
+
+
+def main():
+ """Main deployment pipeline."""
+ parser = argparse.ArgumentParser(description="Deploy models for today's games")
+ parser.add_argument(
+ "--min-ev",
+ type=float,
+ default=MIN_EV_DEFAULT,
+ help=f"Minimum expected value (ROI) to recommend (default: {MIN_EV_DEFAULT})",
+ )
+ args = parser.parse_args()
+
+ # Load models
+ spreads_model, totals_model = load_models()
+ if spreads_model is None:
+ return 1
+
+ # Load today's games
+ games_df = load_todays_games()
+ if games_df is None or len(games_df) == 0:
+ return 1
+
+ # Generate predictions
+ predictions_df = generate_predictions(spreads_model, totals_model, games_df)
+
+ if len(predictions_df) == 0:
+ print("\n[INFO] No positive EV plays found")
+ return 0
+
+ # Display recommendations
+ display_recommendations(predictions_df, args.min_ev)
+
+ return 0
+
+
+if __name__ == "__main__":
+ sys.exit(main())
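The EV formula in `calculate_ev` normalizes both favorite and underdog bets to one unit of risk: `win_amount` is the profit per unit staked, so EV = p·win_amount − (1 − p). A standalone sketch of the same function (the numbers are illustrative):

```python
def calculate_ev(model_prob: float, market_odds: int) -> float:
    """Expected value (ROI per unit staked) of a bet at American odds."""
    if market_odds < 0:
        win_amount = 100 / abs(market_odds)  # favorite: risk |odds| to win 100
    else:
        win_amount = market_odds / 100  # underdog: risk 100 to win odds
    return model_prob * win_amount - (1 - model_prob)


# At -110 the break-even probability is ~52.4%; a 55% model probability is worth +5% ROI
print(round(calculate_ev(0.55, -110), 4))  # → 0.05

# The same 55% on an underdog at +120 is worth much more
print(round(calculate_ev(0.55, 120), 4))  # → 0.21
```

This is why the deploy script can compare `home_ev` and `away_ev` directly: both are already per-unit ROIs regardless of which side laid the juice.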
diff --git a/scripts/prediction/generate_betting_card.py b/scripts/prediction/generate_betting_card.py
new file mode 100644
index 000000000..f6b3238db
--- /dev/null
+++ b/scripts/prediction/generate_betting_card.py
@@ -0,0 +1,473 @@
+"""Generate daily betting card with model predictions vs market lines.
+
+Creates a clear, concise betting card showing:
+- Game matchups with times
+- Market lines (FanDuel odds)
+- Model's projected spread and total
+- Edge/value opportunities highlighted
+
+Usage:
+    uv run python scripts/prediction/generate_betting_card.py
+    uv run python scripts/prediction/generate_betting_card.py --date 2026-02-01
+    uv run python scripts/prediction/generate_betting_card.py --output betting-cards/2026-02-01.txt
+"""
+
+from __future__ import annotations
+
+import argparse
+from datetime import date
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+# ANSI color codes for terminal output
+RESET = "\033[0m"
+BOLD = "\033[1m"
+GREEN = "\033[92m"
+RED = "\033[91m"
+YELLOW = "\033[93m"
+CYAN = "\033[96m"
+
+
+def estimate_model_spread(market_spread: float, fav_cover_prob: float) -> float:
+ """Estimate the model's fair spread based on cover probability.
+
+ Uses approximation: each 10% edge ≈ 2.5 points of spread difference.
+
+ Args:
+ market_spread: Current market spread (positive value)
+ fav_cover_prob: Probability of favorite covering market spread
+
+ Returns:
+ Model's estimated fair spread (positive = favorite)
+ """
+ # Edge from break-even (52.4% at -110 odds)
+ edge = fav_cover_prob - 0.524
+
+ # Approximate point adjustment: 10% edge ≈ 2.5 points
+ point_adjustment = edge * 25.0
+
+ # Model's fair spread
+ model_spread = market_spread + point_adjustment
+
+ # Round to nearest 0.5
+ return round(model_spread * 2) / 2
+
+
+def get_model_pick(prob: float, threshold: float = 0.524) -> tuple[str, str]:
+ """Determine model's pick based on probability vs break-even.
+
+ Args:
+ prob: Probability of favorite covering (0-1)
+ threshold: Break-even probability at -110 odds (default: 0.524)
+
+ Returns:
+ (side, confidence) tuple where side is 'FAV', 'DOG', or 'PASS'
+ and confidence is strength descriptor
+ """
+ if prob > threshold + 0.10: # > 62.4%
+ return "FAV", "STRONG"
+ elif prob > threshold: # > 52.4%
+ return "FAV", "LEAN"
+ elif prob < threshold - 0.10: # < 42.4%
+ return "DOG", "STRONG"
+ elif prob < threshold: # < 52.4%
+ return "DOG", "LEAN"
+ else:
+ return "PASS", "NEUTRAL"
+
+
+def estimate_model_total(market_total: float, over_prob: float) -> float:
+ """Estimate the model's fair total based on over probability.
+
+ Uses approximation: each 10% edge ≈ 3-4 points of total difference.
+
+ Args:
+ market_total: Current market total
+ over_prob: Probability of going over market total
+
+ Returns:
+ Model's estimated fair total
+ """
+ # Edge from break-even (52.4% at -110 odds)
+ edge = over_prob - 0.524
+
+ # Approximate point adjustment: 10% edge ≈ 3.5 points
+ point_adjustment = edge * 35.0
+
+ # Model's fair total
+ model_total = market_total + point_adjustment
+
+ # Round to nearest 0.5
+ return round(model_total * 2) / 2
+
+
+def get_total_pick(prob: float, threshold: float = 0.524) -> tuple[str, str]:
+ """Determine model's total pick based on probability vs break-even.
+
+ Args:
+ prob: Probability of going over (0-1)
+ threshold: Break-even probability at -110 odds (default: 0.524)
+
+ Returns:
+ (side, confidence) tuple where side is 'OVER', 'UNDER', or 'PASS'
+ """
+ if prob > threshold + 0.10: # > 62.4%
+ return "OVER", "STRONG"
+ elif prob > threshold: # > 52.4%
+ return "OVER", "LEAN"
+ elif prob < threshold - 0.10: # < 42.4%
+ return "UNDER", "STRONG"
+ elif prob < threshold: # < 52.4%
+ return "UNDER", "LEAN"
+ else:
+ return "PASS", "NEUTRAL"
+
+
+def format_edge(edge: float, threshold: float = 0.05) -> str:
+ """Format edge with color coding.
+
+ Args:
+ edge: Edge value (-1 to 1)
+ threshold: Minimum edge to highlight
+
+ Returns:
+ Formatted string with color
+ """
+ edge_pct = edge * 100
+
+ if abs(edge) < threshold:
+ return f"{edge_pct:+5.1f}%"
+ elif edge > 0:
+ return f"{GREEN}{edge_pct:+5.1f}%{RESET}"
+ else:
+ return f"{RED}{edge_pct:+5.1f}%{RESET}"
+
+
+def format_spread_comparison(
+ favorite: str,
+ underdog: str,
+ market_spread: float,
+ fav_prob: float,
+ edge: float,
+) -> str:
+ """Format spread comparison line.
+
+ Args:
+ favorite: Favorite team name
+ underdog: Underdog team name
+ market_spread: Market spread magnitude
+ fav_prob: Probability of favorite covering
+ edge: Model edge
+
+ Returns:
+ Formatted comparison string
+ """
+ # Calculate model's estimated fair spread
+ model_spread = estimate_model_spread(market_spread, fav_prob)
+
+ # Determine model's pick
+ pick_side, confidence = get_model_pick(fav_prob)
+
+ if pick_side == "FAV":
+ if confidence == "STRONG":
+ color = GREEN
+ indicator = "FAV**"
+ else:
+ color = CYAN
+ indicator = "FAV*"
+ elif pick_side == "DOG":
+ if confidence == "STRONG":
+ color = YELLOW
+ indicator = "DOG**"
+ else:
+ color = ""
+ indicator = "DOG*"
+ else:
+ color = ""
+ indicator = "PASS"
+
+ # Format model spread display
+ if model_spread > 0:
+ model_display = f"{favorite[:12]:12} -{model_spread:4.1f}"
+ else:
+ model_display = f"{underdog[:12]:12} -{abs(model_spread):4.1f}"
+
+ return (
+ f" Spread: {favorite[:18]:18} -{market_spread:4.1f} "
+ f"Model: {color}{model_display:20}{RESET} "
+ f"({color}{indicator:6}{RESET}) "
+ f"Edge: {format_edge(edge)}"
+ )
+
+
+def format_total_comparison(market_total: float, over_prob: float, edge: float) -> str:
+ """Format total comparison line.
+
+ Args:
+ market_total: Market total
+ over_prob: Probability of going over
+ edge: Model edge
+
+ Returns:
+ Formatted comparison string
+ """
+ # Calculate model's estimated fair total
+ model_total = estimate_model_total(market_total, over_prob)
+
+ # Determine model's pick
+ pick_side, confidence = get_total_pick(over_prob)
+
+ if pick_side == "OVER":
+ if confidence == "STRONG":
+ color = GREEN
+ indicator = "OVR**"
+ else:
+ color = CYAN
+ indicator = "OVR*"
+ elif pick_side == "UNDER":
+ if confidence == "STRONG":
+ color = YELLOW
+ indicator = "UND**"
+ else:
+ color = ""
+ indicator = "UND*"
+ else:
+ color = ""
+ indicator = "PASS"
+
+ return (
+ f" Total: {market_total:5.1f} "
+ f"Model: {color}{model_total:5.1f}{RESET} "
+ f"({color}{indicator:6}{RESET}) "
+ f"Edge: {format_edge(edge)}"
+ )
+
+
+def generate_betting_card(
+ target_date: date,
+ odds_dir: Path,
+ predictions_file: Path,
+ output_file: Path | None = None,
+) -> str:
+ """Generate betting card for target date.
+
+ Args:
+ target_date: Date for betting card
+ odds_dir: Directory containing odds files
+ predictions_file: Path to predictions CSV
+ output_file: Optional output file path
+
+ Returns:
+ Formatted betting card as string
+ """
+ # Load data
+ date_str = target_date.isoformat()
+ spreads_path = odds_dir / f"{date_str}_spreads.parquet"
+ totals_path = odds_dir / f"{date_str}_totals.parquet"
+
+ spreads = read_parquet_df(str(spreads_path))
+ totals = read_parquet_df(str(totals_path))
+ predictions = pd.read_csv(predictions_file)
+
+ # Filter for FanDuel (canonical bookmaker)
+ fanduel_spreads = spreads[spreads["bookmaker_key"] == "fanduel"].copy()
+ fanduel_totals = totals[totals["bookmaker_key"] == "fanduel"].copy()
+
+ # Sort games by time
+ predictions["commence_dt"] = pd.to_datetime(predictions["commence_time"])
+ predictions = predictions.sort_values("commence_dt")
+
+ # Build betting card
+ lines = []
+ lines.append("")
+ lines.append("=" * 100)
+ lines.append(f"{BOLD}DAILY BETTING CARD - {target_date.strftime('%A, %B %d, %Y')}{RESET}")
+ lines.append("=" * 100)
+ lines.append("")
+ lines.append(
+ "Model Predictions vs Market Lines (FanDuel) | "
+        "** = Strong Pick (>10% edge)   * = Lean (0-10% edge)"
+ )
+ lines.append("-" * 100)
+ lines.append("")
+
+ # Track best opportunities
+ best_spread_edges = []
+ best_total_edges = []
+
+ for _, game in predictions.iterrows():
+ event_id = game["event_id"]
+ game_time = game["commence_dt"].strftime("%I:%M %p ET")
+
+ # Get market lines
+ game_spread = fanduel_spreads[fanduel_spreads["event_id"] == event_id]
+ game_total = fanduel_totals[fanduel_totals["event_id"] == event_id]
+
+ if len(game_spread) == 0 or len(game_total) == 0:
+ continue
+
+ market_spread = game_spread.iloc[0]["spread_magnitude"]
+ market_total = game_total.iloc[0]["total"]
+
+ # Format game header
+ lines.append(
+ f"{BOLD}{game['away_team']}{RESET} @ {BOLD}{game['home_team']}{RESET} ({game_time})"
+ )
+
+ # Format spread comparison
+ spread_line = format_spread_comparison(
+ game["favorite_team"],
+ game["underdog_team"],
+ market_spread,
+ game["favorite_cover_prob"],
+ game["spread_edge"],
+ )
+ lines.append(spread_line)
+
+ # Format total comparison
+ total_line = format_total_comparison(market_total, game["over_prob"], game["total_edge"])
+ lines.append(total_line)
+
+ lines.append("")
+
+ # Track best edges
+ if abs(game["spread_edge"]) >= 0.10:
+ best_spread_edges.append(
+ {
+ "game": f"{game['away_team']} @ {game['home_team']}",
+ "time": game_time,
+                    "pick": (
+                        f"{game['favorite_team']} -{market_spread}"
+                        if game["spread_edge"] > 0
+                        else f"{game['underdog_team']} +{market_spread}"
+                    ),
+ "edge": game["spread_edge"],
+ "prob": game["favorite_cover_prob"]
+ if game["spread_edge"] > 0
+ else 1 - game["favorite_cover_prob"],
+ }
+ )
+
+ if abs(game["total_edge"]) >= 0.10:
+ best_total_edges.append(
+ {
+ "game": f"{game['away_team']} @ {game['home_team']}",
+ "time": game_time,
+ "pick": (
+ f"Over {market_total}"
+ if game["total_edge"] > 0
+ else f"Under {market_total}"
+ ),
+ "edge": game["total_edge"],
+ "prob": game["over_prob"] if game["total_edge"] > 0 else 1 - game["over_prob"],
+ }
+ )
+
+ # Add best opportunities section
+ lines.append("=" * 100)
+ lines.append(f"{BOLD}TOP OPPORTUNITIES (Edge >= 10%){RESET}")
+ lines.append("=" * 100)
+ lines.append("")
+
+ if best_spread_edges:
+ lines.append(f"{BOLD}SPREAD PLAYS:{RESET}")
+ for opp in sorted(best_spread_edges, key=lambda x: abs(x["edge"]), reverse=True)[:5]:
+ lines.append(
+ f" {opp['pick']:35} ({opp['time']:12}) "
+ f"Prob: {opp['prob']:5.1%} Edge: {format_edge(opp['edge'], 0.0)}"
+ )
+ lines.append("")
+
+ if best_total_edges:
+ lines.append(f"{BOLD}TOTAL PLAYS:{RESET}")
+ for opp in sorted(best_total_edges, key=lambda x: abs(x["edge"]), reverse=True)[:5]:
+ lines.append(
+ f" {opp['pick']:35} ({opp['time']:12}) "
+ f"Prob: {opp['prob']:5.1%} Edge: {format_edge(opp['edge'], 0.0)}"
+ )
+ lines.append("")
+
+ lines.append("=" * 100)
+ lines.append(
+ f"Total Games: {len(predictions)} | "
+ f"Opportunities: {len(best_spread_edges)} spreads, {len(best_total_edges)} totals"
+ )
+ lines.append("=" * 100)
+ lines.append("")
+
+ # Join all lines
+ card = "\n".join(lines)
+
+ # Save to file if requested
+ if output_file:
+ output_file.parent.mkdir(parents=True, exist_ok=True)
+ # Strip ANSI codes for file output
+ import re
+
+ ansi_escape = re.compile(r"\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])")
+ clean_card = ansi_escape.sub("", card)
+ output_file.write_text(clean_card)
+ print(f"Betting card saved to: {output_file}")
+
+ return card
+
+
+def main() -> None:
+ """Generate and display betting card."""
+ parser = argparse.ArgumentParser(description="Generate daily betting card")
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=None,
+ help="Target date (YYYY-MM-DD, default: today)",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=None,
+ help="Output file path (default: display only)",
+ )
+ parser.add_argument(
+ "--odds-dir",
+ type=Path,
+ default=Path("data/odds_api/daily"),
+ help="Directory containing odds files",
+ )
+ parser.add_argument(
+ "--predictions-dir",
+ type=Path,
+ default=Path("predictions"),
+ help="Directory containing predictions",
+ )
+
+ args = parser.parse_args()
+
+ # Determine target date
+ target_date = date.fromisoformat(args.date) if args.date else date.today()
+
+ # Find predictions file
+ predictions_file = args.predictions_dir / f"{target_date.isoformat()}.csv"
+ if not predictions_file.exists():
+ # Try the -fixed version
+ predictions_file = args.predictions_dir / f"{target_date.isoformat()}-fixed.csv"
+
+ if not predictions_file.exists():
+ print(f"Error: Predictions file not found: {predictions_file}")
+ print(f"Run: uv run python scripts/predict_today.py --date {target_date.isoformat()}")
+ raise SystemExit(1)
+
+ # Generate betting card
+ card = generate_betting_card(target_date, args.odds_dir, predictions_file, args.output)
+
+ # Display card
+ print(card)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/prediction/generate_daily_predictions.py b/scripts/prediction/generate_daily_predictions.py
new file mode 100644
index 000000000..6747b2663
--- /dev/null
+++ b/scripts/prediction/generate_daily_predictions.py
@@ -0,0 +1,710 @@
+"""Generate daily predictions for CBB games.
+
+Uses score models for spread predictions and residual model for totals.
+Falls back to score-derived totals when residual model is unavailable.
+
+Output format:
+- home_team, away_team, favorite_team
+- spread_magnitude, total_points
+- predicted_home_score, predicted_away_score
+- predicted_margin, predicted_total
+- favorite_cover_prob, underdog_cover_prob
+- over_prob, under_prob
+- totals_method (residual | score_derived)
+
+Usage:
+ uv run python scripts/prediction/generate_daily_predictions.py
+ uv run python scripts/prediction/generate_daily_predictions.py --date 2026-02-08
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import json
+import logging
+import pickle
+from datetime import date, timedelta
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from scipy import stats
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_csv
+from sports_betting_edge.adapters.kenpom import KenPomAdapter
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.settings import settings
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+# D1 average constants for expected points formula
+DI_AVG_EFF = 109.15 # D1 avg offensive/defensive efficiency (per 100 poss)
+DI_AVG_TEMPO = 67.34 # D1 avg possessions per game
+DEFAULT_HCA_PTS = 3.2 # Fallback HCA when per-team data unavailable
+
+# Score model uncertainty (for spread probabilities)
+COMBINED_STDDEV = 7.6  # combined stddev of home/away score-model errors
+
+
+async def fetch_fanmatch_for_date(
+ target_date: date,
+ games: pd.DataFrame,
+ staging_path: Path,
+) -> pd.DataFrame:
+ """Fetch KenPom FanMatch predictions and match to today's games.
+
+ Uses team_mapping.parquet for reliable name matching (not fuzzy).
+
+ Args:
+ target_date: Date to fetch FanMatch for
+ games: Games DataFrame with home_team, away_team columns
+ staging_path: Path to staging directory (for team mapping)
+
+ Returns:
+ Games DataFrame with FanMatch columns added
+ """
+ logger.info(f"Fetching KenPom FanMatch predictions for {target_date}...")
+
+ # Load team mapping (odds_api_name -> kenpom_name for reverse lookup)
+ mapping_path = staging_path / "mappings" / "team_mapping.parquet"
+ if not mapping_path.exists():
+ logger.warning(" Team mapping not found - skipping FanMatch")
+ return games
+
+ team_mapping = read_parquet_df(str(mapping_path))
+ odds_to_kp = dict(
+ zip(
+ team_mapping["odds_api_name"],
+ team_mapping["kenpom_name"],
+ strict=False,
+ )
+ )
+
+ # Fetch from KenPom API for both target date and previous day.
+ # The Odds API uses UTC commence_time, while KenPom uses Eastern Time dates.
+ # Evening ET games (e.g. 7 PM ET = midnight UTC) land on the next UTC date,
+ # so we fetch both days and merge to cover the boundary.
+ kenpom = KenPomAdapter()
+ try:
+ fanmatch_games: list[dict[str, Any]] = []
+ prev_date = target_date - timedelta(days=1)
+ for d in [prev_date, target_date]:
+ try:
+ batch = await kenpom.get_fanmatch(d.isoformat())
+ logger.info(f" FanMatch {d}: {len(batch)} games")
+ fanmatch_games.extend(batch)
+ except Exception as e:
+ logger.debug(f" FanMatch {d} unavailable: {e}")
+ logger.info(f" Received {len(fanmatch_games)} total FanMatch predictions")
+ except Exception as e:
+ logger.warning(f" Failed to fetch FanMatch: {e} - continuing without")
+ return games
+ finally:
+ await kenpom.close()
+
+ if len(fanmatch_games) == 0:
+ logger.info(" No FanMatch games returned")
+ return games
+
+ # Build lookup: (kenpom_home, kenpom_away) -> prediction
+ fm_lookup: dict[tuple[str, str], dict] = {}
+ for game in fanmatch_games:
+ home = game.get("Home")
+ visitor = game.get("Visitor")
+ home_pred = game.get("HomePred")
+ visitor_pred = game.get("VisitorPred")
+
+ if home and visitor and home_pred is not None and visitor_pred is not None:
+ fm_lookup[(home, visitor)] = {
+ "kp_predicted_margin": home_pred - visitor_pred,
+ "kp_predicted_total": home_pred + visitor_pred,
+ "kp_home_wp": game.get("HomeWP"),
+ }
+
+ # Match games using team mapping
+ matched_count = 0
+ kp_margins = []
+ kp_totals = []
+ kp_wps = []
+
+ for _, row in games.iterrows():
+ home_kp = odds_to_kp.get(row["home_team"])
+ away_kp = odds_to_kp.get(row["away_team"])
+
+ if home_kp and away_kp and (home_kp, away_kp) in fm_lookup:
+ fm = fm_lookup[(home_kp, away_kp)]
+ kp_margins.append(fm["kp_predicted_margin"])
+ kp_totals.append(fm["kp_predicted_total"])
+ kp_wps.append(fm["kp_home_wp"])
+ matched_count += 1
+ else:
+ kp_margins.append(None)
+ kp_totals.append(None)
+ kp_wps.append(None)
+
+ games["kp_predicted_margin"] = pd.array(kp_margins, dtype=pd.Float64Dtype())
+ games["kp_predicted_total"] = pd.array(kp_totals, dtype=pd.Float64Dtype())
+ games["kp_home_wp"] = pd.array(kp_wps, dtype=pd.Float64Dtype())
+
+ logger.info(f" Matched {matched_count}/{len(games)} games with FanMatch")
+ return games
+
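The two-day fetch above exists because of the UTC/Eastern boundary; a minimal sketch of the mismatch (requires Python 3.9+ `zoneinfo`; the date is illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A 7:00 PM ET tip on Feb 7 has a UTC commence_time of midnight Feb 8
tip_utc = datetime(2026, 2, 8, 0, 0, tzinfo=timezone.utc)
tip_et = tip_utc.astimezone(ZoneInfo("America/New_York"))

print(tip_utc.date())  # 2026-02-08 <- date the Odds API reports
print(tip_et.date())   # 2026-02-07 <- date KenPom FanMatch uses
```

Fetching both the target date and the previous day covers every game that straddles this boundary.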
+
+def load_score_models(models_dir: Path) -> tuple:
+ """Load trained score prediction models."""
+ home_path = models_dir / "home_score_2026.pkl"
+ away_path = models_dir / "away_score_2026.pkl"
+
+ if not home_path.exists() or not away_path.exists():
+ raise FileNotFoundError(f"Score models not found in {models_dir}")
+
+ with open(home_path, "rb") as f:
+ home_model = pickle.load(f)
+ with open(away_path, "rb") as f:
+ away_model = pickle.load(f)
+
+ logger.info(f"Loaded score models from {models_dir}")
+ return home_model, away_model
+
+
+def load_residual_model(models_dir: Path) -> tuple | None:
+ """Load totals residual model and metadata.
+
+ Returns:
+ Tuple of (model, feature_list, residual_std) or None if not available
+ """
+ model_path = models_dir / "totals_residual_2026.pkl"
+ features_path = models_dir / "totals_residual_features.txt"
+ metadata_path = models_dir / "totals_residual_metadata.json"
+
+ if not model_path.exists() or not features_path.exists():
+ logger.warning("Residual model not found, will use score-derived totals")
+ return None
+
+ with open(model_path, "rb") as f:
+ model = pickle.load(f)
+
+ with open(features_path) as f:
+ feature_list = [line.strip() for line in f if line.strip()]
+
+ # Load residual_std from metadata
+ residual_std = 11.82 # Default fallback
+ if metadata_path.exists():
+ with open(metadata_path) as f:
+ metadata = json.load(f)
+ residual_std = metadata.get("residual_std", residual_std)
+
+ logger.info(f"Loaded residual model ({len(feature_list)} features, std={residual_std:.2f})")
+ return model, feature_list, residual_std
+
+
+def get_games_from_daily_parquet(
+ target_date: date,
+ daily_dir: Path,
+ preferred_book: str = "fanduel",
+) -> pd.DataFrame | None:
+ """Load today's games from daily parquet snapshots.
+
+ Falls back to consensus across books if preferred book unavailable.
+ Returns None if daily parquet files don't exist.
+ """
+ date_str = target_date.isoformat()
+ spreads_path = daily_dir / f"{date_str}_spreads.parquet"
+ totals_path = daily_dir / f"{date_str}_totals.parquet"
+
+ if not spreads_path.exists() or not totals_path.exists():
+ return None
+
+ spreads = pd.read_parquet(spreads_path)
+ totals = pd.read_parquet(totals_path)
+
+ if len(spreads) == 0 or len(totals) == 0:
+ return None
+
+ # Try preferred book first, fall back to first available
+ book_col = "bookmaker_key"
+ if preferred_book in spreads[book_col].values:
+ sp = spreads[spreads[book_col] == preferred_book]
+ tot = totals[totals[book_col] == preferred_book]
+ else:
+        # Fall back to the first bookmaker present in the snapshot
+ first_book = spreads[book_col].iloc[0]
+ logger.info(f" {preferred_book} not in daily parquet, using {first_book}")
+ sp = spreads[spreads[book_col] == first_book]
+ tot = totals[totals[book_col] == first_book]
+
+ # Deduplicate: one row per event
+ sp = sp.drop_duplicates(subset=["event_id"], keep="first")
+ tot = tot.drop_duplicates(subset=["event_id"], keep="first")
+
+ # Build games DataFrame matching the schema expected downstream
+ games = sp[["event_id", "home_team", "away_team", "commence_time"]].copy()
+ games["favorite_team"] = sp["favorite_team"].values
+ games["underdog_team"] = sp["underdog_team"].values
+ games["spread_magnitude"] = sp["spread_magnitude"].values
+
+ # Merge totals
+ tot_cols = tot[["event_id"]].copy()
+ total_col = "total" if "total" in tot.columns else "total_points"
+ tot_cols["total_points"] = tot[total_col].values
+ games = games.merge(tot_cols, on="event_id", how="inner")
+
+ # Opening total = closing total for single-snapshot data
+ games["opening_total"] = games["total_points"]
+
+ logger.info(f"Loaded {len(games)} games from daily parquet ({date_str})")
+ return games
+
+
+def get_todays_games(db: OddsAPIDatabase, target_date: date) -> pd.DataFrame:
+ """Get today's games with consensus odds from streaming DB."""
+ date_str = target_date.isoformat()
+
+ query = f"""
+ WITH spread_odds AS (
+ SELECT
+ o.event_id,
+ o.outcome_name as favorite_team,
+ CASE
+ WHEN o.outcome_name = e.home_team THEN e.away_team
+ ELSE e.home_team
+ END as underdog_team,
+ ABS(o.point) as spread_magnitude,
+ ROW_NUMBER() OVER (
+ PARTITION BY o.event_id
+ ORDER BY o.book_last_update DESC
+ ) as rn
+ FROM observations o
+ JOIN events e ON o.event_id = e.event_id
+ WHERE o.market_key = 'spreads'
+ AND o.book_key = 'fanduel'
+ AND DATE(e.commence_time) = '{date_str}'
+ AND o.point < 0
+ ),
+ total_odds AS (
+ SELECT
+ o.event_id,
+ o.point as total_points,
+ ROW_NUMBER() OVER (
+ PARTITION BY o.event_id
+ ORDER BY o.book_last_update DESC
+ ) as rn
+ FROM observations o
+ JOIN events e ON o.event_id = e.event_id
+ WHERE o.market_key = 'totals'
+ AND o.book_key = 'fanduel'
+ AND o.outcome_name = 'Over'
+ AND DATE(e.commence_time) = '{date_str}'
+ AND o.point IS NOT NULL
+ ),
+ opening_totals AS (
+ SELECT
+ o.event_id,
+ o.point as opening_total,
+ ROW_NUMBER() OVER (
+ PARTITION BY o.event_id
+ ORDER BY o.book_last_update ASC
+ ) as rn
+ FROM observations o
+ JOIN events e ON o.event_id = e.event_id
+ WHERE o.market_key = 'totals'
+ AND o.book_key = 'fanduel'
+ AND o.outcome_name = 'Over'
+ AND DATE(e.commence_time) = '{date_str}'
+ AND o.point IS NOT NULL
+ )
+ SELECT DISTINCT
+ e.event_id,
+ e.home_team,
+ e.away_team,
+ e.commence_time,
+ s.favorite_team,
+ s.underdog_team,
+ s.spread_magnitude,
+ t.total_points,
+ ot.opening_total
+ FROM events e
+ JOIN spread_odds s ON e.event_id = s.event_id AND s.rn = 1
+ JOIN total_odds t ON e.event_id = t.event_id AND t.rn = 1
+ LEFT JOIN opening_totals ot ON e.event_id = ot.event_id AND ot.rn = 1
+ WHERE DATE(e.commence_time) = '{date_str}'
+ ORDER BY e.commence_time
+ """
+
+ games = pd.read_sql_query(query, db.conn)
+ logger.info(f"Found {len(games)} games with complete odds for {date_str}")
+ return games
+
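The CTEs above all use the standard ROW_NUMBER-per-partition trick to pick the latest (or earliest) snapshot per event. A self-contained illustration with toy data (requires the window-function support bundled with modern Python's SQLite, 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE observations(event_id TEXT, point REAL, book_last_update TEXT);
INSERT INTO observations VALUES
  ('e1', -3.5, '2026-02-08T10:00'),
  ('e1', -4.0, '2026-02-08T15:00'),
  ('e2', -7.0, '2026-02-08T12:00');
""")

# Keep only the most recent row per event_id
rows = con.execute("""
SELECT event_id, point FROM (
  SELECT event_id, point,
         ROW_NUMBER() OVER (PARTITION BY event_id
                            ORDER BY book_last_update DESC) AS rn
  FROM observations
) WHERE rn = 1
ORDER BY event_id
""").fetchall()
print(rows)  # [('e1', -4.0), ('e2', -7.0)]
```

Ordering the window `ASC` instead of `DESC` gives the opening line, which is how `opening_totals` works.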
+
+def enrich_with_features(games: pd.DataFrame, staging_path: Path) -> pd.DataFrame:
+ """Enrich games with features from staging layer."""
+ events = read_parquet_df(str(staging_path / "events.parquet"))
+ team_ratings = read_parquet_df(str(staging_path / "team_ratings.parquet"))
+
+ # Merge with events for rest/travel features
+ games = games.merge(
+ events[
+ [
+ "event_id",
+ "home_rest_days",
+ "away_rest_days",
+ "home_back_to_back",
+ "away_back_to_back",
+ "home_short_rest",
+ "away_short_rest",
+ "away_road_streak",
+ "away_days_on_road",
+ ]
+ ],
+ on="event_id",
+ how="left",
+ )
+
+ # Map team names to KenPom features via odds_api_name column
+ team_ratings_home = team_ratings.copy()
+ team_ratings_home.columns = [
+ f"home_{col}" if col != "odds_api_name" else col for col in team_ratings_home.columns
+ ]
+
+ team_ratings_away = team_ratings.copy()
+ team_ratings_away.columns = [
+ f"away_{col}" if col != "odds_api_name" else col for col in team_ratings_away.columns
+ ]
+
+ games = games.merge(
+ team_ratings_home,
+ left_on="home_team",
+ right_on="odds_api_name",
+ how="left",
+ )
+ games = games.merge(
+ team_ratings_away,
+ left_on="away_team",
+ right_on="odds_api_name",
+ how="left",
+ )
+
+ # Calculate derived features (superset for both score and totals models)
+ games["total_offense"] = games["home_adj_o"] + games["away_adj_o"]
+ games["avg_tempo"] = (games["home_adj_t"] + games["away_adj_t"]) / 2
+ games["avg_luck"] = (games["home_luck"] + games["away_luck"]) / 2
+ games["height_diff"] = games["home_height_eff"] - games["away_height_eff"]
+ games["avg_defense"] = (games["home_adj_d"] + games["away_adj_d"]) / 2
+
+ # Rename height_eff to height for consistency with model features
+ games = games.rename(
+ columns={"home_height_eff": "home_height", "away_height_eff": "away_height"}
+ )
+
+ # Expected points (corrected KenPom formula with per-team HCA)
+ game_tempo = (games["home_adj_t"] * games["away_adj_t"]) / DI_AVG_TEMPO
+ # Per-team HCA or league average fallback
+ if "home_hca_pts" in games.columns:
+ home_hca = games["home_hca_pts"].fillna(DEFAULT_HCA_PTS)
+ else:
+ home_hca = DEFAULT_HCA_PTS
+ games["home_expected_pts"] = (games["home_adj_o"] * games["away_adj_d"] / DI_AVG_EFF) * (
+ game_tempo / 100
+ ) + (home_hca / 2)
+ games["away_expected_pts"] = (games["away_adj_o"] * games["home_adj_d"] / DI_AVG_EFF) * (
+ game_tempo / 100
+ ) - (home_hca / 2)
+ games["expected_total"] = games["home_expected_pts"] + games["away_expected_pts"]
+    # Expose HCA as an explicit feature (a home_hca column already present
+    # from the team_ratings merge takes precedence)
+    if "home_hca" not in games.columns and "home_hca_pts" in games.columns:
+        games["home_hca"] = games["home_hca_pts"].fillna(0.0)
+
+ # Differentials (full set for totals model; score model selects its subset via features.txt)
+ games["adj_em_diff"] = games["home_adj_em"] - games["away_adj_em"]
+ games["pythag_diff"] = games["home_pythag"] - games["away_pythag"]
+ games["adj_o_diff"] = games["home_adj_o"] - games["away_adj_o"]
+ games["adj_d_diff"] = games["home_adj_d"] - games["away_adj_d"]
+ games["adj_t_diff"] = games["home_adj_t"] - games["away_adj_t"]
+ games["efg_pct_diff"] = games["home_efg_pct"] - games["away_efg_pct"]
+ games["to_pct_diff"] = games["home_to_pct"] - games["away_to_pct"]
+ games["or_pct_diff"] = games["home_or_pct"] - games["away_or_pct"]
+ games["ft_rate_diff"] = games["home_ft_rate"] - games["away_ft_rate"]
+ games["sos_diff"] = games["home_sos"] - games["away_sos"]
+ games["luck_diff"] = games["home_luck"] - games["away_luck"]
+
+ # Market features for residual model and score model compatibility
+ games["closing_total"] = games["total_points"]
+ games["total_movement"] = games["total_points"] - games["opening_total"]
+ games["kenpom_market_diff"] = games["expected_total"] - games["total_points"]
+
+ # FanMatch-derived features (if FanMatch data was merged)
+ if "kp_predicted_margin" in games.columns:
+ # Market margin in home perspective for comparison
+ is_home_fav = games["home_team"] == games["favorite_team"]
+ market_home_margin = games["spread_magnitude"].where(
+ is_home_fav, -games["spread_magnitude"]
+ )
+ games["kp_market_margin_diff"] = games["kp_predicted_margin"] - market_home_margin
+ games["kp_market_total_diff"] = games["kp_predicted_total"] - games["total_points"]
+
+ # Fill missing values with defaults BEFORE rest calculations
+ games = games.fillna(
+ {
+ "home_rest_days": 3,
+ "away_rest_days": 3,
+ "home_back_to_back": False,
+ "away_back_to_back": False,
+ "home_short_rest": False,
+ "away_short_rest": False,
+ "away_road_streak": 0,
+ "away_days_on_road": 0,
+ "opening_total": games["total_points"],
+ "total_movement": 0,
+ }
+ )
+
+ # Rest features (calculate after filling NaNs)
+ games["rest_advantage"] = games["home_rest_days"] - games["away_rest_days"]
+ games["total_back_to_back"] = games["home_back_to_back"].astype(int) + games[
+ "away_back_to_back"
+ ].astype(int)
+ games["total_short_rest"] = games["home_short_rest"].astype(int) + games[
+ "away_short_rest"
+ ].astype(int)
+
+ logger.info(f"Enriched {len(games)} games with staging features")
+ return games
+
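The expected-points computation above can be condensed into a standalone function for the home side (constants copied from this module; the team numbers in the example are illustrative, not real ratings):

```python
D1_AVG_EFF = 109.15   # D1 average efficiency (points per 100 possessions)
D1_AVG_TEMPO = 67.34  # D1 average possessions per game

def expected_home_points(home_adj_o: float, away_adj_d: float,
                         home_adj_t: float, away_adj_t: float,
                         hca_pts: float = 3.2) -> float:
    # Matchup-adjusted tempo, offense scaled by opponent defense,
    # converted to this game's possessions, plus half the home-court edge
    game_tempo = home_adj_t * away_adj_t / D1_AVG_TEMPO
    return home_adj_o * away_adj_d / D1_AVG_EFF * game_tempo / 100 + hca_pts / 2

print(round(expected_home_points(115.0, 105.0, 68.0, 66.0), 1))  # 75.3
```

The away side mirrors this with the roles swapped and the HCA half subtracted, so the HCA term cancels in the predicted total.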
+
+def make_predictions(
+ games: pd.DataFrame,
+ home_model: object,
+ away_model: object,
+ score_features: list[str],
+ residual_model: object | None = None,
+ residual_features: list[str] | None = None,
+ residual_std: float = 11.82,
+) -> pd.DataFrame:
+ """Generate predictions using score models (spreads) and residual model (totals).
+
+ Args:
+ games: Enriched games DataFrame
+ home_model: Trained home score model
+ away_model: Trained away score model
+ score_features: Feature list for score models
+ residual_model: Optional trained residual model for totals
+ residual_features: Feature list for residual model
+ residual_std: Standard deviation of residuals for probability calc
+ """
+ # Score predictions (used for spreads, and fallback for totals)
+ X_score = games[score_features].fillna(0.0)
+ home_scores = home_model.predict(X_score)
+ away_scores = away_model.predict(X_score)
+
+ # Build results
+ results = games[
+ [
+ "home_team",
+ "away_team",
+ "favorite_team",
+ "spread_magnitude",
+ "total_points",
+ ]
+ ].copy()
+
+ results["predicted_home_score"] = home_scores.round(1)
+ results["predicted_away_score"] = away_scores.round(1)
+ results["predicted_margin"] = (home_scores - away_scores).round(1)
+
+ # === SPREAD PROBABILITIES (from score models) ===
+ effective_margins = []
+ for _, row in results.iterrows():
+ margin = row["predicted_margin"]
+ if row["favorite_team"] == row["home_team"]:
+ effective_margins.append(margin)
+ else:
+ effective_margins.append(-margin)
+
+ effective_margins_s = pd.Series(effective_margins)
+ spread_cushions = effective_margins_s - results["spread_magnitude"]
+ z_scores = spread_cushions / COMBINED_STDDEV
+ results["favorite_cover_prob"] = z_scores.apply(stats.norm.cdf).apply(lambda x: f"{x:.0%}")
+ results["underdog_cover_prob"] = (1 - z_scores.apply(stats.norm.cdf)).apply(
+ lambda x: f"{x:.0%}"
+ )
+
+ # Always compute score-derived total for monitoring
+ score_totals = home_scores + away_scores
+ results["score_derived_total"] = score_totals.round(1)
+
+ # === TOTAL PROBABILITIES ===
+ if residual_model is not None and residual_features is not None:
+ # Residual model: predicted_total = closing_total + residual
+ X_residual = games[residual_features].fillna(0.0)
+ predicted_residuals = residual_model.predict(X_residual)
+
+ results["predicted_total"] = (games["total_points"] + predicted_residuals).round(1)
+
+ # Over probability: P(actual > closing) = P(residual > 0)
+ # We use the predicted residual as signal, scaled by std
+ z_totals = predicted_residuals / residual_std
+ over_probs = pd.Series(z_totals).apply(stats.norm.cdf)
+
+ results["over_prob"] = over_probs.apply(lambda x: f"{x:.0%}")
+ results["under_prob"] = (1 - over_probs).apply(lambda x: f"{x:.0%}")
+ results["totals_method"] = "residual"
+
+ logger.info(
+ f"Residual model: avg predicted residual = "
+ f"{predicted_residuals.mean():.2f}, "
+ f"avg over_prob = {over_probs.mean():.1%}"
+ )
+ else:
+ # Fallback: score-derived totals
+ results["predicted_total"] = score_totals.round(1)
+
+ total_cushions = score_totals - results["total_points"]
+ z_totals = total_cushions / COMBINED_STDDEV
+ results["over_prob"] = z_totals.apply(stats.norm.cdf).apply(lambda x: f"{x:.0%}")
+ results["under_prob"] = (1 - z_totals.apply(stats.norm.cdf)).apply(lambda x: f"{x:.0%}")
+ results["totals_method"] = "score_derived"
+
+ logger.info("Using score-derived totals (no residual model)")
+
+ # Disconnect between score models and final total prediction
+ results["total_disconnect"] = (
+ (results["score_derived_total"] - results["predicted_total"]).abs().round(1)
+ )
+
+ # Warn about large disconnects
+ large_disconnect = results[results["total_disconnect"] > 10]
+ if len(large_disconnect) > 0:
+ logger.warning(f"[WARNING] {len(large_disconnect)} games have total_disconnect > 10 pts:")
+ for _, row in large_disconnect.iterrows():
+ logger.warning(
+ f" {row['away_team']} @ {row['home_team']}: "
+ f"score_derived={row['score_derived_total']}, "
+ f"predicted={row['predicted_total']}, "
+ f"disconnect={row['total_disconnect']}"
+ )
+
+ return results
+
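The spread probabilities above follow a normal approximation: P(favorite covers) = Φ((predicted margin − spread) / σ). A scipy-free sketch using the error function, with σ taken from this module's COMBINED_STDDEV (the equivalence to `scipy.stats.norm.cdf` is exact):

```python
import math

COMBINED_STDDEV = 7.6  # combined score-model sigma from this module

def cover_prob(predicted_margin: float, spread_magnitude: float,
               sigma: float = COMBINED_STDDEV) -> float:
    """P(favorite covers) under a normal margin-error model."""
    z = (predicted_margin - spread_magnitude) / sigma
    # Standard normal CDF via erf: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A favorite projected to win by 8 while laying 5 covers ~65% of the time
print(round(cover_prob(8.0, 5.0), 2))  # 0.65
```

A zero cushion gives exactly 0.5, which is why games priced right at the model's margin land in the PASS band.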
+
+def main() -> None:
+ """Generate predictions for today's games."""
+ parser = argparse.ArgumentParser(description="Generate daily predictions")
+ parser.add_argument(
+ "--date",
+ type=str,
+ default=None,
+ help="Target date (YYYY-MM-DD, default: today)",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=settings.odds_api_db_path,
+ help="Odds database path",
+ )
+ parser.add_argument(
+ "--staging-path",
+ type=Path,
+ default=settings.staging_dir,
+ help="Staging data path",
+ )
+ parser.add_argument(
+ "--models-dir",
+ type=Path,
+ default=settings.models_dir,
+ help="Models directory",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=None,
+ help="Output CSV path",
+ )
+
+ args = parser.parse_args()
+
+ # Determine target date
+ target_date = date.fromisoformat(args.date) if args.date else date.today()
+ logger.info(f"Generating predictions for {target_date}")
+
+ # Determine output path
+ if args.output:
+ output_path = args.output
+ else:
+ output_dir = settings.predictions_dir
+ output_dir.mkdir(parents=True, exist_ok=True)
+ output_path = output_dir / f"{target_date.isoformat()}.csv"
+
+ try:
+ # Load score models (always needed for spreads)
+ home_model, away_model = load_score_models(args.models_dir)
+
+ # Load score feature list
+ features_path = args.models_dir / "score_features.txt"
+ with open(features_path) as f:
+ score_features = [line.strip() for line in f if line.strip()]
+ logger.info(f"Loaded {len(score_features)} score features")
+
+ # Load residual model (optional, for totals)
+ residual_result = load_residual_model(args.models_dir)
+ residual_model = None
+ residual_features = None
+ residual_std = 11.82
+ if residual_result is not None:
+ residual_model, residual_features, residual_std = residual_result
+
+ # Get today's games: try daily parquet first (fast), fall back to DB
+ daily_dir = settings.daily_odds_dir
+ games = get_games_from_daily_parquet(target_date, daily_dir)
+
+ if games is None or len(games) == 0:
+ logger.info("Daily parquet not available, querying streaming DB")
+ db = OddsAPIDatabase(args.db_path)
+ games = get_todays_games(db, target_date)
+ db.close()
+
+ if len(games) == 0:
+ logger.error(f"No games found for {target_date}")
+ return
+
+ # Fetch FanMatch predictions for today's games
+ games = asyncio.run(fetch_fanmatch_for_date(target_date, games, args.staging_path))
+
+ # Enrich with features
+ games = enrich_with_features(games, args.staging_path)
+
+ # Make predictions
+ predictions = make_predictions(
+ games,
+ home_model,
+ away_model,
+ score_features,
+ residual_model=residual_model,
+ residual_features=residual_features,
+ residual_std=residual_std,
+ )
+
+ # Save
+ write_csv(predictions, str(output_path), index=False)
+ logger.info(f"[OK] Saved {len(predictions)} predictions to {output_path}")
+
+ # Display summary
+ logger.info(f"\nPredictions for {target_date}:")
+ for _, game in predictions.iterrows():
+ logger.info(
+ f" {game['away_team']} @ {game['home_team']}: "
+ f"{game['predicted_away_score']}-{game['predicted_home_score']} "
+ f"(Spread: {game['favorite_team']} {game['favorite_cover_prob']}, "
+ f"Total: O{game['over_prob']} [{game['totals_method']}])"
+ )
+
+ except Exception as e:
+ logger.error(f"Prediction failed: {e}", exc_info=True)
+ raise
+
+
+if __name__ == "__main__":
+ main()
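The residual standard deviation loaded above (defaulting to 11.82) feeds the totals probability. As a hedged illustration of how such a value could turn a predicted total into an over probability — assuming model residuals are roughly normal, which the actual `make_predictions` implementation may handle differently — a minimal standalone sketch:

```python
from math import erf, sqrt

def over_probability(predicted_total: float, market_total: float,
                     residual_std: float = 11.82) -> float:
    """P(actual total > market total) under a Normal(0, residual_std)
    residual assumption: actual ~ predicted + residual."""
    z = (market_total - predicted_total) / residual_std
    # Normal survival function expressed via erf
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))
```

When the model and the market agree, the probability is exactly 0.5; a model total 10 points above the line with an 11.82-point residual std lands around 0.8.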
diff --git a/scripts/prediction/generate_todays_analysis.py b/scripts/prediction/generate_todays_analysis.py
new file mode 100644
index 000000000..6479f09d1
--- /dev/null
+++ b/scripts/prediction/generate_todays_analysis.py
@@ -0,0 +1,238 @@
+#!/usr/bin/env python3
+"""Generate today's betting analysis with verified KenPom data.
+
+Properly filters by season 2026 and validates all team matches.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+from rich.console import Console
+
+from sports_betting_edge.adapters.filesystem import write_csv
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.utils.team_matching import match_to_kenpom
+
+DB_PATH = Path("data/odds_api/odds_api.sqlite3")
+OUTPUT_DIR = Path("data/analysis")
+SEASON = 2026
+
+console = Console()
+
+
+def generate_analysis() -> pd.DataFrame | None:
+ """Generate today's analysis with proper season filtering."""
+ today = datetime.now().date().isoformat()
+ console.print(f"\n[bold cyan]Generating Analysis for {today}[/bold cyan]\n")
+
+ db = OddsAPIDatabase(str(DB_PATH))
+
+ # Load KenPom ratings for current season ONLY
+ console.print(f"[1/4] Loading KenPom ratings (season {SEASON})...")
+ kenpom_query = f"""
+ SELECT team, adj_em, adj_o, adj_d, adj_t, rank
+ FROM kp_pomeroy_ratings
+ WHERE season = {SEASON}
+ ORDER BY team
+ """
+ kenpom_df = pd.read_sql_query(kenpom_query, db.conn)
+ console.print(f" [OK] Loaded {len(kenpom_df)} teams")
+
+ # Create KenPom lookup
+ kenpom_teams = kenpom_df["team"].tolist()
+ kenpom_lookup = kenpom_df.set_index("team").to_dict(orient="index")
+
+ # Load today's games from overtime.ag data
+ console.print("\n[2/4] Loading today's odds from overtime.ag...")
+
+ # First, check the most recent overtime data
+    # ROW_NUMBER keeps the latest row per game deterministically; GROUP BY over
+    # an ordered subquery is not guaranteed to pick the newest row in SQLite
+    overtime_query = """
+        WITH latest_lines AS (
+            SELECT *,
+                ROW_NUMBER() OVER (
+                    PARTITION BY game_id ORDER BY collected_at DESC
+                ) AS rn
+            FROM overtime_lines
+            WHERE DATE(game_time) = DATE('now')
+        )
+        SELECT
+            game_id,
+            game_time,
+            away_team,
+            home_team,
+            away_spread,
+            home_spread,
+            away_spread_juice,
+            home_spread_juice,
+            total,
+            over_juice,
+            under_juice
+        FROM latest_lines
+        WHERE rn = 1
+    """
+
+ try:
+ odds_df = pd.read_sql_query(overtime_query, db.conn)
+ console.print(f" [OK] Loaded {len(odds_df)} games from overtime.ag")
+ except Exception as e:
+ console.print(f" [ERROR] Failed to load from overtime_lines: {e}")
+ console.print(" Trying alternative source...")
+
+ # Fallback: try to load from observations table
+ odds_query = """
+ WITH latest_obs AS (
+ SELECT
+ e.event_id,
+ e.away_team,
+ e.home_team,
+ e.commence_time as game_time,
+ o.market_key,
+ o.outcome_name,
+ o.point,
+ o.price_american,
+ ROW_NUMBER() OVER (
+ PARTITION BY e.event_id, o.market_key, o.outcome_name
+ ORDER BY o.fetched_at DESC
+ ) as rn
+ FROM events e
+ JOIN observations o ON e.event_id = o.event_id
+ WHERE e.sport_key = 'basketball_ncaab'
+ AND DATE(e.commence_time) = DATE('now')
+ AND o.market_key IN ('spreads', 'totals')
+ )
+ SELECT DISTINCT
+ event_id as game_id,
+ game_time,
+ away_team,
+ home_team
+ FROM latest_obs
+ WHERE rn = 1
+ """
+ odds_df = pd.read_sql_query(odds_query, db.conn)
+ console.print(f" [OK] Loaded {len(odds_df)} games from observations table")
+
+ if len(odds_df) == 0:
+ console.print("[red][ERROR] No games found for today![/red]")
+ return None
+
+ # Match teams to KenPom
+ console.print("\n[3/4] Matching teams to KenPom...")
+ matched_games = []
+ failed_matches = []
+
+ for _, game in odds_df.iterrows():
+ away_team = game["away_team"]
+ home_team = game["home_team"]
+
+ # Match away team
+ away_kp = match_to_kenpom(away_team, kenpom_teams)
+ home_kp = match_to_kenpom(home_team, kenpom_teams)
+
+ if away_kp is None or home_kp is None:
+ failed_matches.append((away_team, home_team))
+ continue
+
+ # Get KenPom data
+ away_data = kenpom_lookup.get(away_kp, {})
+ home_data = kenpom_lookup.get(home_kp, {})
+
+ if not away_data or not home_data:
+ failed_matches.append((away_team, home_team))
+ continue
+
+ # Calculate KenPom margin (home perspective)
+ kenpom_margin = home_data.get("adj_em", 0) - away_data.get("adj_em", 0)
+
+ # Build result
+ result = {
+ "game_id": game.get("game_id", f"{away_team}@{home_team}"),
+ "game_time": game["game_time"],
+ "away_team": away_team,
+ "home_team": home_team,
+ "away_kenpom_name": away_kp,
+ "home_kenpom_name": home_kp,
+ # Away KenPom
+ "away_adjoe": away_data.get("adj_o"),
+ "away_adjde": away_data.get("adj_d"),
+ "away_adjem": away_data.get("adj_em"),
+ "away_tempo": away_data.get("adj_t"),
+ "away_rank": away_data.get("rank"),
+ # Home KenPom
+ "home_adjoe": home_data.get("adj_o"),
+ "home_adjde": home_data.get("adj_d"),
+ "home_adjem": home_data.get("adj_em"),
+ "home_tempo": home_data.get("adj_t"),
+ "home_rank": home_data.get("rank"),
+ # KenPom prediction
+ "kenpom_margin": kenpom_margin,
+ # Market lines
+ "home_spread": game.get("home_spread"),
+ "home_spread_juice": game.get("home_spread_juice"),
+ "away_spread": game.get("away_spread"),
+ "away_spread_juice": game.get("away_spread_juice"),
+ "total": game.get("total"),
+ "over_juice": game.get("over_juice"),
+ "under_juice": game.get("under_juice"),
+ }
+
+ # Calculate edge (KenPom margin vs market spread)
+ if result["home_spread"] is not None:
+            market_margin = -result["home_spread"]  # negative home spread = home favored
+ discrepancy = kenpom_margin - market_margin
+ result["market_margin"] = market_margin
+ result["discrepancy"] = discrepancy
+ result["abs_discrepancy"] = abs(discrepancy)
+
+ matched_games.append(result)
+
+ console.print(f" [OK] Matched {len(matched_games)} games")
+
+ if failed_matches:
+ console.print(f" [WARN] Failed to match {len(failed_matches)} games:")
+ for away, home in failed_matches:
+ console.print(f" - {away} @ {home}")
+
+ # Create DataFrame
+ analysis_df = pd.DataFrame(matched_games)
+
+ # Sort by absolute discrepancy (largest edges first)
+ if "abs_discrepancy" in analysis_df.columns:
+ analysis_df = analysis_df.sort_values("abs_discrepancy", ascending=False)
+
+ # Save output
+ console.print("\n[4/4] Saving analysis...")
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+ output_path = OUTPUT_DIR / f"analysis_{today}_verified.csv"
+ write_csv(analysis_df, output_path, index=False)
+
+ console.print(f" [OK] Saved to {output_path}")
+
+ # Summary
+ console.print("\n" + "=" * 80)
+ console.print("[bold]SUMMARY[/bold]")
+ console.print("=" * 80)
+
+ console.print(f"\nTotal games analyzed: {len(analysis_df)}")
+
+    if "abs_discrepancy" in analysis_df.columns:
+        console.print(f"Games with KenPom edges: {analysis_df['abs_discrepancy'].notna().sum()}")
+        edges_df = analysis_df[analysis_df["abs_discrepancy"] >= 3.5]
+ console.print(f"Games with 3.5+ point edges: {len(edges_df)}")
+
+ if len(edges_df) > 0:
+ console.print("\n[bold]Top 5 Edges:[/bold]")
+ for _, game in edges_df.head(5).iterrows():
+ console.print(
+ f" {game['away_team']} @ {game['home_team']}: "
+ f"{game['abs_discrepancy']:.1f} pts (KenPom: {game['kenpom_margin']:+.1f}, "
+ f"Market: {game['home_spread']:+.1f})"
+ )
+
+ console.print("\n[bold green][OK] Analysis complete![/bold green]\n")
+
+ return analysis_df
+
+
+if __name__ == "__main__":
+ generate_analysis()
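The discrepancy computation in `generate_analysis` hinges on sign conventions: a sportsbook home spread is negative when the home team is favored, so its negation is the market-implied home margin. A minimal standalone sketch (the helper name here is illustrative, not part of the script):

```python
def spread_edge(kenpom_margin: float, home_spread: float) -> tuple[float, float]:
    """Return (market_margin, discrepancy) for a home-perspective rating margin.

    home_spread uses sportsbook convention: negative => home favored.
    A positive discrepancy means the ratings like the home side more
    than the market does.
    """
    market_margin = -home_spread
    discrepancy = kenpom_margin - market_margin
    return market_margin, discrepancy
```

For example, if the ratings say home by 7.2 and the market posts home -4.5, the market margin is 4.5 and the discrepancy is 2.7 points toward the home side.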
diff --git a/scripts/prediction/predict_with_kenpom.py b/scripts/prediction/predict_with_kenpom.py
new file mode 100644
index 000000000..fe6429068
--- /dev/null
+++ b/scripts/prediction/predict_with_kenpom.py
@@ -0,0 +1,452 @@
+"""Generate predictions with KenPom FanMatch data and score models.
+
+Combines:
+- Opening/closing odds from database
+- Probability predictions from trained models
+- KenPom FanMatch predicted scores and spreads
+- Score regression model predictions (if available)
+
+Usage:
+ python scripts/prediction/predict_with_kenpom.py
+ python scripts/prediction/predict_with_kenpom.py --output predictions/today.csv
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+from datetime import datetime
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.kenpom import KenPomAdapter
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+def extract_legacy_features(db: OddsAPIDatabase, events_df: pd.DataFrame) -> pd.DataFrame:
+ """Extract features matching legacy dataset format.
+
+ Args:
+ db: Database connection
+ events_df: DataFrame with event_id, home_team, away_team
+
+ Returns:
+ DataFrame with features for spreads and totals models
+ """
+ logger.info(f"Extracting features for {len(events_df)} events...")
+
+ features_list = []
+
+ for _, event in events_df.iterrows():
+ event_id = event["event_id"]
+ home_team = event["home_team"]
+ away_team = event["away_team"]
+
+ # Get spreads
+ spreads = db.get_canonical_spreads(event_id=event_id)
+ if len(spreads) == 0:
+ continue
+
+ # Get consensus opening and closing spreads
+ spreads_by_time = spreads.sort_values("as_of")
+ opening_spreads = spreads_by_time.head(10)
+ closing_spreads = spreads_by_time.tail(10)
+
+ opening_spread_magnitude = opening_spreads["spread_magnitude"].median()
+ closing_spread_magnitude = closing_spreads["spread_magnitude"].median()
+ opening_spread_range = (
+ opening_spreads["spread_magnitude"].max() - opening_spreads["spread_magnitude"].min()
+ )
+ closing_spread_range = (
+ closing_spreads["spread_magnitude"].max() - closing_spreads["spread_magnitude"].min()
+ )
+ num_books_spread = spreads["book_key"].nunique()
+
+ # Get favorite from most recent spreads
+ latest_spread = spreads_by_time.iloc[-1]
+ home_is_favorite = latest_spread["favorite_team"] == home_team
+
+ # Calculate implied probabilities (simple conversion)
+ if home_is_favorite:
+ closing_home_implied_prob = 0.5 + (closing_spread_magnitude / 100) * 0.5
+ opening_home_implied_prob = 0.5 + (opening_spread_magnitude / 100) * 0.5
+ else:
+ closing_home_implied_prob = 0.5 - (closing_spread_magnitude / 100) * 0.5
+ opening_home_implied_prob = 0.5 - (opening_spread_magnitude / 100) * 0.5
+
+ # Get totals
+ totals = db.get_canonical_totals(event_id=event_id)
+ if len(totals) > 0:
+ totals_by_time = totals.sort_values("as_of")
+ opening_totals = totals_by_time.head(10)
+ closing_totals = totals_by_time.tail(10)
+
+ opening_total = opening_totals["total"].median()
+ closing_total = closing_totals["total"].median()
+ opening_total_range = opening_totals["total"].max() - opening_totals["total"].min()
+ closing_total_range = closing_totals["total"].max() - closing_totals["total"].min()
+ num_books_total = totals["book_key"].nunique()
+ total_movement = closing_total - opening_total
+ else:
+ opening_total = None
+ closing_total = None
+ opening_total_range = None
+ closing_total_range = None
+ num_books_total = None
+ total_movement = None
+
+ features_list.append(
+ {
+ "event_id": event_id,
+ "home_team": home_team,
+ "away_team": away_team,
+ # Spreads features
+ "consensus_opening_spread_magnitude": opening_spread_magnitude,
+ "consensus_closing_spread_magnitude": closing_spread_magnitude,
+ "opening_spread_range": opening_spread_range,
+ "closing_spread_range": closing_spread_range,
+ "num_books_spread": num_books_spread,
+ "spread_magnitude_movement": abs(
+ closing_spread_magnitude - opening_spread_magnitude
+ ),
+ "opening_home_implied_prob": opening_home_implied_prob,
+ "closing_home_implied_prob": closing_home_implied_prob,
+ "home_is_favorite": 1 if home_is_favorite else 0,
+ # Totals features
+ "consensus_opening_total": opening_total,
+ "consensus_closing_total": closing_total,
+ "opening_total_range": opening_total_range,
+ "closing_total_range": closing_total_range,
+ "num_books_total": num_books_total,
+ "total_movement": total_movement,
+ }
+ )
+
+ features_df = pd.DataFrame(features_list)
+ logger.info(f"Extracted features for {len(features_df)} events")
+
+ return features_df
+
+
+async def fetch_kenpom_fanmatch(game_date: str, events_df: pd.DataFrame) -> pd.DataFrame:
+ """Fetch KenPom FanMatch predictions for games on specific date.
+
+ Args:
+ game_date: Date in YYYY-MM-DD format
+ events_df: DataFrame with home_team, away_team
+
+ Returns:
+ DataFrame with KenPom predictions merged by team names
+ """
+ logger.info(f"Fetching KenPom FanMatch predictions for {game_date}...")
+
+ kenpom = KenPomAdapter()
+ try:
+ fanmatch_games = await kenpom.get_fanmatch(game_date)
+ logger.info(f" Received {len(fanmatch_games)} FanMatch predictions")
+ except Exception as e:
+ logger.warning(f" Failed to fetch FanMatch data: {e}")
+ return pd.DataFrame()
+ finally:
+ await kenpom.close()
+
+ if len(fanmatch_games) == 0:
+ return pd.DataFrame()
+
+ # Parse FanMatch data
+ fanmatch_list = []
+ for game in fanmatch_games:
+ home_pred = game.get("HomePred")
+ visitor_pred = game.get("VisitorPred")
+
+ fanmatch_list.append(
+ {
+ "home": game.get("Home"),
+ "visitor": game.get("Visitor"),
+ "kenpom_home_pred": home_pred,
+ "kenpom_visitor_pred": visitor_pred,
+ "kenpom_home_wp": game.get("HomeWP"),
+                # Explicit None checks so a legitimate 0 prediction isn't dropped
+                "kenpom_predicted_total": (
+                    home_pred + visitor_pred
+                    if home_pred is not None and visitor_pred is not None
+                    else None
+                ),
+                "kenpom_predicted_spread": (
+                    home_pred - visitor_pred
+                    if home_pred is not None and visitor_pred is not None
+                    else None
+                ),
+ }
+ )
+
+ fanmatch_df = pd.DataFrame(fanmatch_list)
+
+ # Match with our events using team name mappings
+ # KenPom uses short names like "Duke", we use "Duke Blue Devils"
+ matched = []
+ for _, event in events_df.iterrows():
+ our_home = event["home_team"]
+ our_away = event["away_team"]
+
+ # Try to find matching FanMatch game
+ for _, fm in fanmatch_df.iterrows():
+ kp_home = fm["home"]
+ kp_visitor = fm["visitor"]
+
+ # Skip if team names are None
+ if not kp_home or not kp_visitor:
+ continue
+
+ # Match if KenPom name is contained in our full name
+ # Example: "Duke" in "Duke Blue Devils"
+ home_match = kp_home in our_home or our_home in kp_home
+ away_match = kp_visitor in our_away or our_away in kp_visitor
+
+ if home_match and away_match:
+ matched.append(
+ {
+ "home_team": our_home,
+ "away_team": our_away,
+ "kenpom_predicted_home_score": fm["kenpom_home_pred"],
+ "kenpom_predicted_away_score": fm["kenpom_visitor_pred"],
+ "kenpom_predicted_spread": fm["kenpom_predicted_spread"],
+ "kenpom_predicted_total": fm["kenpom_predicted_total"],
+ "kenpom_home_wp": fm["kenpom_home_wp"],
+ }
+ )
+ break
+
+ matched_df = pd.DataFrame(matched)
+ logger.info(f" Matched {len(matched_df)} games with our events")
+
+ return matched_df
+
+
+async def main_async(args: argparse.Namespace) -> None:
+ """Main async function to generate predictions."""
+ logger.info("[OK] === Generating Predictions with KenPom Data ===\n")
+
+ # Train models
+ logger.info("Training models with best seed (1024)...")
+
+ import xgboost as xgb
+
+ df = read_parquet_df("data/staging/complete_dataset.parquet")
+
+ # Train spreads model
+ spreads_features = [
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "opening_spread_range",
+ "closing_spread_range",
+ "num_books_spread",
+ "spread_magnitude_movement",
+ "opening_home_implied_prob",
+ "closing_home_implied_prob",
+ "home_is_favorite",
+ ]
+ spreads_df_clean = df.dropna(subset=spreads_features + ["home_covered_spread"])
+ X_spreads_train = spreads_df_clean[spreads_features]
+ y_spreads_train = spreads_df_clean["home_covered_spread"]
+
+ spreads_params = {
+ "n_estimators": 300,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 5,
+ "gamma": 1.0,
+ "reg_alpha": 1.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "objective": "binary:logistic",
+ "random_state": 1024,
+ }
+ spreads_model = xgb.XGBClassifier(**spreads_params)
+ spreads_model.fit(X_spreads_train, y_spreads_train, verbose=False)
+
+ # Train totals model
+ totals_features = [
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "opening_total_range",
+ "closing_total_range",
+ "num_books_total",
+ "total_movement",
+ ]
+ totals_df_clean = df.dropna(subset=totals_features + ["went_over"])
+ X_totals_train = totals_df_clean[totals_features]
+ y_totals_train = totals_df_clean["went_over"]
+
+ totals_params = {
+ "n_estimators": 300,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 5,
+ "gamma": 1.0,
+ "reg_alpha": 1.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "objective": "binary:logistic",
+ "random_state": 1024,
+ }
+ totals_model = xgb.XGBClassifier(**totals_params)
+ totals_model.fit(X_totals_train, y_totals_train, verbose=False)
+
+    logger.info("  Spreads model trained ✅")
+    logger.info("  Totals model trained ✅\n")
+
+ # Get today's games
+ db = OddsAPIDatabase(args.db_path)
+
+ query = """
+ SELECT event_id, home_team, away_team, commence_time
+ FROM events
+ WHERE DATE(commence_time) >= DATE('now')
+ AND DATE(commence_time) <= DATE('now', '+1 day')
+ ORDER BY commence_time
+ """
+
+ events_df = pd.read_sql_query(query, db.conn)
+ logger.info(f"Found {len(events_df)} games\n")
+
+ if len(events_df) == 0:
+ logger.warning("No games found for today/tomorrow")
+ return
+
+ # Extract features
+ features_df = extract_legacy_features(db, events_df)
+
+ # Merge with event info
+ predictions_df = events_df.merge(
+ features_df, on="event_id", how="inner", suffixes=("_event", "")
+ )
+
+ # Fetch KenPom FanMatch predictions
+ today = datetime.now(PST).date().isoformat()
+ kenpom_df = await fetch_kenpom_fanmatch(today, events_df)
+
+ # Merge KenPom predictions
+ if len(kenpom_df) > 0:
+ predictions_df = predictions_df.merge(kenpom_df, on=["home_team", "away_team"], how="left")
+
+ # Make spreads predictions
+    spreads_df = predictions_df.dropna(subset=spreads_features).copy()
+ logger.info(f"Generating spreads predictions for {len(spreads_df)} games...")
+
+ if len(spreads_df) > 0:
+ X_spreads = spreads_df[spreads_features]
+ spreads_probs = spreads_model.predict_proba(X_spreads)[:, 1]
+ spreads_df.loc[:, "home_spread_prob"] = spreads_probs
+ spreads_df.loc[:, "away_spread_prob"] = 1 - spreads_probs
+
+ # Make totals predictions
+    totals_df = predictions_df.dropna(subset=totals_features).copy()
+ logger.info(f"Generating totals predictions for {len(totals_df)} games...")
+
+ if len(totals_df) > 0:
+ X_totals = totals_df[totals_features]
+ over_probs = totals_model.predict_proba(X_totals)[:, 1]
+ totals_df.loc[:, "over_prob"] = over_probs
+ totals_df.loc[:, "under_prob"] = 1 - over_probs
+
+ # Combine predictions
+ final_df = predictions_df[["event_id", "home_team", "away_team", "commence_time"]].copy()
+
+ if len(spreads_df) > 0:
+ final_df = final_df.merge(
+ spreads_df[
+ [
+ "event_id",
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "home_spread_prob",
+ "away_spread_prob",
+ ]
+ ],
+ on="event_id",
+ how="left",
+ )
+
+ if len(totals_df) > 0:
+ final_df = final_df.merge(
+ totals_df[
+ [
+ "event_id",
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "over_prob",
+ "under_prob",
+ ]
+ ],
+ on="event_id",
+ how="left",
+ )
+
+ # Add KenPom columns if available
+ if len(kenpom_df) > 0:
+ final_df = final_df.merge(
+ kenpom_df[
+ [
+ "home_team",
+ "away_team",
+ "kenpom_predicted_home_score",
+ "kenpom_predicted_away_score",
+ "kenpom_predicted_spread",
+ "kenpom_predicted_total",
+ "kenpom_home_wp",
+ ]
+ ],
+ on=["home_team", "away_team"],
+ how="left",
+ )
+
+ # Sort by commence time
+ final_df = final_df.sort_values("commence_time")
+
+ # Save
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+ final_df.to_csv(args.output, index=False)
+
+ logger.info(f"\n[OK] Saved {len(final_df)} predictions to {args.output}")
+
+ # Show summary
+ logger.info("\n=== Prediction Summary ===")
+ logger.info(f" Total games: {len(final_df)}")
+ logger.info(f" With spreads: {final_df['home_spread_prob'].notna().sum()}")
+ logger.info(f" With totals: {final_df['over_prob'].notna().sum()}")
+ if len(kenpom_df) > 0:
+ logger.info(f" With KenPom: {final_df['kenpom_predicted_spread'].notna().sum()}")
+
+
+def main() -> None:
+ """Entry point."""
+ parser = argparse.ArgumentParser(description="Generate predictions with KenPom data")
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("predictions") / f"{datetime.now(PST).date()}.csv",
+ help="Output CSV path",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+
+ args = parser.parse_args()
+
+ # Run async main
+ asyncio.run(main_async(args))
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/prediction/predict_with_legacy_models.py b/scripts/prediction/predict_with_legacy_models.py
new file mode 100644
index 000000000..16af8df57
--- /dev/null
+++ b/scripts/prediction/predict_with_legacy_models.py
@@ -0,0 +1,364 @@
+"""Generate predictions using legacy trained models.
+
+Uses the seed ensemble models trained on complete_dataset.parquet features.
+
+Usage:
+ python scripts/prediction/predict_with_legacy_models.py
+ python scripts/prediction/predict_with_legacy_models.py --output predictions/today.csv
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from datetime import datetime
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+import pandas as pd
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+PST = ZoneInfo("America/Los_Angeles")
+
+
+def extract_legacy_features(db: OddsAPIDatabase, events_df: pd.DataFrame) -> pd.DataFrame:
+ """Extract features matching legacy dataset format.
+
+ Args:
+ db: Database connection
+ events_df: DataFrame with event_id, home_team, away_team
+
+ Returns:
+ DataFrame with features for spreads and totals models
+ """
+ logger.info(f"Extracting features for {len(events_df)} events...")
+
+ # Get canonical spreads and totals for each event
+ features_list = []
+
+ for _, event in events_df.iterrows():
+ event_id = event["event_id"]
+ home_team = event["home_team"]
+ away_team = event["away_team"]
+
+ # Get spreads
+ spreads = db.get_canonical_spreads(event_id=event_id)
+ if len(spreads) == 0:
+ continue
+
+ # Get consensus opening and closing spreads
+ spreads_by_time = spreads.sort_values("as_of")
+ opening_spreads = spreads_by_time.head(10) # First 10 observations
+ closing_spreads = spreads_by_time.tail(10) # Last 10 observations
+
+ opening_spread_magnitude = opening_spreads["spread_magnitude"].median()
+ closing_spread_magnitude = closing_spreads["spread_magnitude"].median()
+ opening_spread_range = (
+ opening_spreads["spread_magnitude"].max() - opening_spreads["spread_magnitude"].min()
+ )
+ closing_spread_range = (
+ closing_spreads["spread_magnitude"].max() - closing_spreads["spread_magnitude"].min()
+ )
+ num_books_spread = spreads["book_key"].nunique()
+
+ # Get favorite from most recent spreads
+ latest_spread = spreads_by_time.iloc[-1]
+ home_is_favorite = latest_spread["favorite_team"] == home_team
+
+ # Calculate implied probabilities (simple conversion)
+ # Using -110 juice assumption
+ if home_is_favorite:
+ closing_home_implied_prob = 0.5 + (closing_spread_magnitude / 100) * 0.5
+ opening_home_implied_prob = 0.5 + (opening_spread_magnitude / 100) * 0.5
+ else:
+ closing_home_implied_prob = 0.5 - (closing_spread_magnitude / 100) * 0.5
+ opening_home_implied_prob = 0.5 - (opening_spread_magnitude / 100) * 0.5
+
+ # Get totals
+ totals = db.get_canonical_totals(event_id=event_id)
+ if len(totals) > 0:
+ totals_by_time = totals.sort_values("as_of")
+ opening_totals = totals_by_time.head(10)
+ closing_totals = totals_by_time.tail(10)
+
+ # canonical_totals uses "total" column for total points
+ opening_total = opening_totals["total"].median()
+ closing_total = closing_totals["total"].median()
+ opening_total_range = opening_totals["total"].max() - opening_totals["total"].min()
+ closing_total_range = closing_totals["total"].max() - closing_totals["total"].min()
+ num_books_total = totals["book_key"].nunique()
+ total_movement = closing_total - opening_total
+ else:
+ opening_total = None
+ closing_total = None
+ opening_total_range = None
+ closing_total_range = None
+ num_books_total = None
+ total_movement = None
+
+ features_list.append(
+ {
+ "event_id": event_id,
+ "home_team": home_team,
+ "away_team": away_team,
+ # Spreads features
+ "consensus_opening_spread_magnitude": opening_spread_magnitude,
+ "consensus_closing_spread_magnitude": closing_spread_magnitude,
+ "opening_spread_range": opening_spread_range,
+ "closing_spread_range": closing_spread_range,
+ "num_books_spread": num_books_spread,
+ "spread_magnitude_movement": abs(
+ closing_spread_magnitude - opening_spread_magnitude
+ ),
+ "opening_home_implied_prob": opening_home_implied_prob,
+ "closing_home_implied_prob": closing_home_implied_prob,
+ "home_is_favorite": 1 if home_is_favorite else 0,
+ # Totals features
+ "consensus_opening_total": opening_total,
+ "consensus_closing_total": closing_total,
+ "opening_total_range": opening_total_range,
+ "closing_total_range": closing_total_range,
+ "num_books_total": num_books_total,
+ "total_movement": total_movement,
+ }
+ )
+
+ features_df = pd.DataFrame(features_list)
+ logger.info(f"Extracted features for {len(features_df)} events")
+
+ return features_df
+
+
+def main() -> None:
+ """Generate predictions for today's games."""
+ parser = argparse.ArgumentParser(description="Generate predictions with legacy models")
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=Path("predictions") / f"{datetime.now(PST).date()}.csv",
+ help="Output CSV path",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path",
+ )
+
+ args = parser.parse_args()
+
+ logger.info("[OK] === Generating Predictions with Legacy Models ===\n")
+
+ # Train single models with best seed (1024) instead of loading ensemble
+ # This avoids pickle import issues and uses the best performing seed
+ logger.info("Training models with best seed (1024)...")
+
+ import xgboost as xgb
+
+ from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+ # Load legacy dataset
+ df = read_parquet_df("data/staging/complete_dataset.parquet")
+
+ # Train spreads model
+ spreads_features = [
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "opening_spread_range",
+ "closing_spread_range",
+ "num_books_spread",
+ "spread_magnitude_movement",
+ "opening_home_implied_prob",
+ "closing_home_implied_prob",
+ "home_is_favorite",
+ ]
+ spreads_df_clean = df.dropna(subset=spreads_features + ["home_covered_spread"])
+ X_spreads_train = spreads_df_clean[spreads_features]
+ y_spreads_train = spreads_df_clean["home_covered_spread"]
+
+ spreads_params = {
+ "n_estimators": 300,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 5,
+ "gamma": 1.0,
+ "reg_alpha": 1.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "objective": "binary:logistic",
+ "random_state": 1024, # Best seed
+ }
+ spreads_model = xgb.XGBClassifier(**spreads_params)
+ spreads_model.fit(X_spreads_train, y_spreads_train, verbose=False)
+
+ # Train totals model
+ totals_features = [
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "opening_total_range",
+ "closing_total_range",
+ "num_books_total",
+ "total_movement",
+ ]
+ totals_df_clean = df.dropna(subset=totals_features + ["went_over"])
+ X_totals_train = totals_df_clean[totals_features]
+ y_totals_train = totals_df_clean["went_over"]
+
+ totals_params = {
+ "n_estimators": 300,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 5,
+ "gamma": 1.0,
+ "reg_alpha": 1.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "objective": "binary:logistic",
+ "random_state": 1024, # Best seed
+ }
+ totals_model = xgb.XGBClassifier(**totals_params)
+ totals_model.fit(X_totals_train, y_totals_train, verbose=False)
+
+    logger.info("  Spreads model trained ✅")
+    logger.info("  Totals model trained ✅\n")
+
+ # Get today's games
+ db = OddsAPIDatabase(args.db_path)
+
+ query = """
+ SELECT event_id, home_team, away_team, commence_time
+ FROM events
+ WHERE DATE(commence_time) >= DATE('now')
+ AND DATE(commence_time) <= DATE('now', '+1 day')
+ ORDER BY commence_time
+ """
+
+ events_df = pd.read_sql_query(query, db.conn)
+ logger.info(f"Found {len(events_df)} games\n")
+
+ if len(events_df) == 0:
+ logger.warning("No games found for today/tomorrow")
+ return
+
+ # Extract features
+ features_df = extract_legacy_features(db, events_df)
+
+ # Merge with event info
+ predictions_df = events_df.merge(
+ features_df, on="event_id", how="inner", suffixes=("_event", "")
+ )
+
+ # Make spreads predictions
+ spreads_features = [
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "opening_spread_range",
+ "closing_spread_range",
+ "num_books_spread",
+ "spread_magnitude_movement",
+ "opening_home_implied_prob",
+ "closing_home_implied_prob",
+ "home_is_favorite",
+ ]
+
+    spreads_df = predictions_df.dropna(subset=spreads_features).copy()
+ logger.info(f"Generating spreads predictions for {len(spreads_df)} games...")
+
+ if len(spreads_df) > 0:
+ X_spreads = spreads_df[spreads_features]
+ spreads_probs = spreads_model.predict_proba(X_spreads)[:, 1] # Get positive class
+ spreads_df.loc[:, "home_spread_prob"] = spreads_probs
+ spreads_df.loc[:, "away_spread_prob"] = 1 - spreads_probs
+
+ # Make totals predictions
+ totals_features = [
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "opening_total_range",
+ "closing_total_range",
+ "num_books_total",
+ "total_movement",
+ ]
+
+    totals_df = predictions_df.dropna(subset=totals_features).copy()
+ logger.info(f"Generating totals predictions for {len(totals_df)} games...")
+
+ if len(totals_df) > 0:
+ X_totals = totals_df[totals_features]
+ over_probs = totals_model.predict_proba(X_totals)[:, 1] # Get positive class
+ totals_df.loc[:, "over_prob"] = over_probs
+ totals_df.loc[:, "under_prob"] = 1 - over_probs
+
+ # Combine predictions
+ final_df = predictions_df[["event_id", "home_team", "away_team", "commence_time"]].copy()
+
+ if len(spreads_df) > 0:
+ final_df = final_df.merge(
+ spreads_df[
+ [
+ "event_id",
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "home_spread_prob",
+ "away_spread_prob",
+ ]
+ ],
+ on="event_id",
+ how="left",
+ )
+
+ if len(totals_df) > 0:
+ final_df = final_df.merge(
+ totals_df[
+ [
+ "event_id",
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "over_prob",
+ "under_prob",
+ ]
+ ],
+ on="event_id",
+ how="left",
+ )
+
+ # Sort by commence time
+ final_df = final_df.sort_values("commence_time")
+
+ # Save
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+ final_df.to_csv(args.output, index=False)
+
+ logger.info(f"\n[OK] Saved {len(final_df)} predictions to {args.output}")
+
+ # Show summary
+ logger.info("\n=== Prediction Summary ===")
+ logger.info(f" Total games: {len(final_df)}")
+ logger.info(f" With spreads: {final_df['home_spread_prob'].notna().sum()}")
+ logger.info(f" With totals: {final_df['over_prob'].notna().sum()}")
+
+ if len(final_df) > 0:
+ logger.info("\nTop 5 Confident Spreads Predictions:")
+ spreads_confident = final_df.dropna(subset=["home_spread_prob"]).copy()
+        spreads_confident["home_confidence"] = (spreads_confident["home_spread_prob"] - 0.5).abs()
+ top_spreads = spreads_confident.nlargest(5, "home_confidence")
+
+ for _, row in top_spreads.iterrows():
+ logger.info(
+ f" {row['home_team']}: "
+ f"{row['home_spread_prob']:.1%} to cover | "
+ f"{row['away_team']}: {row['away_spread_prob']:.1%} to cover"
+ )
+
+
+if __name__ == "__main__":
+ main()
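Per the repo rule that moneylines are aggregated as implied probabilities, the conversion from American odds can be sketched as below (helper name is illustrative, not part of this diff):

```python
def american_to_implied_prob(odds: int) -> float:
    """Convert American odds to the book's implied win probability (vig included)."""
    if odds < 0:
        # Favorite: risk |odds| to win 100
        return -odds / (-odds + 100)
    # Underdog: risk 100 to win `odds`
    return 100 / (odds + 100)

print(round(american_to_implied_prob(-110), 4))  # 0.5238 (standard -110 juice)
```

Note that the two sides of a market sum to more than 1.0; the excess is the book's vig.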
diff --git a/scripts/processing/.gitkeep b/scripts/processing/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/scripts/processing/add_missing_team_mappings.py b/scripts/processing/add_missing_team_mappings.py
new file mode 100644
index 000000000..cc9e66e2e
--- /dev/null
+++ b/scripts/processing/add_missing_team_mappings.py
@@ -0,0 +1,101 @@
+"""Add missing team mappings identified in Phase 1 audit."""
+
+from __future__ import annotations
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_parquet
+
+
+def main() -> None:
+ """Add 5 confirmed team mappings for Odds API names."""
+ # Read existing mappings
+ mapping_path = "data/staging/mappings/team_mapping.parquet"
+ df = read_parquet_df(mapping_path)
+
+ print(f"Current mappings: {len(df)}")
+
+ # New mappings to add (only teams confirmed in KenPom)
+ new_mappings = [
+ {
+ "kenpom_name": "Alabama St.",
+ "odds_api_name": "Alabama State Hornets",
+ "odds_api_score": 0.90, # Manual match
+ "espn_name": None, # Will be filled by ESPN mapper
+ "espn_score": None,
+ },
+ {
+ "kenpom_name": "Nicholls",
+ "odds_api_name": "Nicholls Colonels",
+ "odds_api_score": 0.90,
+ "espn_name": None,
+ "espn_score": None,
+ },
+ {
+ "kenpom_name": "St. Thomas",
+ "odds_api_name": "St. Thomas-Minnesota Tommies",
+ "odds_api_score": 0.90,
+ "espn_name": None,
+ "espn_score": None,
+ },
+ {
+ "kenpom_name": "Texas A&M Corpus Chris",
+ "odds_api_name": "Texas A&M-Corpus Christi Islanders",
+ "odds_api_score": 0.90,
+ "espn_name": None,
+ "espn_score": None,
+ },
+ {
+ "kenpom_name": "Tennessee Martin",
+ "odds_api_name": "UT Martin Skyhawks",
+ "odds_api_score": 0.90,
+ "espn_name": None,
+ "espn_score": None,
+ },
+ ]
+
+ # Check for duplicates before adding
+ existing_kenpom = set(df["kenpom_name"].values)
+ existing_odds_api = set(df["odds_api_name"].dropna().values)
+
+ to_add = []
+ for mapping in new_mappings:
+ if mapping["kenpom_name"] in existing_kenpom:
+ print(f"[SKIP] {mapping['kenpom_name']} already mapped")
+ continue
+ if mapping["odds_api_name"] in existing_odds_api:
+ print(f"[SKIP] {mapping['odds_api_name']} already mapped")
+ continue
+ to_add.append(mapping)
+ print(f"[ADD] {mapping['kenpom_name']} <- {mapping['odds_api_name']}")
+
+ if len(to_add) == 0:
+ print("\nNo new mappings to add (all already exist)")
+ return
+
+ # Append new mappings
+ new_df = pd.DataFrame(to_add)
+ updated_df = pd.concat([df, new_df], ignore_index=True)
+
+ # Save updated mappings
+ write_parquet(updated_df, mapping_path)
+
+ print(f"\nUpdated mappings: {len(updated_df)} (added {len(to_add)})")
+
+ # Document unmappable teams (not in KenPom D1)
+ print("\n" + "=" * 60)
+ print("UNMAPPABLE TEAMS (not in KenPom D1 database):")
+ print("=" * 60)
+ unmappable = [
+ "Bryant & Stratton (Ohio) Bobcats - NAIA/D3 school",
+ "Elizabeth City State Vikings - D2 school",
+ "Elms College Blazers - D3 school",
+ "Morehouse Maroon Tigers - D2 school",
+ ]
+ for team in unmappable:
+ print(f" - {team}")
+ print("\nThese teams appear in exhibition games and cannot generate features.")
+
+
+if __name__ == "__main__":
+ main()
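The hand-assigned `odds_api_score` of 0.90 here sits alongside scores produced by automated matching. As a rough sketch of how such a similarity score could be computed (the real matcher's algorithm is not shown in this diff):

```python
from difflib import SequenceMatcher


def match_score(kenpom_name: str, odds_api_name: str) -> float:
    """Case-insensitive similarity in [0, 1] between two team-name strings."""
    return SequenceMatcher(None, kenpom_name.lower(), odds_api_name.lower()).ratio()


# "Nicholls" vs "Nicholls Colonels" scores well below 1.0 because of the
# mascot suffix, which is why borderline pairs end up reviewed by hand.
```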
diff --git a/scripts/processing/build_team_mapping.py b/scripts/processing/build_team_mapping.py
new file mode 100644
index 000000000..4a07006f2
--- /dev/null
+++ b/scripts/processing/build_team_mapping.py
@@ -0,0 +1,137 @@
+"""Build canonical team mapping table from multiple data sources.
+
+This script creates a master team mapping table that allows us to join
+data across KenPom, ESPN, Overtime.ag, and The Odds API by normalizing
+team names to a single canonical identifier.
+
+Usage:
+    uv run python scripts/processing/build_team_mapping.py
+
+Output:
+ data/processed/team_mapping.parquet - Canonical team mapping table
+"""
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_parquet
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def load_kenpom_teams(season: int = 2026) -> pd.DataFrame:
+ """Load KenPom team data for a given season.
+
+ Args:
+ season: The season year (default: 2026 for current season)
+
+ Returns:
+ DataFrame with KenPom team information
+ """
+ kenpom_path = Path(f"data/kenpom/teams/season/teams_{season}.parquet")
+ if not kenpom_path.exists():
+ raise FileNotFoundError(f"KenPom team data not found: {kenpom_path}")
+
+ df = read_parquet_df(str(kenpom_path))
+ logger.info(f"Loaded {len(df)} teams from KenPom {season} season")
+ return df
+
+
+def create_canonical_mapping(kenpom_df: pd.DataFrame) -> pd.DataFrame:
+ """Create canonical team mapping table from KenPom base data.
+
+ Args:
+ kenpom_df: KenPom team DataFrame
+
+ Returns:
+ Canonical team mapping DataFrame
+ """
+ mapping = pd.DataFrame(
+ {
+ # Canonical identifiers (using KenPom as base)
+ "canonical_team_id": kenpom_df["TeamID"],
+ "canonical_name": kenpom_df["TeamName"],
+ "conference": kenpom_df["ConfShort"],
+ "division": "D1", # All KenPom teams are D1
+ # KenPom fields
+ "kenpom_id": kenpom_df["TeamID"],
+ "kenpom_name": kenpom_df["TeamName"],
+ # ESPN fields (to be populated)
+ "espn_id": pd.NA,
+ "espn_display_name": pd.NA,
+ "espn_abbreviation": pd.NA,
+ "espn_slug": pd.NA,
+ # Overtime.ag fields (to be populated)
+ "overtime_name": pd.NA,
+ # Odds API fields (to be populated)
+ "odds_api_name": pd.NA,
+ }
+ )
+
+ # Convert to appropriate dtypes
+ mapping = mapping.astype(
+ {
+ "canonical_team_id": "int64",
+ "canonical_name": "string",
+ "conference": "string",
+ "division": "string",
+ "kenpom_id": "int64",
+ "kenpom_name": "string",
+ "espn_id": "Int64", # Nullable integer
+ "espn_display_name": "string",
+ "espn_abbreviation": "string",
+ "espn_slug": "string",
+ "overtime_name": "string",
+ "odds_api_name": "string",
+ }
+ )
+
+ logger.info(f"Created canonical mapping with {len(mapping)} teams")
+ return mapping
+
+
+def save_team_mapping(mapping_df: pd.DataFrame, output_path: Path) -> None:
+ """Save team mapping table to parquet.
+
+ Args:
+ mapping_df: Team mapping DataFrame
+ output_path: Path to save the parquet file
+ """
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ write_parquet(mapping_df, str(output_path), index=False)
+ logger.info(f"Saved team mapping to {output_path}")
+
+
+def main() -> None:
+ """Build canonical team mapping from KenPom data."""
+ logger.info("Starting team mapping build process...")
+
+ # Load KenPom data (base for canonical mapping)
+ kenpom_df = load_kenpom_teams(season=2026)
+
+ # Create canonical mapping table
+ mapping_df = create_canonical_mapping(kenpom_df)
+
+ # Display summary
+ logger.info("\nTeam Mapping Summary:")
+ logger.info(f" Total teams: {len(mapping_df)}")
+ logger.info(f" Conferences: {mapping_df['conference'].nunique()}")
+ logger.info(
+ f"\nTop 5 conferences by team count:\n{mapping_df['conference'].value_counts().head()}"
+ )
+
+ # Save to parquet
+ output_path = Path("data/processed/team_mapping.parquet")
+ save_team_mapping(mapping_df, output_path)
+
+ logger.info("\nNext steps:")
+ logger.info(" 1. Run: python scripts/map_espn_teams.py")
+ logger.info(" 2. Run: python scripts/map_overtime_teams.py")
+ logger.info(" 3. Run: python scripts/map_odds_api_teams.py")
+
+
+if __name__ == "__main__":
+ main()
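Once the mapper scripts populate the source-specific columns, each feed can be joined back to the canonical identifier. A toy version of that join (the frames are hypothetical stand-ins for the real staging tables):

```python
import pandas as pd

mapping = pd.DataFrame({
    "canonical_name": ["Duke", "Kansas"],
    "odds_api_name": ["Duke Blue Devils", "Kansas Jayhawks"],
})
odds_rows = pd.DataFrame({
    "home_team": ["Duke Blue Devils"],
    "closing_spread_magnitude": [6.5],
})
# Left join keeps every odds row, attaching the canonical name where mapped
joined = odds_rows.merge(mapping, left_on="home_team", right_on="odds_api_name", how="left")
```

Unmapped rows surface as nulls in `canonical_name`, which is how missing mappings get noticed.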
diff --git a/scripts/processing/build_training_datasets.py b/scripts/processing/build_training_datasets.py
new file mode 100644
index 000000000..cf297cc0e
--- /dev/null
+++ b/scripts/processing/build_training_datasets.py
@@ -0,0 +1,108 @@
+"""Build XGBoost training datasets from integrated data sources.
+
+Combines KenPom, line movement, and game outcomes into ML-ready datasets.
+
+Usage:
+ # Build datasets for December 2025
+    uv run python scripts/processing/build_training_datasets.py --start 2025-12-01 --end 2025-12-31
+
+    # Build into a specific output directory
+    uv run python scripts/processing/build_training_datasets.py --output-dir data/ml
+"""
+
+import argparse
+import logging
+from pathlib import Path
+
+from sports_betting_edge.adapters.filesystem import write_parquet
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Build training datasets."""
+ parser = argparse.ArgumentParser(description="Build ML training datasets")
+ parser.add_argument(
+ "--start",
+ type=str,
+ default="2025-12-01",
+ help="Start date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--end",
+ type=str,
+ default="2026-01-31",
+ help="End date (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--output-dir",
+ type=str,
+ default="data/ml",
+ help="Output directory for datasets",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+ )
+
+ args = parser.parse_args()
+
+ logger.info("Building training datasets...")
+ logger.info(f"Date range: {args.start} to {args.end}")
+
+ output_dir = Path(args.output_dir)
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ engineer = FeatureEngineer(staging_path="data/staging/")
+
+ # Build spreads dataset
+ logger.info("\n=== Building Spreads Dataset ===")
+ X_spreads, y_spreads = engineer.build_spreads_dataset(
+ start_date=args.start,
+ end_date=args.end,
+ season=args.season,
+ )
+
+ # Save spreads dataset
+ spreads_output = output_dir / f"spreads_{args.start}_{args.end}.parquet"
+ spreads_data = X_spreads.copy()
+ spreads_data["target"] = y_spreads
+    write_parquet(spreads_data, str(spreads_output))
+ logger.info(f"[OK] Saved spreads dataset -> {spreads_output}")
+ logger.info(f" {len(spreads_data)} games, {len(X_spreads.columns)} features")
+
+ # Build totals dataset
+ logger.info("\n=== Building Totals Dataset ===")
+ X_totals, y_totals = engineer.build_totals_dataset(
+ start_date=args.start,
+ end_date=args.end,
+ season=args.season,
+ )
+
+ # Save totals dataset
+ totals_output = output_dir / f"totals_{args.start}_{args.end}.parquet"
+ totals_data = X_totals.copy()
+ totals_data["target"] = y_totals
+    write_parquet(totals_data, str(totals_output))
+ logger.info(f"[OK] Saved totals dataset -> {totals_output}")
+ logger.info(f" {len(totals_data)} games, {len(X_totals.columns)} features")
+
+ # Summary
+ logger.info("\n=== Summary ===")
+ logger.info(f"Spreads: {len(X_spreads)} games")
+ logger.info(f" Favorite covered: {y_spreads.sum()} ({y_spreads.mean():.1%})")
+ logger.info(f" Favorite failed: {(~y_spreads.astype(bool)).sum()}")
+
+ logger.info(f"\nTotals: {len(X_totals)} games")
+ logger.info(f" Went over: {y_totals.sum()} ({y_totals.mean():.1%})")
+ logger.info(f" Went under: {(~y_totals.astype(bool)).sum()}")
+
+ logger.info("\n[OK] Training datasets built successfully!")
+
+
+if __name__ == "__main__":
+ main()
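Since these datasets are time-ordered, any downstream train/test split should be chronological rather than random so line-movement features never leak future information. A minimal sketch (the cutoff date is illustrative):

```python
import pandas as pd

games = pd.DataFrame({
    "game_date": pd.to_datetime(["2025-12-05", "2025-12-20", "2026-01-10", "2026-01-25"]),
    "target": [1, 0, 1, 0],
})
cutoff = pd.Timestamp("2026-01-01")
train = games[games["game_date"] < cutoff]   # fit on earlier games only
test = games[games["game_date"] >= cutoff]   # evaluate on later games
```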
diff --git a/scripts/processing/consolidate_parquet_files.py b/scripts/processing/consolidate_parquet_files.py
new file mode 100644
index 000000000..b7c74f67c
--- /dev/null
+++ b/scripts/processing/consolidate_parquet_files.py
@@ -0,0 +1,276 @@
+"""Consolidate daily parquet files into season-level files with temporal columns.
+
+This script consolidates small daily parquet files into larger season files
+to improve storage efficiency and query performance. Original daily files are
+archived after consolidation.
+
+Usage:
+ uv run python scripts/processing/consolidate_parquet_files.py --category espn
+ uv run python scripts/processing/consolidate_parquet_files.py \
+ --category kenpom --kenpom-type four-factors
+ uv run python scripts/processing/consolidate_parquet_files.py --all
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+logger = logging.getLogger(__name__)
+
+
+def consolidate_espn_schedule(season: int = 2026, dry_run: bool = False) -> None:
+ """Consolidate ESPN schedule files into one season file.
+
+ Args:
+ season: Season year to consolidate
+ dry_run: If True, only show what would be done without making changes
+ """
+ schedule_dir = Path("data/espn/schedule")
+ if not schedule_dir.exists():
+ logger.error(f"ESPN schedule directory not found: {schedule_dir}")
+ return
+
+ daily_files = sorted(schedule_dir.glob(f"{season}-*.parquet"))
+
+ if not daily_files:
+ logger.warning(f"No daily files found for season {season}")
+ return
+
+ logger.info(f"Found {len(daily_files)} daily files to consolidate")
+
+ if dry_run:
+ logger.info("[DRY RUN] Would consolidate the following files:")
+ for file in daily_files:
+ logger.info(f" - {file.name}")
+ return
+
+ # Read all daily files
+ dfs = []
+ for file in daily_files:
+ try:
+ df = pd.read_parquet(file)
+ dfs.append(df)
+ logger.debug(f"Read {file.name}: {len(df)} rows")
+ except Exception as e:
+ logger.error(f"Failed to read {file}: {e}")
+ continue
+
+ if not dfs:
+ logger.error("No data files could be read")
+ return
+
+ # Concatenate all dataframes
+ combined = pd.concat(dfs, ignore_index=True)
+ logger.info(f"Combined data: {len(combined)} total rows")
+
+ # Remove duplicates, keeping latest capture
+ # Group by game_id and game_date, keep row with most recent captured_at
+ combined = combined.sort_values("captured_at", ascending=False)
+ combined = combined.drop_duplicates(subset=["game_id", "game_date"], keep="first")
+ logger.info(f"After deduplication: {len(combined)} unique games")
+
+ # Sort by game_date for better compression and readability
+ combined = combined.sort_values(["game_date", "captured_at"])
+
+ # Write consolidated file
+ output = schedule_dir / f"espn_schedule_{season}.parquet"
+ combined.to_parquet(output, index=False)
+ logger.info(f"Wrote consolidated file: {output} ({len(combined)} rows)")
+
+ # Archive old daily files
+ archive_dir = schedule_dir / "archive" / "daily"
+ archive_dir.mkdir(parents=True, exist_ok=True)
+
+ for file in daily_files:
+ archive_path = archive_dir / file.name
+ file.rename(archive_path)
+ logger.debug(f"Archived {file.name} -> {archive_path}")
+
+ logger.info(f"Archived {len(daily_files)} daily files to {archive_dir}")
+
+
+def consolidate_kenpom_category(category: str, season: int = 2026, dry_run: bool = False) -> None:
+ """Consolidate KenPom daily files into one historical file.
+
+ Args:
+ category: KenPom category (e.g., 'four-factors', 'efficiency')
+ season: Season year to consolidate
+ dry_run: If True, only show what would be done without making changes
+ """
+ daily_dir = Path(f"data/kenpom/{category}/daily")
+ if not daily_dir.exists():
+ logger.warning(f"KenPom daily directory not found: {daily_dir}")
+ return
+
+ # Match files like "four-factors_2026-02-06.parquet"
+ daily_files = sorted(daily_dir.glob(f"{category}_{season}-*.parquet"))
+
+ if not daily_files:
+ logger.warning(f"No daily files found for {category} season {season}")
+ return
+
+ logger.info(f"Found {len(daily_files)} daily files to consolidate for {category}")
+
+ if dry_run:
+ logger.info(f"[DRY RUN] Would consolidate {category} files:")
+ for file in daily_files:
+ logger.info(f" - {file.name}")
+ return
+
+ # Read all daily files
+ dfs = []
+ for file in daily_files:
+ try:
+ df = pd.read_parquet(file)
+
+ # Normalize temporal column names: captured_at -> fetched_at
+ if "captured_at" in df.columns and "fetched_at" not in df.columns:
+ df = df.rename(columns={"captured_at": "fetched_at"})
+ logger.debug("Renamed captured_at -> fetched_at")
+
+ # Ensure temporal column exists
+ if "fetched_at" not in df.columns and "DataThrough" not in df.columns:
+ # Extract date from filename (e.g., "four-factors_2026-02-06.parquet")
+ date_str = file.stem.split("_")[-1] # "2026-02-06"
+ df["fetched_at"] = pd.to_datetime(date_str)
+ logger.debug(f"Added fetched_at column from filename: {date_str}")
+
+ dfs.append(df)
+ logger.debug(f"Read {file.name}: {len(df)} rows")
+ except Exception as e:
+ logger.error(f"Failed to read {file}: {e}")
+ continue
+
+ if not dfs:
+ logger.error(f"No data files could be read for {category}")
+ return
+
+ # Check schema consistency before concatenating
+ if len(dfs) > 1:
+ first_columns = set(dfs[0].columns)
+ for i, df in enumerate(dfs[1:], start=2):
+ if set(df.columns) != first_columns:
+ logger.error(f"Schema mismatch in {category}: File {i} has different columns")
+ logger.error(f" First file: {sorted(first_columns)}")
+ logger.error(f" File {i}: {sorted(df.columns)}")
+ logger.error(f"Skipping {category} consolidation - manual intervention needed")
+ return
+
+ # Concatenate all dataframes
+ combined = pd.concat(dfs, ignore_index=True)
+ logger.info(f"Combined {category} data: {len(combined)} total rows")
+
+ # Note: We don't deduplicate KenPom data because each snapshot is valuable
+ # for tracking how ratings changed over time
+
+ # Create historical directory
+ hist_dir = daily_dir.parent / "historical"
+ hist_dir.mkdir(exist_ok=True)
+
+ # Write consolidated file
+ output = hist_dir / f"{category}_{season}_daily.parquet"
+ combined.to_parquet(output, index=False)
+ logger.info(f"Wrote consolidated file: {output} ({len(combined)} rows)")
+
+ # Archive old daily files
+ archive_dir = daily_dir.parent / "archive" / "daily"
+ archive_dir.mkdir(parents=True, exist_ok=True)
+
+ for file in daily_files:
+ archive_path = archive_dir / file.name
+ file.rename(archive_path)
+ logger.debug(f"Archived {file.name} -> {archive_path}")
+
+ logger.info(f"Archived {len(daily_files)} daily files to {archive_dir}")
+
+
+def main() -> None:
+ """Main consolidation orchestrator."""
+ parser = argparse.ArgumentParser(
+ description="Consolidate daily parquet files into season files"
+ )
+ parser.add_argument(
+ "--category",
+ choices=["espn", "kenpom"],
+ help="Category to consolidate (espn or kenpom)",
+ )
+ parser.add_argument(
+ "--kenpom-type",
+ help="KenPom category type (e.g., four-factors, efficiency, ratings)",
+ )
+ parser.add_argument(
+ "--all",
+ action="store_true",
+ help="Consolidate all categories",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="Season year (default: 2026)",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be done without making changes",
+ )
+ parser.add_argument(
+ "-v",
+ "--verbose",
+ action="store_true",
+ help="Enable verbose logging",
+ )
+
+ args = parser.parse_args()
+
+ # Configure logging
+ logging.basicConfig(
+ level=logging.DEBUG if args.verbose else logging.INFO,
+ format="%(levelname)s: %(message)s",
+ )
+
+ if args.dry_run:
+ logger.info("=== DRY RUN MODE ===")
+
+ if args.all:
+ # Consolidate ESPN
+ logger.info("=== Consolidating ESPN Schedule ===")
+ consolidate_espn_schedule(season=args.season, dry_run=args.dry_run)
+
+ # Consolidate all KenPom categories
+ kenpom_categories = [
+ "four-factors",
+ "efficiency",
+ "fanmatch",
+ "ratings",
+ "conf-ratings",
+ ]
+
+ for category in kenpom_categories:
+ logger.info(f"=== Consolidating KenPom {category} ===")
+ consolidate_kenpom_category(category=category, season=args.season, dry_run=args.dry_run)
+
+ elif args.category == "espn":
+ logger.info("=== Consolidating ESPN Schedule ===")
+ consolidate_espn_schedule(season=args.season, dry_run=args.dry_run)
+
+ elif args.category == "kenpom":
+ if not args.kenpom_type:
+ logger.error("--kenpom-type required when --category=kenpom")
+ return
+
+ logger.info(f"=== Consolidating KenPom {args.kenpom_type} ===")
+ consolidate_kenpom_category(
+ category=args.kenpom_type, season=args.season, dry_run=args.dry_run
+ )
+
+ else:
+ parser.print_help()
+
+
+if __name__ == "__main__":
+ main()
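The keep-latest-capture dedup in `consolidate_espn_schedule` is a sort-then-drop pattern worth seeing in isolation:

```python
import pandas as pd

snapshots = pd.DataFrame({
    "game_id": ["a", "a", "b"],
    "game_date": ["2026-01-05"] * 3,
    "captured_at": ["2026-01-05T10:00", "2026-01-05T18:00", "2026-01-05T12:00"],
})
# Newest capture first, then keep the first row per (game_id, game_date)
latest = (
    snapshots.sort_values("captured_at", ascending=False)
    .drop_duplicates(subset=["game_id", "game_date"], keep="first")
)
```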
diff --git a/scripts/processing/consolidate_staging.py b/scripts/processing/consolidate_staging.py
new file mode 100644
index 000000000..8260b9401
--- /dev/null
+++ b/scripts/processing/consolidate_staging.py
@@ -0,0 +1,1255 @@
+"""Consolidate raw data sources into ML-ready staging layer.
+
+Rebuilds the staging directory with pre-computed, feature-engineered datasets:
+- events.parquet: Unified event catalog with scores
+- line_features.parquet: Pre-computed line movement features from SQLite views
+- team_ratings.parquet: Latest KenPom ratings merged with four factors
+
+This script should run nightly after raw data collection (collect_hybrid.py).
+Staging data is ephemeral and rebuilt from raw sources each time.
+
+Usage:
+ # Rebuild staging layer
+    uv run python scripts/processing/consolidate_staging.py
+
+ # Dry run (validate only, no writes)
+    uv run python scripts/processing/consolidate_staging.py --dry-run
+
+ # Force rebuild even if recent
+    uv run python scripts/processing/consolidate_staging.py --force
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import (
+ read_parquet_df,
+)
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.settings import settings
+from sports_betting_edge.utils.datetime_utils import (
+ convert_series_to_pacific,
+ now_utc,
+ parse_series,
+)
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def check_if_stale(staging_path: Path, max_age_hours: int = 24) -> bool:
+ """Check if staging data needs rebuild based on metadata timestamp.
+
+ Args:
+ staging_path: Path to staging directory
+ max_age_hours: Maximum age in hours before considering stale
+
+ Returns:
+ True if staging is stale or missing, False if recent
+ """
+ metadata_file = staging_path / "metadata.json"
+
+ if not metadata_file.exists():
+ logger.info("No metadata found - staging rebuild needed")
+ return True
+
+ try:
+ with open(metadata_file) as f:
+ metadata = json.load(f)
+
+ built_at_str = metadata.get("built_at")
+ if not built_at_str:
+ logger.warning("Metadata missing built_at timestamp - rebuild needed")
+ return True
+
+ # Parse metadata timestamp (now in "YYYY-MM-DD HH:MM:SS UTC" format)
+ try:
+ built_at = datetime.strptime(built_at_str, "%Y-%m-%d %H:%M:%S UTC")
+ except ValueError:
+ # Fallback for old ISO format timestamps
+            built_at = datetime.fromisoformat(built_at_str.replace("Z", "+00:00")).replace(tzinfo=None)
+
+        # Compare in UTC; datetime.now() would mix local time with a UTC timestamp
+        age = datetime.utcnow() - built_at
+ age_hours = age.total_seconds() / 3600
+
+ if age_hours > max_age_hours:
+ logger.info(f"Staging is stale ({age_hours:.1f}h old, max {max_age_hours}h)")
+ return True
+
+ logger.info(f"Staging is recent ({age_hours:.1f}h old)")
+ return False
+
+ except (json.JSONDecodeError, ValueError) as e:
+ logger.warning(f"Could not parse metadata: {e} - rebuild needed")
+ return True
+
+
+def calculate_rest_features(events: pd.DataFrame) -> pd.DataFrame:
+ """Calculate rest days and road streak features for each game.
+
+ For each game, calculates situational factors:
+ - Days since last game (rest days)
+ - Back-to-back indicator (0 days rest)
+ - Short rest indicator (1 day rest)
+ - Road streak (consecutive away games before this one)
+ - Days since last home game (for away teams)
+
+ Args:
+ events: Events DataFrame with commence_time, home_team, away_team
+
+ Returns:
+ DataFrame with rest features added as new columns
+ """
+ logger.info("Calculating rest days and road streak features...")
+
+ # Ensure sorted by time
+ events = events.sort_values("commence_time").copy()
+
+ # Initialize feature lists
+ rest_features = {
+ "home_rest_days": [],
+ "away_rest_days": [],
+ "home_back_to_back": [],
+ "away_back_to_back": [],
+ "home_short_rest": [],
+ "away_short_rest": [],
+ "away_road_streak": [],
+ "away_days_on_road": [],
+ }
+
+ # Track team history: {team_name: {
+ # 'last_game': datetime, 'last_home_game': datetime, 'road_streak': int
+ # }}
+ team_history: dict[str, dict] = {}
+
+ for _, row in events.iterrows():
+ home_team = row["home_team"]
+ away_team = row["away_team"]
+ game_time = row["commence_time"]
+
+ # === HOME TEAM FEATURES ===
+ if home_team in team_history:
+ last_game = team_history[home_team]["last_game"]
+ rest_days = (game_time - last_game).days
+ rest_features["home_rest_days"].append(rest_days)
+ rest_features["home_back_to_back"].append(rest_days == 0)
+ rest_features["home_short_rest"].append(rest_days == 1)
+ else:
+ # First game of season - use neutral defaults
+ rest_features["home_rest_days"].append(3) # Assume typical 3-day rest
+ rest_features["home_back_to_back"].append(False)
+ rest_features["home_short_rest"].append(False)
+
+ # === AWAY TEAM FEATURES ===
+ if away_team in team_history:
+ last_game = team_history[away_team]["last_game"]
+ rest_days = (game_time - last_game).days
+ road_streak = team_history[away_team]["road_streak"]
+ last_home = team_history[away_team]["last_home_game"]
+
+ rest_features["away_rest_days"].append(rest_days)
+ rest_features["away_back_to_back"].append(rest_days == 0)
+ rest_features["away_short_rest"].append(rest_days == 1)
+ rest_features["away_road_streak"].append(road_streak)
+
+ if last_home is not None:
+ rest_features["away_days_on_road"].append((game_time - last_home).days)
+ else:
+ rest_features["away_days_on_road"].append(0) # First game was away
+ else:
+ # First game of season
+ rest_features["away_rest_days"].append(3) # Assume typical 3-day rest
+ rest_features["away_back_to_back"].append(False)
+ rest_features["away_short_rest"].append(False)
+ rest_features["away_road_streak"].append(0)
+ rest_features["away_days_on_road"].append(0)
+
+ # === UPDATE TEAM HISTORY ===
+ # Home team: playing at home resets road streak
+ team_history[home_team] = {
+ "last_game": game_time,
+ "last_home_game": game_time,
+ "road_streak": 0,
+ }
+
+ # Away team: increment road streak
+ if away_team not in team_history:
+ team_history[away_team] = {
+ "last_game": game_time,
+ "last_home_game": None,
+ "road_streak": 1, # This game starts their streak
+ }
+ else:
+ team_history[away_team]["last_game"] = game_time
+ team_history[away_team]["road_streak"] += 1
+
+ # Add features to events
+ for feature_name, values in rest_features.items():
+ events[feature_name] = values
+
+ # Log summary statistics
+ home_rest_avg = events["home_rest_days"].mean()
+ away_rest_avg = events["away_rest_days"].mean()
+ logger.info(f" Rest days: home avg={home_rest_avg:.1f}, away avg={away_rest_avg:.1f}")
+
+ home_b2b = events["home_back_to_back"].sum()
+ away_b2b = events["away_back_to_back"].sum()
+ logger.info(f" Back-to-back games: home={home_b2b}, away={away_b2b}")
+
+ road_max = events["away_road_streak"].max()
+ road_avg = events["away_road_streak"].mean()
+ logger.info(f" Road streaks: max={road_max}, avg={road_avg:.1f}")
+
+ return events
+
+
+def calculate_rolling_metrics(events: pd.DataFrame) -> pd.DataFrame:
+ """Calculate rolling performance metrics for each team.
+
+ Tracks each team's recent results and computes rolling averages
+ over last 5 and last 10 games. Uses the same chronological
+ iteration pattern as calculate_rest_features().
+
+ Args:
+ events: Events DataFrame with scores, sorted by commence_time
+
+ Returns:
+ DataFrame with rolling metric columns added
+ """
+ from collections import deque
+
+ logger.info("Calculating rolling performance metrics...")
+
+ events = events.sort_values("commence_time").copy()
+
+ # Only compute for games with scores
+ has_scores = events["home_score"].notna() & events["away_score"].notna()
+
+ # Initialize output columns with NaN
+ for col in [
+ "home_last5_margin_avg",
+ "away_last5_margin_avg",
+ "home_last10_margin_avg",
+ "away_last10_margin_avg",
+ "home_last5_win_pct",
+ "away_last5_win_pct",
+ "home_win_streak",
+ "away_win_streak",
+ "home_last5_ppg",
+ "away_last5_ppg",
+ ]:
+ events[col] = float("nan")
+
+ # Track per-team history: deque of (margin, points_scored, won)
+ team_history: dict[str, deque[tuple[float, float, bool]]] = {}
+
+ for idx, row in events.iterrows():
+ home = row["home_team"]
+ away = row["away_team"]
+
+ # Compute features BEFORE this game (look-back only)
+ for team, prefix in [(home, "home"), (away, "away")]:
+ if team in team_history and len(team_history[team]) > 0:
+ history = list(team_history[team])
+ margins = [h[0] for h in history]
+ ppg = [h[1] for h in history]
+ wins = [h[2] for h in history]
+
+ # Last 5
+ l5 = min(5, len(history))
+ events.at[idx, f"{prefix}_last5_margin_avg"] = sum(margins[-l5:]) / l5
+ events.at[idx, f"{prefix}_last5_win_pct"] = sum(wins[-l5:]) / l5
+ events.at[idx, f"{prefix}_last5_ppg"] = sum(ppg[-l5:]) / l5
+
+ # Last 10
+ l10 = min(10, len(history))
+ events.at[idx, f"{prefix}_last10_margin_avg"] = sum(margins[-l10:]) / l10
+
+ # Win streak (consecutive recent wins, negative = losses)
+ streak = 0
+ for w in reversed(wins):
+ if w and streak >= 0:
+ streak += 1
+ elif not w and streak <= 0:
+ streak -= 1
+ else:
+ break
+ events.at[idx, f"{prefix}_win_streak"] = streak
+
+ # Update history AFTER computing features (no data leakage)
+ if has_scores.at[idx]:
+ home_score = float(row["home_score"])
+ away_score = float(row["away_score"])
+ home_margin = home_score - away_score
+ away_margin = away_score - home_score
+
+ if home not in team_history:
+ team_history[home] = deque(maxlen=10)
+ team_history[home].append((home_margin, home_score, home_margin > 0))
+
+ if away not in team_history:
+ team_history[away] = deque(maxlen=10)
+ team_history[away].append((away_margin, away_score, away_margin > 0))
+
+ # Log summary
+ valid = events["home_last5_margin_avg"].notna()
+ logger.info(f" Rolling metrics computed for {valid.sum()}/{len(events)} games")
+
+ return events
+
+
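The signed win-streak loop above is compact but easy to misread. Extracted as a standalone helper (name is illustrative), the rule is: walk results newest-to-oldest and count until the sign flips.

```python
def win_streak(results: list[bool]) -> int:
    """Signed streak over results ordered oldest-to-newest: +N wins, -N losses."""
    streak = 0
    for won in reversed(results):
        if won and streak >= 0:
            streak += 1
        elif not won and streak <= 0:
            streak -= 1
        else:
            break  # the run of same-sign results has ended
    return streak


# [W, L, W, W] -> 2 (two most recent wins); [W, L, L] -> -2
```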
+def consolidate_events(
+ db: OddsAPIDatabase,
+ output_path: Path,
+ dry_run: bool = False,
+) -> pd.DataFrame:
+ """Extract events with scores from SQLite database.
+
+ Args:
+ db: OddsAPIDatabase connection
+ output_path: Path to write events.parquet
+ dry_run: If True, skip writing file
+
+ Returns:
+ Events DataFrame with rest and situational features
+ """
+ logger.info("Extracting events with scores from SQLite...")
+
+ # Get events with scores
+ events = db.get_events_with_scores()
+
+ # Add status column (all games with scores are final)
+ events["status"] = "final"
+ events["source"] = "odds_api"
+
+ # Convert commence_time to Pacific timezone using new utilities
+ # Parse ISO8601 strings to UTC timezone-aware datetimes
+ events["commence_time"] = parse_series(events["commence_time"])
+
+ # Convert to Pacific timezone for display and analysis
+ events["commence_time_pacific"] = convert_series_to_pacific(events["commence_time"])
+
+ # Extract date in Pacific timezone (not UTC!)
+ events["game_date"] = events["commence_time_pacific"].dt.date
+
+ logger.info(f" Found {len(events)} completed games")
+ logger.info(f" Date range: {events['game_date'].min()} to {events['game_date'].max()}")
+
+ # Calculate rest days and road streak features
+ events = calculate_rest_features(events)
+
+ # Calculate rolling performance metrics (last 5/10 games)
+ events = calculate_rolling_metrics(events)
+
+ # Enrich with ESPN context (neutral site, conference)
+ events = enrich_with_espn_context(events, dry_run=dry_run)
+
+ if not dry_run:
+ # Write DataFrame directly using pandas
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ events.to_parquet(output_path, index=False)
+ logger.info(f" Wrote {output_path}")
+
+ return events
+
+
+def consolidate_line_features(
+ db: OddsAPIDatabase,
+ event_ids: list[str],
+ output_path: Path,
+ dry_run: bool = False,
+) -> pd.DataFrame:
+ """Extract pre-computed line features from SQLite views.
+
+ Args:
+ db: OddsAPIDatabase connection
+ event_ids: List of event IDs to extract features for
+ output_path: Path to write line_features.parquet
+ dry_run: If True, skip writing file
+
+ Returns:
+ Line features DataFrame with spreads and totals
+ """
+ logger.info("Extracting line movement features from SQLite views...")
+
+ # Try the ml_line_features view first; if it is missing or fails,
+ # fall back to building features directly from canonical_spreads
+ try:
+ line_features = db.get_ml_line_features(event_ids=event_ids)
+ except Exception as e:
+ logger.warning(f" ml_line_features view failed: {e}")
+ logger.info(" Building features from canonical_spreads instead...")
+
+ # Fallback: Get opening/closing spreads for primary book (fanduel)
+ line_features = db.get_opening_closing_spreads(
+ event_ids=event_ids,
+ book_key="fanduel",
+ )
+
+ logger.info(f" Found spread features for {len(line_features)} games")
+
+ # Add totals data (opening_total, closing_total)
+ logger.info(" Extracting totals data...")
+ totals_list = []
+ for event_id in event_ids:
+ try:
+ totals = db.get_canonical_totals(event_id=event_id, book_key="fanduel")
+ if len(totals) > 0:
+ opening_total = totals.iloc[0]["total"]
+ closing_total = totals.iloc[-1]["total"]
+ totals_list.append(
+ {
+ "event_id": event_id,
+ "opening_total": opening_total,
+ "closing_total": closing_total,
+ }
+ )
+ except Exception as e:
+ # Skip events without totals, but leave a trace for debugging
+ logger.debug(f" No totals for {event_id}: {e}")
+
+ if totals_list:
+ totals_df = pd.DataFrame(totals_list)
+ # Merge totals into line_features
+ line_features = line_features.merge(totals_df, on="event_id", how="left")
+ logger.info(f" Added totals for {len(totals_df)} games")
+ else:
+ logger.warning(" No totals data found")
+
+ # Calculate total movement (if totals data exists)
+ if "opening_total" in line_features.columns and "closing_total" in line_features.columns:
+ line_features["total_movement"] = (
+ line_features["closing_total"] - line_features["opening_total"]
+ )
+ logger.info(" Calculated total_movement feature")
+
+ # Add bias indicator features for meta-learner
+ # These are constant features that help the ensemble meta-learner
+ # learn systematic market biases (e.g., 68.1% underdog edge, 62.7% under edge)
+ if "closing_spread" in line_features.columns:
+ # is_underdog: Constant indicator for spread betting (meta-learner learns bias)
+ # Always 1 because underdog is ALWAYS available in spread markets
+ line_features["is_underdog"] = 1
+
+ # underdog_magnitude: Absolute spread value (how many points underdog gets)
+ line_features["underdog_magnitude"] = line_features["closing_spread"].abs()
+ logger.info(" Added spread bias indicators (is_underdog, underdog_magnitude)")
+
+ if "closing_total" in line_features.columns:
+ # is_under: Constant indicator for totals betting (meta-learner learns bias)
+ # Always 1 because under is ALWAYS available in totals markets
+ line_features["is_under"] = 1
+
+ # total_magnitude: Total value (market's expected combined score)
+ line_features["total_magnitude"] = line_features["closing_total"]
+ logger.info(" Added totals bias indicators (is_under, total_magnitude)")
+
+ # Add line movement features (Day 3-4)
+ # Velocity, divergence, and late movements capture +7.3% line movement edge
+ logger.info(" Extracting line movement features...")
+
+ # Spread velocity (points per hour movement rate)
+ try:
+ velocity_df = db.get_spread_velocity(event_ids=event_ids, book_key="fanduel")
+ if len(velocity_df) > 0:
+ line_features = line_features.merge(
+ velocity_df[["event_id", "spread_velocity", "steam_moves_count"]],
+ on="event_id",
+ how="left",
+ )
+ logger.info(f" Added spread_velocity for {len(velocity_df)} games")
+ except Exception as e:
+ logger.warning(f" Could not extract spread_velocity: {e}")
+
+ # Book divergence (sharp vs public books)
+ try:
+ divergence_df = db.get_book_divergence(event_ids=event_ids)
+ if len(divergence_df) > 0:
+ line_features = line_features.merge(
+ divergence_df[["event_id", "book_divergence", "has_disagreement"]],
+ on="event_id",
+ how="left",
+ )
+ logger.info(f" Added book_divergence for {len(divergence_df)} games")
+ except Exception as e:
+ logger.warning(f" Could not extract book_divergence: {e}")
+
+ # Last hour movement (final hour before game)
+ try:
+ last_hour_df = db.get_last_hour_movement(event_ids=event_ids, book_key="fanduel")
+ if len(last_hour_df) > 0:
+ line_features = line_features.merge(
+ last_hour_df[["event_id", "final_hour_movement", "movement_direction"]],
+ on="event_id",
+ how="left",
+ )
+ logger.info(f" Added final_hour_movement for {len(last_hour_df)} games")
+ except Exception as e:
+ logger.warning(f" Could not extract final_hour_movement: {e}")
+
+ # Convert datetime columns to Pacific timezone
+ if "opening_time" in line_features.columns and len(line_features) > 0:
+ logger.info(" Converting line feature timestamps to Pacific timezone...")
+ line_features["opening_time"] = parse_series(line_features["opening_time"])
+ line_features["opening_time_pacific"] = convert_series_to_pacific(
+ line_features["opening_time"]
+ )
+
+ if "closing_time" in line_features.columns and len(line_features) > 0:
+ line_features["closing_time"] = parse_series(line_features["closing_time"])
+ line_features["closing_time_pacific"] = convert_series_to_pacific(
+ line_features["closing_time"]
+ )
+
+ # Report coverage
+ coverage_pct = (len(line_features) / len(event_ids)) * 100 if event_ids else 0
+ logger.info(f" Feature coverage: {coverage_pct:.1f}% ({len(line_features)}/{len(event_ids)})")
+
+ if not dry_run:
+ # Write DataFrame directly using pandas (simpler than adapter for DataFrames)
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ line_features.to_parquet(output_path, index=False)
+ logger.info(f" Wrote {output_path}")
+
+ return line_features
+
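A constant indicator column carries no per-game signal on its own; paired with a magnitude column, it gives the ensemble meta-learner a bias-style weight for the market side. A toy sketch of the pattern (frame contents are illustrative, not the project schema):

```python
import pandas as pd

line_features = pd.DataFrame({"closing_spread": [-6.5, 3.0, -1.5]})

# Constant indicator: the underdog side always exists in spread markets,
# so the meta-learner can fit a systematic edge as a weight on this column.
line_features["is_underdog"] = 1
# Magnitude: how many points the underdog receives.
line_features["underdog_magnitude"] = line_features["closing_spread"].abs()

print(line_features["underdog_magnitude"].tolist())  # [6.5, 3.0, 1.5]
```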
+
+def consolidate_team_ratings(
+ kenpom_path: Path,
+ output_path: Path,
+ season: int = 2026,
+ team_mapping_path: Path | None = None,
+ dry_run: bool = False,
+) -> pd.DataFrame | None:
+ """Merge KenPom ratings, four factors, and height into unified team ratings.
+
+ Args:
+ kenpom_path: Path to KenPom data directory
+ output_path: Path to write team_ratings.parquet
+ season: Season year
+ team_mapping_path: Optional path to team_mapping.parquet for name alignment
+ dry_run: If True, skip writing file
+
+ Returns:
+ Team ratings DataFrame with odds_api_name for merging, or None if KenPom data missing
+ """
+ logger.info(f"Consolidating KenPom data for season {season}...")
+
+ # Load ratings
+ ratings_file = kenpom_path / "ratings" / "season" / f"ratings_{season}.parquet"
+ if not ratings_file.exists():
+ logger.warning(
+ f"KenPom ratings not found: {ratings_file} - "
+ "Skipping team ratings consolidation. "
+ "Run 'uv run python scripts/collect_kenpom_for_pipeline.py' to fetch ratings."
+ )
+ return None
+
+ ratings = read_parquet_df(str(ratings_file))
+ logger.info(f" Loaded {len(ratings)} teams from ratings")
+
+ # Load four factors
+ ff_file = kenpom_path / "four-factors" / "season" / f"four-factors_{season}.parquet"
+ if not ff_file.exists():
+ logger.warning(f"Four factors not found: {ff_file} - continuing without")
+ ff = pd.DataFrame()
+ else:
+ ff = read_parquet_df(str(ff_file))
+ logger.info(f" Loaded {len(ff)} teams from four factors")
+
+ # Load height data
+ height_file = kenpom_path / "height" / "season" / f"height_{season}.parquet"
+ if not height_file.exists():
+ logger.warning(f"Height data not found: {height_file} - continuing without")
+ height = pd.DataFrame()
+ else:
+ height = read_parquet_df(str(height_file))
+ logger.info(f" Loaded {len(height)} teams from height data")
+
+ # Merge ratings with four factors on TeamName
+ if not ff.empty:
+ team_ratings = ratings.merge(
+ ff,
+ on="TeamName",
+ how="left",
+ suffixes=("_rating", "_ff"),
+ )
+ # Use four-factors values for AdjOE, AdjDE, AdjTempo (more complete)
+ # and ratings values for AdjEM, Luck, SOS
+ rename_dict = {
+ "TeamName": "kenpom_name",
+ "AdjEM": "adj_em", # From ratings (no suffix)
+ "Pythag": "pythag", # From ratings (no suffix) - Pythagorean win expectation
+ "Luck": "luck", # From ratings (no suffix)
+ "SOS": "sos", # From ratings (no suffix)
+ "AdjOE_ff": "adj_o", # From four-factors
+ "AdjDE_ff": "adj_d", # From four-factors
+ "AdjTempo_ff": "adj_t", # From four-factors
+ "eFG_Pct": "efg_pct", # From four-factors
+ "TO_Pct": "to_pct", # From four-factors
+ "OR_Pct": "or_pct", # From four-factors
+ "FT_Rate": "ft_rate", # From four-factors
+ }
+ else:
+ team_ratings = ratings.copy()
+ # No merge, use ratings columns directly
+ rename_dict = {
+ "TeamName": "kenpom_name",
+ "AdjEM": "adj_em",
+ "Pythag": "pythag", # Pythagorean win expectation
+ "AdjOE": "adj_o",
+ "AdjDE": "adj_d",
+ "AdjTempo": "adj_t",
+ "Luck": "luck",
+ "SOS": "sos",
+ }
+
+ # Rename columns to match staging schema
+ team_ratings = team_ratings.rename(columns=rename_dict)
+
+ # Load team mapping to add odds_api_name for merging
+ if team_mapping_path and team_mapping_path.exists():
+ logger.info(f" Loading team mapping from {team_mapping_path}")
+ team_mapping = read_parquet_df(str(team_mapping_path))
+
+ # Merge with team mapping to get odds_api_name
+ team_ratings = team_ratings.merge(
+ team_mapping[["kenpom_name", "odds_api_name"]],
+ on="kenpom_name",
+ how="left",
+ )
+
+ mapped_count = team_ratings["odds_api_name"].notna().sum()
+ logger.info(f" Mapped {mapped_count}/{len(team_ratings)} teams to Odds API names")
+ else:
+ logger.warning(" No team mapping provided - direct name matching may fail")
+ team_ratings["odds_api_name"] = team_ratings["kenpom_name"]
+
+ # Merge with height data
+ if not height.empty:
+ team_ratings = team_ratings.merge(
+ height[["TeamName", "HgtEff"]], # Use effective height (minutes-weighted)
+ left_on="kenpom_name",
+ right_on="TeamName",
+ how="left",
+ )
+ team_ratings = team_ratings.drop(columns=["TeamName"], errors="ignore")
+ team_ratings = team_ratings.rename(columns={"HgtEff": "height_eff"})
+ logger.info(f" Merged height data: {team_ratings['height_eff'].notna().sum()} teams")
+ else:
+ # Add null column if height data missing
+ team_ratings["height_eff"] = None
+
+ # Merge with HCA data
+ hca_file = kenpom_path / "hca" / "season" / f"hca_{season}.parquet"
+ if hca_file.exists():
+ hca = read_parquet_df(str(hca_file))
+ logger.info(f" Loaded {len(hca)} teams from HCA data")
+ # HCA table uses "Team" column; merge Pts (points HCA) and Elev (elevation)
+ hca_cols = ["Team"]
+ if "HCA" in hca.columns:
+ hca_cols.append("HCA")
+ if "Pts" in hca.columns:
+ hca_cols.append("Pts")
+ if "Elev" in hca.columns:
+ hca_cols.append("Elev")
+ team_ratings = team_ratings.merge(
+ hca[hca_cols],
+ left_on="kenpom_name",
+ right_on="Team",
+ how="left",
+ )
+ team_ratings = team_ratings.drop(columns=["Team"], errors="ignore")
+ team_ratings = team_ratings.rename(
+ columns={"HCA": "hca", "Pts": "hca_pts", "Elev": "elevation"}
+ )
+ logger.info(f" Merged HCA data: {team_ratings['hca'].notna().sum()} teams")
+ else:
+ logger.warning(f" HCA data not found: {hca_file} - continuing without")
+ team_ratings["hca"] = None
+ team_ratings["hca_pts"] = None
+ team_ratings["elevation"] = None
+
+ # Add season column
+ team_ratings["season"] = season
+
+ # Select only needed columns (including odds_api_name for merging)
+ desired_cols = [
+ "kenpom_name",
+ "odds_api_name",
+ "adj_em",
+ "pythag",
+ "adj_o",
+ "adj_d",
+ "adj_t",
+ "luck",
+ "sos",
+ "efg_pct",
+ "to_pct",
+ "or_pct",
+ "ft_rate",
+ "height_eff",
+ "hca",
+ "hca_pts",
+ "elevation",
+ "season",
+ ]
+
+ # Only keep columns that actually exist
+ available_cols = [col for col in desired_cols if col in team_ratings.columns]
+ team_ratings = team_ratings[available_cols]
+
+ logger.info(
+ f" Consolidated {len(team_ratings)} teams with {len(team_ratings.columns)} features"
+ )
+
+ if not dry_run:
+ # Write DataFrame directly using pandas
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ team_ratings.to_parquet(output_path, index=False)
+ logger.info(f" Wrote {output_path}")
+
+ return team_ratings
+
+
+def consolidate_fanmatch_predictions(
+ events: pd.DataFrame,
+ kenpom_path: Path,
+ output_path: Path,
+ season: int = 2026,
+ team_mapping_path: Path | None = None,
+ dry_run: bool = False,
+) -> pd.DataFrame | None:
+ """Match KenPom FanMatch predictions to staging events.
+
+ Reads cached fanmatch_season_{season}.parquet and matches to staging events
+ via team name mapping. Computes kp_predicted_margin and kp_predicted_total.
+
+ Args:
+ events: Events DataFrame with event_id, home_team, away_team, game_date
+ kenpom_path: Path to KenPom data directory
+ output_path: Path to write fanmatch_predictions.parquet
+ season: Season year
+ team_mapping_path: Path to team_mapping.parquet for name alignment
+ dry_run: If True, skip writing file
+
+ Returns:
+ FanMatch predictions DataFrame or None if data unavailable
+ """
+ logger.info("Consolidating FanMatch predictions...")
+
+ fanmatch_file = kenpom_path / "fanmatch" / f"fanmatch_season_{season}.parquet"
+ if not fanmatch_file.exists():
+ logger.warning(
+ f"FanMatch data not found: {fanmatch_file} - "
+ "Skipping FanMatch consolidation. "
+ "Run market_vs_kenpom_season.py to collect FanMatch data."
+ )
+ return None
+
+ fanmatch = read_parquet_df(str(fanmatch_file))
+ logger.info(f" Loaded {len(fanmatch)} FanMatch predictions")
+
+ # Load team mapping (kenpom_name -> odds_api_name)
+ if team_mapping_path and team_mapping_path.exists():
+ team_mapping = read_parquet_df(str(team_mapping_path))
+ kp_to_odds = dict(
+ zip(
+ team_mapping["kenpom_name"],
+ team_mapping["odds_api_name"],
+ strict=False,
+ )
+ )
+ logger.info(f" Loaded {len(kp_to_odds)} team name mappings")
+ else:
+ logger.warning(" No team mapping - FanMatch matching will be unreliable")
+ return None
+
+ # Map KenPom team names to odds_api names in fanmatch data
+ fanmatch["home_team_mapped"] = fanmatch["kp_home"].map(kp_to_odds)
+ fanmatch["away_team_mapped"] = fanmatch["kp_visitor"].map(kp_to_odds)
+
+ # Ensure game_date is string for matching
+ fanmatch["game_date"] = pd.to_datetime(fanmatch["game_date"]).dt.date
+ events_copy = events[["event_id", "home_team", "away_team", "game_date"]].copy()
+ events_copy["game_date"] = pd.to_datetime(events_copy["game_date"]).dt.date
+
+ # Match fanmatch to events on (home_team, away_team, game_date)
+ matched = events_copy.merge(
+ fanmatch[
+ [
+ "game_date",
+ "home_team_mapped",
+ "away_team_mapped",
+ "kp_predicted_margin",
+ "kp_predicted_total",
+ "kp_home_wp",
+ ]
+ ],
+ left_on=["home_team", "away_team", "game_date"],
+ right_on=["home_team_mapped", "away_team_mapped", "game_date"],
+ how="inner",
+ )
+
+ # Clean up merge columns
+ matched = matched.drop(columns=["home_team_mapped", "away_team_mapped"], errors="ignore")
+
+ # Select final columns
+ result = matched[["event_id", "kp_predicted_margin", "kp_predicted_total", "kp_home_wp"]].copy()
+
+ # Deduplicate (in case of multiple matches)
+ result = result.drop_duplicates(subset="event_id")
+
+ match_pct = len(result) / len(events) * 100 if len(events) > 0 else 0
+ logger.info(
+ f" Matched {len(result)}/{len(events)} events ({match_pct:.1f}%) with FanMatch predictions"
+ )
+
+ if not dry_run and len(result) > 0:
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ result.to_parquet(output_path, index=False)
+ logger.info(f" Wrote {output_path}")
+
+ return result
+
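The matching above hinges on translating KenPom names into odds-API names before the inner join; a minimal sketch of the map-then-merge step (toy names and values, not the real mapping file):

```python
import pandas as pd

# Toy mapping; the real one comes from team_mapping.parquet.
kp_to_odds = {"Gonzaga": "Gonzaga Bulldogs"}

fanmatch = pd.DataFrame({"kp_home": ["Gonzaga", "Duke"],
                         "game_date": ["2026-02-06", "2026-02-06"],
                         "kp_predicted_margin": [7.5, -2.0]})
events = pd.DataFrame({"event_id": ["e1"],
                       "home_team": ["Gonzaga Bulldogs"],
                       "game_date": ["2026-02-06"]})

# Translate names, then inner-join on (team, date); unmapped rows drop out
# because their mapped name is NaN and NaN never matches a join key.
fanmatch["home_team"] = fanmatch["kp_home"].map(kp_to_odds)
matched = events.merge(fanmatch, on=["home_team", "game_date"], how="inner")
print(len(matched), matched["kp_predicted_margin"].iloc[0])  # 1 7.5
```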
+
+def enrich_with_espn_context(
+ events: pd.DataFrame,
+ dry_run: bool = False,
+) -> pd.DataFrame:
+ """Enrich events with ESPN schedule context (neutral site, conference).
+
+ Loads all ESPN schedule parquets and matches to events on
+ (game_date, home_team, away_team). ESPN team names match odds_api format.
+
+ Args:
+ events: Events DataFrame with home_team, away_team, game_date
+ dry_run: If True, skip file operations
+
+ Returns:
+ Events DataFrame with context columns added
+ """
+ import glob
+
+ logger.info("Enriching events with ESPN context...")
+
+ espn_dir = settings.espn_data_dir / "schedule"
+ espn_files = sorted(glob.glob(str(espn_dir / "espn_schedule_*.parquet")))
+
+ if not espn_files:
+ logger.warning(f" No ESPN schedule files in {espn_dir} - skipping")
+ events["neutral_site"] = False
+ events["same_conference"] = False
+ events["is_conference_tournament"] = False
+ return events
+
+ # Load and combine all ESPN schedule files
+ espn_dfs = [read_parquet_df(f) for f in espn_files]
+ espn = pd.concat(espn_dfs, ignore_index=True)
+ logger.info(f" Loaded {len(espn)} ESPN schedule records from {len(espn_files)} files")
+
+ # Normalize dates
+ espn["game_date"] = pd.to_datetime(espn["game_date"]).dt.date
+ events["game_date_dt"] = pd.to_datetime(events["game_date"]).dt.date
+
+ # Select ESPN columns for merge
+ espn_cols = espn[
+ ["game_date", "home_team", "away_team", "neutral_site", "conference_name"]
+ ].copy()
+ espn_cols = espn_cols.drop_duplicates(subset=["game_date", "home_team", "away_team"])
+
+ # Merge on (game_date, home_team, away_team)
+ merged = events.merge(
+ espn_cols,
+ left_on=["game_date_dt", "home_team", "away_team"],
+ right_on=["game_date", "home_team", "away_team"],
+ how="left",
+ suffixes=("", "_espn"),
+ )
+
+ # Fill neutral_site: default False if not matched
+ if "neutral_site" not in merged.columns:
+ merged["neutral_site"] = False
+ merged["neutral_site"] = merged["neutral_site"].fillna(False).astype(bool)
+
+ # Derive same_conference flag:
+ # conference_name is only populated for conference games
+ merged["same_conference"] = merged["conference_name"].notna()
+
+ # Derive is_conference_tournament (March 8-15 window)
+ merged["is_conference_tournament"] = merged["game_date_dt"].apply(
+ lambda d: d.month == 3 and 8 <= d.day <= 15 if d else False
+ )
+
+ # Clean up
+ drop_cols = ["game_date_dt", "game_date_espn", "conference_name"]
+ merged = merged.drop(columns=[c for c in drop_cols if c in merged.columns])
+
+ neutral_count = merged["neutral_site"].sum()
+ conf_count = merged["same_conference"].sum()
+ logger.info(f" Neutral site games: {neutral_count}")
+ logger.info(f" Conference games: {conf_count}")
+
+ return merged
+
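The neutral-site handling above relies on a left join followed by a NaN fill; a minimal sketch of that flag-enrichment pattern (toy team names):

```python
import pandas as pd

events = pd.DataFrame({"home_team": ["Duke", "UNC"]})
espn = pd.DataFrame({"home_team": ["Duke"], "neutral_site": [True]})

# Left join keeps every event; unmatched rows get NaN for the ESPN columns.
merged = events.merge(espn, on="home_team", how="left")
# Default unmatched games to False and restore a clean bool dtype.
merged["neutral_site"] = merged["neutral_site"].fillna(False).astype(bool)
print(merged["neutral_site"].tolist())  # [True, False]
```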
+
+def consolidate_action_network(
+ events: pd.DataFrame,
+ output_path: Path,
+ season: int = 2026,
+ dry_run: bool = False,
+) -> pd.DataFrame | None:
+ """Match Action Network betting features to staging events.
+
+ AN team names use the same format as odds_api_name, so direct matching
+ on (game_date, home_team, away_team) works without extra mapping.
+
+ Args:
+ events: Events DataFrame with event_id, home_team, away_team, game_date
+ output_path: Path to write an_features.parquet
+ season: Season year (used to find correct AN file)
+ dry_run: If True, skip writing file
+
+ Returns:
+ Action Network features DataFrame or None if unavailable
+ """
+ logger.info("Consolidating Action Network features...")
+
+ # AN file uses academic year start (2025 for 2025-26 season)
+ an_file = settings.action_network_data_dir / "features" / f"an_features_{season - 1}.parquet"
+ if not an_file.exists():
+ # Try current season naming convention too
+ an_file = settings.action_network_data_dir / "features" / f"an_features_{season}.parquet"
+ if not an_file.exists():
+ logger.warning("Action Network features not found - skipping")
+ return None
+
+ an_df = read_parquet_df(str(an_file))
+ logger.info(f" Loaded {len(an_df)} AN records")
+
+ # Normalize game_date for matching
+ an_df["game_date"] = pd.to_datetime(an_df["game_date"]).dt.date
+ events_copy = events[["event_id", "home_team", "away_team", "game_date"]].copy()
+ events_copy["game_date"] = pd.to_datetime(events_copy["game_date"]).dt.date
+
+ # Select useful AN columns for models
+ an_feature_cols = [
+ "game_date",
+ "home_team",
+ "away_team",
+ ]
+ # Add available feature columns
+ optional_cols = [
+ "spread_sharp_divergence",
+ "total_sharp_divergence",
+ "ml_sharp_divergence",
+ "spread_home_money_pct",
+ "spread_home_tickets_pct",
+ "total_over_money_pct",
+ "total_over_tickets_pct",
+ "num_bets",
+ "pinnacle_spread",
+ "pinnacle_total",
+ "consensus_spread",
+ "consensus_total",
+ "home_rank",
+ "away_rank",
+ ]
+ for col in optional_cols:
+ if col in an_df.columns:
+ an_feature_cols.append(col)
+
+ an_subset = an_df[an_feature_cols].copy()
+
+ # Match to events on (game_date, home_team, away_team)
+ matched = events_copy.merge(
+ an_subset,
+ on=["game_date", "home_team", "away_team"],
+ how="inner",
+ )
+
+ # Deduplicate
+ matched = matched.drop_duplicates(subset="event_id")
+
+ # Drop merge columns, keep event_id + features
+ result_cols = ["event_id"] + [c for c in matched.columns if c in optional_cols]
+ result = matched[result_cols].copy()
+
+ match_pct = len(result) / len(events) * 100 if len(events) > 0 else 0
+ logger.info(f" Matched {len(result)}/{len(events)} events ({match_pct:.1f}%) with AN features")
+
+ if not dry_run and len(result) > 0:
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ result.to_parquet(output_path, index=False)
+ logger.info(f" Wrote {output_path}")
+
+ return result
+
+
+def save_metadata(
+ staging_path: Path,
+ events_count: int,
+ scores_count: int,
+ features_count: int,
+ teams_count: int,
+ fanmatch_count: int = 0,
+ dry_run: bool = False,
+) -> None:
+ """Save staging metadata with build timestamp and coverage stats.
+
+ Args:
+ staging_path: Path to staging directory
+ events_count: Number of events in staging
+ scores_count: Number of events with scores
+ features_count: Number of events with line features
+ teams_count: Number of teams with ratings
+ fanmatch_count: Number of events with FanMatch predictions
+ dry_run: If True, skip writing file
+ """
+ # Use UTC timestamp without "T" and "Z" for metadata
+ built_at_utc = now_utc()
+
+ metadata = {
+ "built_at": built_at_utc.strftime("%Y-%m-%d %H:%M:%S UTC"),
+ "coverage": {
+ "events": events_count,
+ "with_scores": scores_count,
+ "with_line_features": features_count,
+ "teams_with_ratings": teams_count,
+ "with_fanmatch": fanmatch_count,
+ },
+ "feature_coverage_pct": (features_count / events_count * 100) if events_count > 0 else 0,
+ "score_coverage_pct": (scores_count / events_count * 100) if events_count > 0 else 0,
+ }
+
+ logger.info("\n=== Staging Metadata ===")
+ logger.info(f" Built at: {metadata['built_at']}")
+ logger.info(f" Events: {metadata['coverage']['events']}")
+ score_pct = metadata["score_coverage_pct"]
+ logger.info(f" With scores: {metadata['coverage']['with_scores']} ({score_pct:.1f}%)")
+ feature_pct = metadata["feature_coverage_pct"]
+ feature_count = metadata["coverage"]["with_line_features"]
+ logger.info(f" With features: {feature_count} ({feature_pct:.1f}%)")
+ logger.info(f" Teams: {metadata['coverage']['teams_with_ratings']}")
+
+ if not dry_run:
+ metadata_path = staging_path / "metadata.json"
+ with open(metadata_path, "w") as f:
+ json.dump(metadata, f, indent=2)
+ logger.info(f"\n[OK] Wrote metadata to {metadata_path}")
+
+
+def validate_staging(staging_path: Path) -> bool:
+ """Validate staging directory has all required files and reasonable data.
+
+ Args:
+ staging_path: Path to staging directory
+
+ Returns:
+ True if validation passes, False otherwise
+ """
+ logger.info("\n=== Validating Staging Data ===")
+
+ required_files = [
+ "events.parquet",
+ "line_features.parquet",
+ "metadata.json",
+ ]
+
+ optional_files = [
+ "team_ratings.parquet",
+ "fanmatch_predictions.parquet",
+ ]
+
+ all_valid = True
+
+ for filename in required_files:
+ filepath = staging_path / filename
+ if not filepath.exists():
+ logger.error(f" [ERROR] Missing file: {filename}")
+ all_valid = False
+ else:
+ logger.info(f" [OK] Found {filename}")
+
+ for filename in optional_files:
+ filepath = staging_path / filename
+ if not filepath.exists():
+ logger.warning(f" [WARNING] Optional file missing: {filename}")
+ else:
+ logger.info(f" [OK] Found {filename}")
+
+ if all_valid:
+ # Load and validate row counts
+ try:
+ events = read_parquet_df(str(staging_path / "events.parquet"))
+ line_features = read_parquet_df(str(staging_path / "line_features.parquet"))
+
+ logger.info("\nRow counts:")
+ logger.info(f" Events: {len(events)}")
+ logger.info(f" Line features: {len(line_features)}")
+
+ # Load team_ratings if available
+ team_ratings_path = staging_path / "team_ratings.parquet"
+ if team_ratings_path.exists():
+ team_ratings = read_parquet_df(str(team_ratings_path))
+ logger.info(f" Team ratings: {len(team_ratings)}")
+
+ if len(team_ratings) < 300:
+ logger.warning(f" [WARNING] Only {len(team_ratings)} teams - expected ~350")
+ else:
+ logger.info(" Team ratings: 0 (not available)")
+
+ # Validation checks
+ if len(line_features) > len(events):
+ logger.warning(" [WARNING] More line features than events - unexpected")
+
+ if len(events) == 0:
+ logger.error(" [ERROR] No events in staging")
+ all_valid = False
+
+ except Exception as e:
+ logger.error(f" [ERROR] Failed to validate staging data: {e}")
+ all_valid = False
+
+ if all_valid:
+ logger.info("\n[OK] Staging validation passed")
+ else:
+ logger.error("\n[ERROR] Staging validation failed")
+
+ return all_valid
+
+
+def main() -> None:
+ """Consolidate raw data into staging layer."""
+ parser = argparse.ArgumentParser(description="Consolidate raw data into ML-ready staging layer")
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Validate only, do not write files",
+ )
+ parser.add_argument(
+ "--force",
+ action="store_true",
+ help="Force rebuild even if staging is recent",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=settings.odds_api_db_path,
+ help="Path to Odds API SQLite database",
+ )
+ parser.add_argument(
+ "--kenpom-path",
+ type=Path,
+ default=settings.kenpom_api_data_dir,
+ help="Path to KenPom data directory",
+ )
+ parser.add_argument(
+ "--staging-path",
+ type=Path,
+ default=settings.staging_dir,
+ help="Path to staging output directory",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+ )
+
+ args = parser.parse_args()
+
+ logger.info("=" * 80)
+ logger.info("STAGING LAYER CONSOLIDATION")
+ logger.info("=" * 80)
+
+ # Check if rebuild needed
+ if not args.force and not args.dry_run and not check_if_stale(args.staging_path):
+ logger.info("\n[OK] Staging is recent, no rebuild needed (use --force to rebuild anyway)")
+ return
+
+ # Create staging directory
+ args.staging_path.mkdir(parents=True, exist_ok=True)
+
+ # Initialize database connection
+ db = OddsAPIDatabase(args.db_path)
+
+ try:
+ # Step 1: Consolidate events with scores
+ events = consolidate_events(
+ db=db,
+ output_path=args.staging_path / "events.parquet",
+ dry_run=args.dry_run,
+ )
+
+ # Step 2: Consolidate line features
+ event_ids = events["event_id"].tolist()
+ line_features = consolidate_line_features(
+ db=db,
+ event_ids=event_ids,
+ output_path=args.staging_path / "line_features.parquet",
+ dry_run=args.dry_run,
+ )
+
+ # Step 3: Consolidate team ratings (with team mapping)
+ team_mapping_path = settings.team_mapping_path
+ team_ratings = consolidate_team_ratings(
+ kenpom_path=args.kenpom_path,
+ output_path=args.staging_path / "team_ratings.parquet",
+ season=args.season,
+ team_mapping_path=team_mapping_path if team_mapping_path.exists() else None,
+ dry_run=args.dry_run,
+ )
+
+ # Step 4: Consolidate FanMatch predictions (reuses team mapping from Step 3)
+ fanmatch_preds = consolidate_fanmatch_predictions(
+ events=events,
+ kenpom_path=args.kenpom_path,
+ output_path=args.staging_path / "fanmatch_predictions.parquet",
+ season=args.season,
+ team_mapping_path=team_mapping_path if team_mapping_path.exists() else None,
+ dry_run=args.dry_run,
+ )
+
+ # Step 5: Consolidate Action Network features
+ consolidate_action_network(
+ events=events,
+ output_path=args.staging_path / "an_features.parquet",
+ season=args.season,
+ dry_run=args.dry_run,
+ )
+
+ # Step 6: Save metadata
+ save_metadata(
+ staging_path=args.staging_path,
+ events_count=len(events),
+ scores_count=len(events[events["home_score"].notna()]),
+ features_count=len(line_features),
+ teams_count=len(team_ratings) if team_ratings is not None else 0,
+ fanmatch_count=(len(fanmatch_preds) if fanmatch_preds is not None else 0),
+ dry_run=args.dry_run,
+ )
+
+ # Step 7: Validate
+ if not args.dry_run:
+ if validate_staging(args.staging_path):
+ logger.info("\n[OK] Staging consolidation complete!")
+ else:
+ logger.error("\n[ERROR] Staging validation failed - check logs above")
+ exit(1)
+ else:
+ logger.info("\n[DRY RUN] Would have consolidated staging data")
+
+ finally:
+ db.close()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/dedupe_opening_lines.py b/scripts/processing/dedupe_opening_lines.py
new file mode 100644
index 000000000..85f58b606
--- /dev/null
+++ b/scripts/processing/dedupe_opening_lines.py
@@ -0,0 +1,254 @@
+#!/usr/bin/env python3
+"""Deduplicate opening lines CSV while preserving line movements.
+
+Handles cases where same matchup appears in multiple categories:
+1. Exact duplicates (same time, same values) → Keep one
+2. Line movements (different values) → Keep all (these are valuable!)
+3. Same matchup, different capture times → Keep earliest as "true opening"
+
+Usage:
+ uv run python scripts/processing/dedupe_opening_lines.py \\
+ data/ncaab_opening_line_20260206.csv
+ uv run python scripts/processing/dedupe_opening_lines.py \\
+ data/ncaab_opening_line_20260206.csv --output data/deduped.csv
+ uv run python scripts/processing/dedupe_opening_lines.py \\
+ data/ncaab_opening_line_20260206.csv --keep-movements
+ uv run python scripts/processing/dedupe_opening_lines.py \\
+ data/ncaab_opening_line_20260206.csv --prefer-category "College Basketball"
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import write_csv
+
+logger = logging.getLogger(__name__)
+
+
+def has_line_movement(group: pd.DataFrame) -> bool:
+ """Check if a matchup group has any line movements.
+
+ Args:
+ group: DataFrame rows for same matchup
+
+ Returns:
+ True if spread, total, or prices changed
+ """
+ if len(group) <= 1:
+ return False
+
+ # Check if any betting values changed
+ spread_changed = group["spread_magnitude"].nunique() > 1
+ total_changed = group["total_points"].nunique() > 1
+ fav_price_changed = group["spread_favorite_price"].nunique() > 1
+ dog_price_changed = group["spread_underdog_price"].nunique() > 1
+ over_price_changed = group["total_over_price"].nunique() > 1
+ under_price_changed = group["total_under_price"].nunique() > 1
+
+ return (
+ spread_changed
+ or total_changed
+ or fav_price_changed
+ or dog_price_changed
+ or over_price_changed
+ or under_price_changed
+ )
+
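The nunique-based movement check can be exercised on toy captures; two rows for the same matchup count as movement only if some betting value actually changed:

```python
import pandas as pd

cols = ["spread_magnitude", "total_points", "spread_favorite_price",
        "spread_underdog_price", "total_over_price", "total_under_price"]

# Two captures of one matchup: the spread moved 6.5 -> 7.0, prices held.
moved_group = pd.DataFrame([[6.5, 145.5, -110, -110, -110, -110],
                            [7.0, 145.5, -110, -110, -110, -110]], columns=cols)
# Exact duplicate capture: nothing changed.
dupe_group = pd.DataFrame([[6.5, 145.5, -110, -110, -110, -110]] * 2, columns=cols)

def moved(group: pd.DataFrame) -> bool:
    # Any column with more than one distinct value means the line moved.
    return any(group[c].nunique() > 1 for c in cols)

print(moved(moved_group), moved(dupe_group))  # True False
```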
+
+def dedupe_opening_lines(
+ df: pd.DataFrame,
+ keep_movements: bool = True,
+ prefer_category: str | None = None,
+) -> pd.DataFrame:
+ """Deduplicate opening lines while preserving line movements.
+
+ Args:
+ df: Opening lines DataFrame
+ keep_movements: If True, keep all rows where lines changed (default: True)
+ prefer_category: When deduping exact duplicates, prefer this category
+ ("College Basketball" or "College Extra")
+
+ Returns:
+ Deduplicated DataFrame
+ """
+ df = df.copy()
+ df["opened_at_dt"] = pd.to_datetime(df["opened_at"])
+ df["matchup_key"] = df["away_team"] + " @ " + df["home_team"]
+
+ # Track deduplication stats
+ original_count = len(df)
+ exact_duplicates_removed = 0
+ movements_kept = 0
+
+ result_rows = []
+
+ for matchup, group in df.groupby("matchup_key"):
+ if len(group) == 1:
+ # No duplicates for this matchup
+ result_rows.append(group.iloc[0])
+ continue
+
+ # Check for line movements
+ if keep_movements and has_line_movement(group):
+ # Keep all rows - these represent line movements over time
+ result_rows.extend([row for _, row in group.iterrows()])
+ movements_kept += len(group)
+ logger.debug("Kept %d rows for %s (line movement detected)", len(group), matchup)
+ continue
+
+ # Exact duplicates - keep one based on strategy
+ if prefer_category and prefer_category in group["category"].values:
+ # Prefer specified category
+ preferred = group[group["category"] == prefer_category].iloc[0]
+ result_rows.append(preferred)
+ exact_duplicates_removed += len(group) - 1
+ logger.debug(
+ "Kept %s category for %s (removed %d duplicates)",
+ prefer_category,
+ matchup,
+ len(group) - 1,
+ )
+ else:
+ # Keep earliest capture
+ earliest = group.loc[group["opened_at_dt"].idxmin()]
+ result_rows.append(earliest)
+ exact_duplicates_removed += len(group) - 1
+ logger.debug(
+ "Kept earliest capture for %s (removed %d duplicates)",
+ matchup,
+ len(group) - 1,
+ )
+
+ result_df = pd.DataFrame(result_rows)
+ result_df = result_df.drop(columns=["opened_at_dt", "matchup_key"])
+ result_df = result_df.sort_values("opened_at", ascending=False).reset_index(drop=True)
+
+ # Log summary
+ final_count = len(result_df)
+ logger.info("Deduplication complete:")
+ logger.info(" Original rows: %d", original_count)
+ logger.info(" Final rows: %d", final_count)
+ logger.info(" Exact duplicates removed: %d", exact_duplicates_removed)
+ if keep_movements:
+ logger.info(" Line movements kept: %d rows", movements_kept)
+ logger.info(
+ " Reduction: %d rows (%.1f%%)",
+ original_count - final_count,
+ 100 * (original_count - final_count) / original_count,
+ )
+
+ return result_df
+
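The per-matchup decision rule above (keep every row when a betting value moved, otherwise keep only the earliest capture) can be sketched without pandas; the field names mirror the DataFrame columns, but this function is a simplified stand-in, not the implementation:

```python
from datetime import datetime

def dedupe_group(rows: list[dict]) -> list[dict]:
    """Toy per-matchup rule: keep all rows on line movement, else earliest."""
    if len(rows) <= 1:
        return rows
    # Any differing value across the group counts as a line movement
    moved = any(
        len({r[k] for r in rows}) > 1
        for k in ("spread_magnitude", "total_points")
    )
    if moved:
        return rows  # preserve the full movement history
    # Exact duplicates: keep the earliest capture only
    return [min(rows, key=lambda r: datetime.fromisoformat(r["opened_at"]))]

rows = [
    {"opened_at": "2025-01-02 10:00:00", "spread_magnitude": 3.5, "total_points": 145.5},
    {"opened_at": "2025-01-01 09:00:00", "spread_magnitude": 3.5, "total_points": 145.5},
]
kept = dedupe_group(rows)  # exact duplicates -> one row, the earlier capture
```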
+
+def print_dedup_summary(original_df: pd.DataFrame, deduped_df: pd.DataFrame) -> None:
+    """Print summary of deduplication results.
+
+    Args:
+        original_df: Original DataFrame
+        deduped_df: Deduplicated DataFrame
+    """
+    # Work on copies so the caller's DataFrames are not mutated
+    original_df = original_df.copy()
+    deduped_df = deduped_df.copy()
+    original_df["matchup_key"] = original_df["away_team"] + " @ " + original_df["home_team"]
+    deduped_df["matchup_key"] = deduped_df["away_team"] + " @ " + deduped_df["home_team"]
+
+ print("\n=== Deduplication Summary ===")
+ print(f"Original rows: {len(original_df)}")
+ print(f"Deduplicated rows: {len(deduped_df)}")
+ print(f"Removed: {len(original_df) - len(deduped_df)} rows")
+ print()
+ print(f"Original unique matchups: {original_df['matchup_key'].nunique()}")
+ print(f"Final unique matchups: {deduped_df['matchup_key'].nunique()}")
+ print()
+
+ # Find matchups with multiple rows in deduped (line movements)
+ movements = deduped_df[deduped_df["matchup_key"].duplicated(keep=False)]
+ if len(movements) > 0:
+ print(f"Matchups with line movements preserved: {movements['matchup_key'].nunique()}")
+ print("\nSample line movements:")
+ print("-" * 80)
+ for matchup in movements["matchup_key"].unique()[:3]:
+ games = movements[movements["matchup_key"] == matchup].sort_values("opened_at")
+ print(f"\n{matchup}")
+ for _, row in games.iterrows():
+ print(
+ f" {row['opened_at']} | {row['category']:20s} | "
+ f"Spread: {row['spread_magnitude']:4.1f} | Total: {row['total_points']:5.1f}"
+ )
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="Deduplicate opening lines CSV while preserving line movements"
+ )
+ parser.add_argument("input_file", type=Path, help="Input CSV file path")
+ parser.add_argument(
+ "--output",
+ "-o",
+ type=Path,
+ help="Output CSV file path (default: input_file with _deduped suffix)",
+ )
+ parser.add_argument(
+ "--keep-movements",
+ action="store_true",
+ default=True,
+ help="Keep all rows where lines changed (default: True)",
+ )
+ parser.add_argument(
+ "--no-keep-movements",
+ action="store_false",
+ dest="keep_movements",
+ help="Remove all duplicates, even if lines changed",
+ )
+ parser.add_argument(
+ "--prefer-category",
+ choices=["College Basketball", "College Extra"],
+ help="When deduping exact duplicates, prefer this category",
+ )
+ parser.add_argument("--verbose", "-v", action="store_true", help="Enable debug logging")
+
+ args = parser.parse_args()
+
+ # Configure logging
+ logging.basicConfig(
+ level=logging.DEBUG if args.verbose else logging.INFO,
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+ )
+
+ # Read input
+    if not args.input_file.exists():
+        logger.error("Input file not found: %s", args.input_file)
+        raise SystemExit(1)
+
+ df = pd.read_csv(args.input_file)
+ logger.info("Loaded %d rows from %s", len(df), args.input_file)
+
+ # Deduplicate
+ deduped_df = dedupe_opening_lines(
+ df,
+ keep_movements=args.keep_movements,
+ prefer_category=args.prefer_category,
+ )
+
+ # Determine output path
+ if args.output:
+ output_path = args.output
+ else:
+ stem = args.input_file.stem
+ suffix = args.input_file.suffix
+ output_path = args.input_file.parent / f"{stem}_deduped{suffix}"
+
+ # Write output
+ write_csv(deduped_df, output_path, index=False)
+ logger.info("Wrote %d rows to %s", len(deduped_df), output_path)
+
+ # Print summary
+ print_dedup_summary(df, deduped_df)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/deduplicate_events.py b/scripts/processing/deduplicate_events.py
new file mode 100644
index 000000000..2cd410ed5
--- /dev/null
+++ b/scripts/processing/deduplicate_events.py
@@ -0,0 +1,316 @@
+"""Deduplicate events table and consolidate observations.
+
+This script fixes the event duplication issue where the same game exists
+under multiple event_ids due to:
+1. Team name variations across data sources
+2. Multiple collection sources (ESPN, Odds API)
+3. Datetime format inconsistencies
+
+Strategy:
+1. Group events by normalized (home_team, away_team, game_date)
+2. Select canonical event_id (prefer ones with observations)
+3. Migrate observations to canonical event_id
+4. Delete duplicate events
+5. Update foreign keys in related tables
+
+Usage:
+ uv run python scripts/processing/deduplicate_events.py --dry-run
+ uv run python scripts/processing/deduplicate_events.py # Execute
+"""
+
+from __future__ import annotations
+
+import argparse
+import hashlib
+import logging
+from collections import defaultdict
+from datetime import datetime
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+_TEAM_MAPPING_DF: pd.DataFrame | None = None
+_TEAM_MAPPING_LOADED = False
+
+
+def normalize_team_name(name: str) -> str:
+    """Normalize team name for consistent matching.
+
+    The mapping table is read once and cached at module level so that
+    repeated calls do not re-read the parquet file.
+
+    Args:
+        name: Raw team name from any source
+
+    Returns:
+        Normalized team name
+    """
+    global _TEAM_MAPPING_DF, _TEAM_MAPPING_LOADED
+    if not _TEAM_MAPPING_LOADED:
+        _TEAM_MAPPING_LOADED = True
+        mapping_path = Path("data/staging/mappings/team_mapping.parquet")
+        if mapping_path.exists():
+            try:
+                _TEAM_MAPPING_DF = read_parquet_df(str(mapping_path))
+            except Exception as e:
+                logger.warning(f"Could not load team mapping: {e}")
+
+    if _TEAM_MAPPING_DF is not None:
+        # Try to find this team in any source column
+        for col in ["odds_api_name", "espn_name", "kenpom_name"]:
+            if col in _TEAM_MAPPING_DF.columns:
+                match = _TEAM_MAPPING_DF[_TEAM_MAPPING_DF[col] == name]
+                if len(match) > 0:
+                    # Return canonical name (odds_api_name)
+                    return str(match.iloc[0]["odds_api_name"])
+
+    # Fallback: basic normalization
+    return name.strip()
+
+
+def normalize_datetime(dt_str: str) -> str:
+ """Normalize datetime string to consistent format.
+
+ Args:
+ dt_str: Datetime string in any format (ISO8601, etc.)
+
+ Returns:
+ Normalized datetime string: YYYY-MM-DD HH:MM:SS
+ """
+ # Remove T and Z
+ dt_str = dt_str.replace("T", " ").replace("Z", "").strip()
+
+ # Parse and reformat
+ try:
+ dt = datetime.fromisoformat(dt_str)
+ return dt.strftime("%Y-%m-%d %H:%M:%S")
+ except Exception:
+ # If parsing fails, return as-is
+ return dt_str
+
+
+def generate_canonical_key(home_team: str, away_team: str, commence_time: str) -> str:
+ """Generate canonical key for grouping duplicate events.
+
+ Args:
+ home_team: Home team name
+ away_team: Away team name
+ commence_time: Game start time
+
+ Returns:
+ Canonical key (hash of normalized values)
+ """
+ # Normalize teams
+ home_norm = normalize_team_name(home_team)
+ away_norm = normalize_team_name(away_team)
+
+ # Normalize datetime to date only (games on same day are same game)
+ dt_norm = normalize_datetime(commence_time)
+ game_date = dt_norm.split(" ")[0] # Extract date part
+
+ # Sort teams alphabetically (home/away designation might differ)
+ teams_sorted = tuple(sorted([home_norm, away_norm]))
+
+ # Create canonical key
+ key_str = f"{teams_sorted[0]}|{teams_sorted[1]}|{game_date}"
+ return hashlib.sha256(key_str.encode()).hexdigest()[:16]
+
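Because the team pair is sorted and the time is reduced to a date, the key is stable under home/away swaps and time-of-day differences. A minimal standalone sketch, with trivial `.strip()` normalization in place of the mapping lookup:

```python
import hashlib
from datetime import datetime

def canonical_key(home: str, away: str, commence_time: str) -> str:
    """Stable key: the same game maps to one key regardless of home/away order."""
    game_date = datetime.fromisoformat(
        commence_time.replace("T", " ").replace("Z", "").strip()
    ).strftime("%Y-%m-%d")
    a, b = sorted([home.strip(), away.strip()])
    return hashlib.sha256(f"{a}|{b}|{game_date}".encode()).hexdigest()[:16]

# Swapped home/away and a different time of day collapse to one key
k1 = canonical_key("Duke", "Kansas", "2025-01-15T19:00:00Z")
k2 = canonical_key("Kansas", "Duke", "2025-01-15 21:30:00")
```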
+
+def find_duplicates(db: OddsAPIDatabase) -> dict[str, list[str]]:
+ """Find duplicate events grouped by canonical key.
+
+ Args:
+ db: Database connection
+
+ Returns:
+ Dict mapping canonical_key -> list of event_ids
+ """
+ query = """
+ SELECT
+ event_id,
+ home_team,
+ away_team,
+ commence_time,
+ source
+ FROM events
+ """
+
+ events_df = pd.read_sql_query(query, db.conn)
+ logger.info(f"Loaded {len(events_df)} total events")
+
+ # Group by canonical key
+ groups: dict[str, list[str]] = defaultdict(list)
+
+ for _, row in events_df.iterrows():
+ canonical_key = generate_canonical_key(
+ row["home_team"], row["away_team"], row["commence_time"]
+ )
+ groups[canonical_key].append(row["event_id"])
+
+ # Filter to only duplicates
+ duplicates = {k: v for k, v in groups.items() if len(v) > 1}
+
+ logger.info(f"Found {len(duplicates)} duplicate event groups")
+ logger.info(f"Total duplicate events: {sum(len(v) for v in duplicates.values())}")
+
+ return duplicates
+
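The grouping step reduces to a `defaultdict` keyed by the canonical tuple; a toy version with hypothetical event IDs:

```python
from collections import defaultdict

events = [
    ("evt-1", "Duke", "Kansas", "2025-01-15"),
    ("evt-2", "Kansas", "Duke", "2025-01-15"),  # same game, home/away flipped
    ("evt-3", "UCLA", "USC", "2025-01-16"),
]

groups: dict[tuple[str, str, str], list[str]] = defaultdict(list)
for event_id, home, away, date in events:
    key = (*sorted((home, away)), date)
    groups[key].append(event_id)

# Only groups with more than one event_id are duplicates
duplicates = {k: v for k, v in groups.items() if len(v) > 1}
```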
+
+def select_canonical_event(event_ids: list[str], db: OddsAPIDatabase) -> str:
+ """Select the canonical event_id from a group of duplicates.
+
+ Priority:
+ 1. Event with most observations
+ 2. Event with source='odds_api' (UUID format)
+ 3. Event with source='espn'
+ 4. Alphabetically first
+
+ Args:
+ event_ids: List of duplicate event IDs
+ db: Database connection
+
+ Returns:
+ Canonical event_id
+ """
+    # Get observation counts (parameterized query; never interpolate IDs)
+    obs_counts: dict[str, int] = {}
+    for eid in event_ids:
+        query = "SELECT COUNT(*) AS cnt FROM observations WHERE event_id = ?"
+        result = pd.read_sql_query(query, db.conn, params=(eid,))
+        obs_counts[eid] = int(result.iloc[0]["cnt"])
+
+ # Select event with most observations
+ if max(obs_counts.values()) > 0:
+        canonical = max(obs_counts, key=lambda eid: obs_counts[eid])
+ logger.debug(f"Selected {canonical} (has {obs_counts[canonical]} observations)")
+ return canonical
+
+ # Fallback: prefer odds_api source, then espn, then alphabetically
+ query = f"""
+ SELECT event_id, source
+ FROM events
+ WHERE event_id IN ({",".join(["?"] * len(event_ids))})
+ """
+ events_df = pd.read_sql_query(query, db.conn, params=event_ids)
+
+    # Prefer odds_api (sorted so the choice is deterministic)
+    odds_api_events = sorted(events_df[events_df["source"] == "odds_api"]["event_id"].tolist())
+    if odds_api_events:
+        return odds_api_events[0]
+
+    # Prefer espn
+    espn_events = sorted(events_df[events_df["source"] == "espn"]["event_id"].tolist())
+    if espn_events:
+        return espn_events[0]
+
+ # Fallback: alphabetically first
+ return sorted(event_ids)[0]
+
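The four-step priority can be collapsed into one comparison key; this sketch approximates the selection above (ties within a source rank fall back to alphabetical order, matching the final fallback):

```python
def pick_canonical(candidates: list[dict]) -> str:
    """Toy priority: most observations; else odds_api > espn > other, then A-Z."""
    if any(c["obs_count"] > 0 for c in candidates):
        return max(candidates, key=lambda c: c["obs_count"])["event_id"]
    source_rank = {"odds_api": 0, "espn": 1}
    return min(
        candidates,
        key=lambda c: (source_rank.get(c["source"], 2), c["event_id"]),
    )["event_id"]

# With no observations anywhere, the odds_api event wins
picked = pick_canonical([
    {"event_id": "b-espn", "source": "espn", "obs_count": 0},
    {"event_id": "a-odds", "source": "odds_api", "obs_count": 0},
])
```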
+
+def deduplicate_events(db: OddsAPIDatabase, dry_run: bool = False) -> None:
+ """Deduplicate events and consolidate observations.
+
+ Args:
+ db: Database connection
+ dry_run: If True, only show what would be done
+ """
+ logger.info("Starting event deduplication...")
+
+ # Find duplicates
+ duplicates = find_duplicates(db)
+
+ if len(duplicates) == 0:
+ logger.info("No duplicates found!")
+ return
+
+ # Process each group
+ total_merged = 0
+ total_deleted = 0
+
+ for canonical_key, event_ids in duplicates.items():
+ # Select canonical event
+ canonical_id = select_canonical_event(event_ids, db)
+ other_ids = [eid for eid in event_ids if eid != canonical_id]
+
+ logger.info(f"\nCanonical key: {canonical_key}")
+ logger.info(f" Canonical event: {canonical_id}")
+ logger.info(f" Duplicate events: {other_ids}")
+
+ if dry_run:
+ logger.info(" [DRY RUN] Would merge observations and delete duplicates")
+ continue
+
+ # Migrate observations from duplicates to canonical
+ for dup_id in other_ids:
+ query = """
+ UPDATE observations
+ SET event_id = ?
+ WHERE event_id = ?
+ """
+ db.conn.execute(query, (canonical_id, dup_id))
+ logger.info(f" Migrated observations from {dup_id} to {canonical_id}")
+
+ # Migrate scores from duplicates to canonical
+ for dup_id in other_ids:
+ # Check if canonical already has scores
+ check_query = "SELECT COUNT(*) FROM scores WHERE event_id = ?"
+ has_scores = db.conn.execute(check_query, (canonical_id,)).fetchone()[0] > 0
+
+ if not has_scores:
+ # Migrate scores from duplicate
+ query = """
+ UPDATE scores
+ SET event_id = ?
+ WHERE event_id = ?
+ """
+ db.conn.execute(query, (canonical_id, dup_id))
+ logger.info(f" Migrated scores from {dup_id} to {canonical_id}")
+ else:
+ # Delete duplicate scores (canonical already has them)
+ query = "DELETE FROM scores WHERE event_id = ?"
+ db.conn.execute(query, (dup_id,))
+ logger.info(f" Deleted duplicate scores for {dup_id}")
+
+ # Delete duplicate events
+ for dup_id in other_ids:
+ query = "DELETE FROM events WHERE event_id = ?"
+ db.conn.execute(query, (dup_id,))
+ total_deleted += 1
+
+ db.conn.commit()
+ total_merged += len(other_ids)
+
+ logger.info("\n=== SUMMARY ===")
+ logger.info(f"Duplicate groups processed: {len(duplicates)}")
+ logger.info(f"Events merged: {total_merged}")
+ logger.info(f"Events deleted: {total_deleted}")
+
+ if not dry_run:
+ # Vacuum to reclaim space
+ logger.info("Running VACUUM to reclaim space...")
+ db.conn.execute("VACUUM")
+ logger.info("[OK] Deduplication complete!")
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(description="Deduplicate events table")
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be done without making changes",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=str,
+ default="data/odds_api/odds_api.sqlite3",
+ help="Path to SQLite database",
+ )
+
+ args = parser.parse_args()
+
+ # Connect to database
+ db = OddsAPIDatabase(args.db_path)
+
+ # Run deduplication
+ deduplicate_events(db, dry_run=args.dry_run)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/map_espn_teams.py b/scripts/processing/map_espn_teams.py
new file mode 100644
index 000000000..ad07b0f37
--- /dev/null
+++ b/scripts/processing/map_espn_teams.py
@@ -0,0 +1,278 @@
+"""Map ESPN team names to canonical team mapping.
+
+This script matches ESPN team data to the canonical KenPom-based team mapping
+table using fuzzy string matching and manual mappings for edge cases.
+
+Usage:
+    uv run python scripts/processing/map_espn_teams.py
+
+Output:
+ Updates data/processed/team_mapping.parquet with ESPN fields
+"""
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+from thefuzz import fuzz
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_parquet
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+# Manual mappings for teams with name variations
+MANUAL_ESPN_MAPPINGS = {
+ # ESPN display_name -> KenPom TeamName
+ "American University Eagles": "American",
+ "Arizona State Sun Devils": "Arizona St.",
+ "Arkansas Razorbacks": "Arkansas",
+ "Bellarmine Knights": "Bellarmine",
+ "Boise State Broncos": "Boise St.",
+ "Bradley Braves": "Bradley",
+ "Cal Poly Mustangs": "Cal Poly",
+ "California Golden Bears": "California",
+ "Central Florida Knights": "UCF",
+ "Colorado Buffaloes": "Colorado",
+ "Colorado State Rams": "Colorado St.",
+ "Delaware Blue Hens": "Delaware",
+ "Florida A&M Rattlers": "Florida A&M",
+ "Florida Gators": "Florida",
+ "Florida State Seminoles": "Florida St.",
+ "George Washington Revolutionaries": "George Washington",
+ "Georgetown Hoyas": "Georgetown",
+ "Georgia Tech Yellow Jackets": "Georgia Tech",
+ "Hawai'i Rainbow Warriors": "Hawaii",
+ "Howard Bison": "Howard",
+ "Idaho Vandals": "Idaho",
+ "Indiana Hoosiers": "Indiana",
+ "Iowa State Cyclones": "Iowa St.",
+ "IU Indianapolis Jaguars": "IUPUI",
+ "Jacksonville State Gamecocks": "Jacksonville St.",
+ "Louisville Cardinals": "Louisville",
+ "LSU Tigers": "Louisiana St.",
+ "Miami Hurricanes": "Miami FL",
+ "Murray State Racers": "Murray St.",
+ "NC State Wolfpack": "N.C. State",
+ "Notre Dame Fighting Irish": "Notre Dame",
+ "Ole Miss Rebels": "Mississippi",
+ "Pitt Panthers": "Pittsburgh",
+ "Sacramento State Hornets": "Sacramento St.",
+ "Saint Joseph's Hawks": "Saint Joseph's",
+ "Saint Louis Billikens": "Saint Louis",
+ "San Diego State Aztecs": "San Diego St.",
+ "San José State Spartans": "San Jose St.",
+ "SMU Mustangs": "SMU",
+ "South Alabama Jaguars": "South Alabama",
+ "South Florida Bulls": "South Florida",
+ "Southern California Trojans": "USC",
+ "Southern Illinois Salukis": "Southern Illinois",
+ "Stanford Cardinal": "Stanford",
+ "Stetson Hatters": "Stetson",
+ "TCU Horned Frogs": "TCU",
+ "UAB Blazers": "UAB",
+ "UC Riverside Highlanders": "UC Riverside",
+ "UC San Diego Tritons": "UC San Diego",
+ "UCF Knights": "UCF",
+ "UCLA Bruins": "UCLA",
+ "UConn Huskies": "Connecticut",
+ "UIC Flames": "UIC",
+ "UNLV Rebels": "UNLV",
+ "USC Trojans": "USC",
+ "VCU Rams": "VCU",
+ "Virginia Tech Hokies": "Virginia Tech",
+ "Western Kentucky Hilltoppers": "Western Kentucky",
+}
+
+
+def load_team_mapping() -> pd.DataFrame:
+ """Load the canonical team mapping table.
+
+ Returns:
+ Team mapping DataFrame
+ """
+ mapping_path = Path("data/processed/team_mapping.parquet")
+ if not mapping_path.exists():
+ raise FileNotFoundError(
+ f"Team mapping not found: {mapping_path}. Run build_team_mapping.py first."
+ )
+
+ df = read_parquet_df(str(mapping_path))
+ logger.info(f"Loaded team mapping with {len(df)} teams")
+ return df
+
+
+def load_espn_teams(season: int = 2026) -> pd.DataFrame:
+ """Load ESPN team data.
+
+ Args:
+ season: The season year
+
+ Returns:
+ ESPN teams DataFrame
+ """
+ espn_path = Path(f"data/espn/teams/espn_team_names_{season}.parquet")
+ if not espn_path.exists():
+ raise FileNotFoundError(f"ESPN team data not found: {espn_path}")
+
+ df = read_parquet_df(str(espn_path))
+ logger.info(f"Loaded {len(df)} teams from ESPN")
+ return df
+
+
+def fuzzy_match_team(espn_name: str, kenpom_names: list[str], threshold: int = 85) -> str | None:
+ """Find best matching KenPom team name using fuzzy string matching.
+
+ Args:
+ espn_name: ESPN team display name
+ kenpom_names: List of KenPom team names
+ threshold: Minimum similarity score (0-100)
+
+ Returns:
+ Best matching KenPom name or None if no good match
+ """
+ # Extract core team name (remove mascot/common suffixes)
+ espn_core = (
+ espn_name.replace(" University", "")
+ .replace(" State", "")
+ .replace(" College", "")
+ .replace(" Eagles", "")
+ .replace(" Wildcats", "")
+ .replace(" Tigers", "")
+ .replace(" Bears", "")
+ .replace(" Bulldogs", "")
+ .strip()
+ )
+
+ best_score = 0
+ best_match = None
+
+ for kenpom_name in kenpom_names:
+ # Compare full names
+ score_full = fuzz.ratio(espn_name.lower(), kenpom_name.lower())
+
+ # Compare core names
+ kenpom_core = kenpom_name.replace(" St.", "").replace(" A&M", "").strip()
+ score_core = fuzz.ratio(espn_core.lower(), kenpom_core.lower())
+
+ # Use best score
+ score = max(score_full, score_core)
+
+ if score > best_score:
+ best_score = score
+ best_match = kenpom_name
+
+ if best_score >= threshold:
+ return best_match
+ return None
+
+
+def map_espn_to_canonical(mapping_df: pd.DataFrame, espn_df: pd.DataFrame) -> pd.DataFrame:
+ """Map ESPN teams to canonical team mapping.
+
+ Args:
+ mapping_df: Canonical team mapping DataFrame
+ espn_df: ESPN teams DataFrame
+
+ Returns:
+ Updated mapping DataFrame with ESPN fields populated
+ """
+ kenpom_names = mapping_df["kenpom_name"].tolist()
+ matches = []
+ unmatched = []
+
+ for _, espn_row in espn_df.iterrows():
+ espn_name = espn_row["display_name"]
+
+ # Try manual mapping first
+ if espn_name in MANUAL_ESPN_MAPPINGS:
+ kenpom_match = MANUAL_ESPN_MAPPINGS[espn_name]
+ match_type = "manual"
+ else:
+ # Try fuzzy matching
+ kenpom_match = fuzzy_match_team(espn_name, kenpom_names)
+ match_type = "fuzzy" if kenpom_match else None
+
+ if kenpom_match:
+ matches.append(
+ {
+ "kenpom_name": kenpom_match,
+ "espn_id": espn_row["team_id"],
+ "espn_display_name": espn_row["display_name"],
+ "espn_abbreviation": espn_row["abbreviation"],
+ "espn_slug": espn_row["slug"],
+ "match_type": match_type,
+ }
+ )
+ else:
+ unmatched.append(espn_name)
+
+ logger.info(f"Matched {len(matches)} ESPN teams")
+ logger.info(f" Manual matches: {sum(1 for m in matches if m['match_type'] == 'manual')}")
+ logger.info(f" Fuzzy matches: {sum(1 for m in matches if m['match_type'] == 'fuzzy')}")
+
+ if unmatched:
+ logger.warning(f"Unmatched ESPN teams ({len(unmatched)}): {unmatched}")
+
+ # Update mapping DataFrame
+ matches_df = pd.DataFrame(matches)
+ mapping_df = mapping_df.merge(
+ matches_df[
+ [
+ "kenpom_name",
+ "espn_id",
+ "espn_display_name",
+ "espn_abbreviation",
+ "espn_slug",
+ ]
+ ],
+ on="kenpom_name",
+ how="left",
+ suffixes=("_old", ""),
+ )
+
+    # Drop pre-existing ESPN columns (suffixed "_old" by the merge)
+ cols_to_drop = [c for c in mapping_df.columns if c.endswith("_old")]
+ mapping_df = mapping_df.drop(columns=cols_to_drop)
+
+ return mapping_df
+
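The merge uses `suffixes=("_old", "")` so any pre-existing ESPN columns are renamed out of the way and then dropped, which lets re-runs refresh the fields. A minimal pandas illustration (toy frames, hypothetical IDs):

```python
import pandas as pd

mapping = pd.DataFrame({"kenpom_name": ["Duke", "Kansas"], "espn_id": [None, None]})
matches = pd.DataFrame({"kenpom_name": ["Duke"], "espn_id": ["150"]})

# Left frame's stale espn_id becomes espn_id_old; right frame's wins the name
merged = mapping.merge(matches, on="kenpom_name", how="left", suffixes=("_old", ""))
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_old")])
# Rows with no match keep NaN in the refreshed espn_id column
```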
+
+def save_team_mapping(mapping_df: pd.DataFrame, output_path: Path) -> None:
+ """Save updated team mapping table.
+
+ Args:
+ mapping_df: Team mapping DataFrame
+ output_path: Path to save the parquet file
+ """
+ write_parquet(mapping_df, str(output_path), index=False)
+ logger.info(f"Saved updated team mapping to {output_path}")
+
+
+def main() -> None:
+ """Map ESPN teams to canonical mapping."""
+ logger.info("Starting ESPN team mapping...")
+
+ # Load data
+ mapping_df = load_team_mapping()
+ espn_df = load_espn_teams(season=2026)
+
+ # Map ESPN to canonical
+ mapping_df = map_espn_to_canonical(mapping_df, espn_df)
+
+ # Summary
+ espn_mapped = mapping_df["espn_id"].notna().sum()
+ logger.info("\nESPN Mapping Summary:")
+ logger.info(f" Teams with ESPN data: {espn_mapped} / {len(mapping_df)}")
+ logger.info(f" Coverage: {espn_mapped / len(mapping_df) * 100:.1f}%")
+
+ # Save updated mapping
+ output_path = Path("data/processed/team_mapping.parquet")
+ save_team_mapping(mapping_df, output_path)
+
+ logger.info("\nNext step: Run python scripts/map_overtime_teams.py")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/map_odds_api_teams.py b/scripts/processing/map_odds_api_teams.py
new file mode 100644
index 000000000..d4d14730d
--- /dev/null
+++ b/scripts/processing/map_odds_api_teams.py
@@ -0,0 +1,372 @@
+"""Map The Odds API team names to canonical team mapping.
+
+This script matches The Odds API team names to the canonical KenPom-based
+team mapping table. The Odds API uses full team names with mascots.
+
+Usage:
+    uv run python scripts/processing/map_odds_api_teams.py
+
+Output:
+ Updates data/processed/team_mapping.parquet with Odds API fields
+"""
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+from thefuzz import fuzz
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_parquet
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+# Manual mappings for Odds API name variations
+MANUAL_ODDS_API_MAPPINGS = {
+ # Odds API name -> KenPom TeamName
+ "Alabama Crimson Tide": "Alabama",
+ "Cleveland St Vikings": "Cleveland St.",
+ "Colorado Buffaloes": "Colorado",
+ "East Tennessee St Buccaneers": "East Tennessee St.",
+ "Florida Atlantic Owls": "Florida Atlantic",
+ "Florida Gators": "Florida",
+ "Green Bay Phoenix": "Green Bay",
+ "Illinois Fighting Illini": "Illinois",
+ "Iowa Hawkeyes": "Iowa",
+ "Iowa State Cyclones": "Iowa St.",
+ "Kansas St Wildcats": "Kansas St.",
+ "Maryland Terrapins": "Maryland",
+ "Memphis Tigers": "Memphis",
+ "Minnesota Golden Gophers": "Minnesota",
+ "Mt. St. Mary's Mountaineers": "Mount St. Mary's",
+ "Nebraska Cornhuskers": "Nebraska",
+ "Northern Kentucky Norse": "Northern Kentucky",
+ "Oregon Ducks": "Oregon",
+ "Penn State Nittany Lions": "Penn St.",
+ "Purdue Boilermakers": "Purdue",
+ "St. Thomas (MN) Tommies": "St. Thomas",
+ "TCU Horned Frogs": "TCU",
+ "UL Monroe Warhawks": "Louisiana Monroe",
+ "Western Carolina Catamounts": "Western Carolina",
+ "Wichita St Shockers": "Wichita St.",
+ "Wright St Raiders": "Wright St.",
+ # Additional common patterns
+ "Alabama A&M Bulldogs": "Alabama A&M",
+ "Alabama St Hornets": "Alabama St.",
+ "Appalachian State Mountaineers": "Appalachian St.",
+ "Arizona State Sun Devils": "Arizona St.",
+ "Arizona Wildcats": "Arizona",
+ "Arkansas Razorbacks": "Arkansas",
+ "Arkansas State Red Wolves": "Arkansas St.",
+ "Auburn Tigers": "Auburn",
+ "Ball State Cardinals": "Ball St.",
+ "Baylor Bears": "Baylor",
+ "Boise State Broncos": "Boise St.",
+ "Boston College Eagles": "Boston College",
+ "Bowling Green Falcons": "Bowling Green",
+ "BYU Cougars": "BYU",
+ "California Golden Bears": "California",
+ "Cincinnati Bearcats": "Cincinnati",
+ "Clemson Tigers": "Clemson",
+ "Colorado State Rams": "Colorado St.",
+ "Connecticut Huskies": "Connecticut",
+ "Duke Blue Devils": "Duke",
+ "Florida State Seminoles": "Florida St.",
+ "Fresno State Bulldogs": "Fresno St.",
+ "Georgia Bulldogs": "Georgia",
+ "Georgia State Panthers": "Georgia St.",
+ "Georgia Tech Yellow Jackets": "Georgia Tech",
+ "Gonzaga Bulldogs": "Gonzaga",
+ "Houston Cougars": "Houston",
+ "Indiana Hoosiers": "Indiana",
+ "Indiana State Sycamores": "Indiana St.",
+ "Kansas Jayhawks": "Kansas",
+ "Kansas State Wildcats": "Kansas St.",
+ "Kent State Golden Flashes": "Kent St.",
+ "Kentucky Wildcats": "Kentucky",
+ "Louisiana State Tigers": "Louisiana St.",
+ "Louisville Cardinals": "Louisville",
+ "LSU Tigers": "Louisiana St.",
+ "Marquette Golden Eagles": "Marquette",
+ "Miami (FL) Hurricanes": "Miami FL",
+ "Miami Hurricanes": "Miami FL",
+ "Miami (OH) RedHawks": "Miami OH",
+ "Michigan Wolverines": "Michigan",
+ "Michigan State Spartans": "Michigan St.",
+ "Middle Tennessee Blue Raiders": "Middle Tennessee",
+ "Mississippi State Bulldogs": "Mississippi St.",
+ "Missouri Tigers": "Missouri",
+ "Missouri State Bears": "Missouri St.",
+ "Montana State Bobcats": "Montana St.",
+ "Murray State Racers": "Murray St.",
+ "NC State Wolfpack": "N.C. State",
+ "North Carolina State Wolfpack": "N.C. State",
+ "North Carolina Tar Heels": "North Carolina",
+ "North Dakota State Bison": "North Dakota St.",
+ "Notre Dame Fighting Irish": "Notre Dame",
+ "Ohio State Buckeyes": "Ohio St.",
+ "Oklahoma State Cowboys": "Oklahoma St.",
+ "Ole Miss Rebels": "Mississippi",
+ "Oregon State Beavers": "Oregon St.",
+ "Pitt Panthers": "Pittsburgh",
+ "Pittsburgh Panthers": "Pittsburgh",
+ "San Diego State Aztecs": "San Diego St.",
+ "San Jose State Spartans": "San Jose St.",
+ "South Carolina Gamecocks": "South Carolina",
+ "South Carolina State Bulldogs": "South Carolina St.",
+ "South Dakota State Jackrabbits": "South Dakota St.",
+ "South Florida Bulls": "South Florida",
+ "Southern California Trojans": "USC",
+ "Stanford Cardinal": "Stanford",
+ "Syracuse Orange": "Syracuse",
+ "Tennessee Volunteers": "Tennessee",
+ "Tennessee State Tigers": "Tennessee St.",
+ "Texas A&M Aggies": "Texas A&M",
+ "Texas Longhorns": "Texas",
+ "Texas State Bobcats": "Texas St.",
+ "Texas Tech Red Raiders": "Texas Tech",
+ "UCF Knights": "UCF",
+ "UCLA Bruins": "UCLA",
+ "UConn Huskies": "Connecticut",
+ "USC Trojans": "USC",
+ "Utah State Aggies": "Utah St.",
+ "VCU Rams": "VCU",
+ "Villanova Wildcats": "Villanova",
+ "Virginia Cavaliers": "Virginia",
+ "Virginia Tech Hokies": "Virginia Tech",
+ "Wake Forest Demon Deacons": "Wake Forest",
+ "Washington Huskies": "Washington",
+ "Washington State Cougars": "Washington St.",
+ "West Virginia Mountaineers": "West Virginia",
+ "Western Kentucky Hilltoppers": "Western Kentucky",
+ "Wisconsin Badgers": "Wisconsin",
+ "Wofford Terriers": "Wofford",
+ "Wyoming Cowboys": "Wyoming",
+}
+
+
+def load_team_mapping() -> pd.DataFrame:
+ """Load the canonical team mapping table."""
+ mapping_path = Path("data/processed/team_mapping.parquet")
+ if not mapping_path.exists():
+ raise FileNotFoundError(f"Team mapping not found: {mapping_path}")
+
+ df = read_parquet_df(str(mapping_path))
+ logger.info(f"Loaded team mapping with {len(df)} teams")
+ return df
+
+
+def load_odds_api_teams() -> pd.DataFrame:
+ """Load Odds API team names from recent data."""
+ # Find most recent odds file
+ odds_dir = Path("data/odds_api/sample")
+ if not odds_dir.exists():
+ raise FileNotFoundError(
+ "Odds API sample data not found. Run collect_odds_api_sample.py first."
+ )
+
+ odds_files = list(odds_dir.glob("ncaab_odds_*.parquet"))
+ if not odds_files:
+ raise FileNotFoundError("No Odds API sample files found")
+
+ # Use most recent file
+ latest_file = max(odds_files, key=lambda p: p.stat().st_mtime)
+ logger.info(f"Loading Odds API data from {latest_file}")
+
+ df = read_parquet_df(str(latest_file))
+
+ # Extract unique team names
+ home_teams = df["home_team"].unique()
+ away_teams = df["away_team"].unique()
+ all_teams = sorted(set(list(home_teams) + list(away_teams)))
+
+ logger.info(f"Found {len(all_teams)} unique teams in Odds API data")
+ return pd.DataFrame({"odds_api_name": all_teams})
+
+
+def fuzzy_match_with_mascot(
+ odds_api_name: str, kenpom_names: list[str], threshold: int = 85
+) -> str | None:
+ """Find best matching KenPom team name.
+
+ Odds API includes mascots (e.g., "Alabama Crimson Tide"), so we need
+ to extract the core team name for matching.
+
+ Args:
+ odds_api_name: Full Odds API team name with mascot
+ kenpom_names: List of KenPom team names
+ threshold: Minimum similarity score (0-100)
+
+ Returns:
+ Best matching KenPom name or None if no good match
+ """
+ # Extract core team name (first part before mascot)
+ # e.g., "Alabama Crimson Tide" -> "Alabama"
+ # e.g., "East Tennessee St Buccaneers" -> "East Tennessee St"
+ core_name = odds_api_name
+
+ # Common mascot patterns to remove
+ mascots = [
+ " Crimson Tide",
+ " Golden Griffins",
+ " Mocs",
+ " Vikings",
+ " Chanticleers",
+ " Buffaloes",
+ " Dukes",
+ " Pirates",
+ " Buccaneers",
+ " Stags",
+ " Owls",
+ " Gators",
+ " Paladins",
+ " Phoenix",
+ " Fighting Illini",
+ " Hawkeyes",
+ " Cyclones",
+ " Wildcats",
+ " Jaspers",
+ " Red Foxes",
+ " Terrapins",
+ " Tigers",
+ " Warriors",
+ " Panthers",
+ " Golden Gophers",
+ " Mountaineers",
+ " Cornhuskers",
+ " Purple Eagles",
+ " Norse",
+ " Golden Grizzlies",
+ " Ducks",
+ " Nittany Lions",
+ " Boilermakers",
+ " Bobcats",
+ " Rams",
+ " Broncs",
+ " Pioneers",
+ " Peacocks",
+ " Bulldogs",
+ " Saints",
+ " Tommies",
+ " Horned Frogs",
+ " Green Wave",
+ " Golden Hurricane",
+ " Warhawks",
+ " Kangaroos",
+ " Catamounts",
+ " Shockers",
+ " Terriers",
+ " Raiders",
+ ]
+
+ for mascot in mascots:
+ if core_name.endswith(mascot):
+ core_name = core_name[: -len(mascot)].strip()
+ break
+
+ best_score = 0
+ best_match = None
+
+ for kenpom_name in kenpom_names:
+ # Compare full names
+ score_full = fuzz.ratio(odds_api_name.lower(), kenpom_name.lower())
+
+ # Compare core names (more important)
+ score_core = fuzz.ratio(core_name.lower(), kenpom_name.lower())
+
+        # Use the best score, weighting the core-name match higher.
+        # Note: the boosted core score can exceed 100; that is fine for
+        # ranking and for the >= threshold check below.
+        score = max(score_full, score_core * 1.2)
+
+ if score > best_score:
+ best_score = score
+ best_match = kenpom_name
+
+ if best_score >= threshold:
+ return best_match
+ return None
+
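Mascot stripping is a first-match suffix strip against a known list; a compact sketch (the real list above is much longer):

```python
MASCOT_SUFFIXES = (" Crimson Tide", " Red Raiders", " Shockers")

def strip_mascot(name: str) -> str:
    """Drop a known trailing mascot: 'Alabama Crimson Tide' -> 'Alabama'."""
    for suffix in MASCOT_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)].strip()
    return name

core = strip_mascot("Texas Tech Red Raiders")  # -> "Texas Tech"
```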
+
+def map_odds_api_to_canonical(mapping_df: pd.DataFrame, odds_api_df: pd.DataFrame) -> pd.DataFrame:
+ """Map Odds API teams to canonical team mapping."""
+ kenpom_names = mapping_df["kenpom_name"].tolist()
+ matches = []
+ unmatched = []
+
+ for odds_api_name in odds_api_df["odds_api_name"]:
+ # Try manual mapping first
+ if odds_api_name in MANUAL_ODDS_API_MAPPINGS:
+ kenpom_match = MANUAL_ODDS_API_MAPPINGS[odds_api_name]
+ match_type = "manual"
+ else:
+ # Try fuzzy matching with mascot handling
+ kenpom_match = fuzzy_match_with_mascot(odds_api_name, kenpom_names)
+ match_type = "fuzzy" if kenpom_match else None
+
+ if kenpom_match:
+ matches.append(
+ {
+ "kenpom_name": kenpom_match,
+ "odds_api_name": odds_api_name,
+ "match_type": match_type,
+ }
+ )
+ else:
+ unmatched.append(odds_api_name)
+
+ logger.info(f"Matched {len(matches)} Odds API teams")
+ logger.info(f" Manual matches: {sum(1 for m in matches if m['match_type'] == 'manual')}")
+ logger.info(f" Fuzzy matches: {sum(1 for m in matches if m['match_type'] == 'fuzzy')}")
+
+ if unmatched:
+ logger.warning(f"Unmatched Odds API teams ({len(unmatched)}): {unmatched}")
+
+ # Update mapping DataFrame
+ matches_df = pd.DataFrame(matches)
+ mapping_df = mapping_df.merge(
+ matches_df[["kenpom_name", "odds_api_name"]],
+ on="kenpom_name",
+ how="left",
+ suffixes=("_old", ""),
+ )
+
+ # Drop old columns
+ cols_to_drop = [c for c in mapping_df.columns if c.endswith("_old")]
+ mapping_df = mapping_df.drop(columns=cols_to_drop)
+
+ return mapping_df
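The `suffixes=("_old", "")` merge pattern above is subtle: the overlapping column from the left frame is renamed `*_old` while the freshly matched right-hand column keeps the bare name, so dropping the `*_old` columns effectively overwrites the stale values. A minimal sketch with made-up data:

```python
import pandas as pd

mapping = pd.DataFrame({"kenpom_name": ["Duke", "Kansas"], "odds_api_name": ["stale", None]})
matches = pd.DataFrame({"kenpom_name": ["Duke"], "odds_api_name": ["Duke Blue Devils"]})

merged = mapping.merge(matches, on="kenpom_name", how="left", suffixes=("_old", ""))
# Left-side duplicate becomes odds_api_name_old; the right side keeps odds_api_name
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_old")])

print(list(merged.columns))            # ['kenpom_name', 'odds_api_name']
print(merged.loc[0, "odds_api_name"])  # Duke Blue Devils
```

One consequence worth knowing: rows with no match this run (Kansas above) end up with NaN rather than keeping their previous value, so the pattern replaces the column rather than patching it.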
+
+
+def save_team_mapping(mapping_df: pd.DataFrame, output_path: Path) -> None:
+ """Save updated team mapping table."""
+ write_parquet(mapping_df, str(output_path), index=False)
+ logger.info(f"Saved updated team mapping to {output_path}")
+
+
+def main() -> None:
+ """Map Odds API teams to canonical mapping."""
+ logger.info("Starting Odds API team mapping...")
+
+ # Load data
+ mapping_df = load_team_mapping()
+ odds_api_df = load_odds_api_teams()
+
+ # Map Odds API to canonical
+ mapping_df = map_odds_api_to_canonical(mapping_df, odds_api_df)
+
+ # Summary
+ odds_api_mapped = mapping_df["odds_api_name"].notna().sum()
+ logger.info("\nOdds API Mapping Summary:")
+ logger.info(f" Teams with Odds API data: {odds_api_mapped} / {len(mapping_df)}")
+ logger.info(f" Coverage: {odds_api_mapped / len(mapping_df) * 100:.1f}%")
+
+ # Save updated mapping
+ output_path = Path("data/processed/team_mapping.parquet")
+ save_team_mapping(mapping_df, output_path)
+
+ logger.info("\nAll data sources mapped!")
+ logger.info("Team mapping system is complete.")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/map_overtime_teams.py b/scripts/processing/map_overtime_teams.py
new file mode 100644
index 000000000..beb156e10
--- /dev/null
+++ b/scripts/processing/map_overtime_teams.py
@@ -0,0 +1,318 @@
+"""Map Overtime.ag team names to canonical team mapping.
+
+This script matches Overtime.ag team names to the canonical KenPom-based
+team mapping table. Overtime uses simpler naming conventions.
+
+Usage:
+    uv run python scripts/processing/map_overtime_teams.py
+
+Output:
+ Updates data/processed/team_mapping.parquet with Overtime fields
+"""
+
+import logging
+from pathlib import Path
+
+import pandas as pd
+from thefuzz import fuzz
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df, write_parquet
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+# Manual mappings for Overtime.ag name variations
+MANUAL_OVERTIME_MAPPINGS = {
+ # Overtime name -> KenPom TeamName
+ "Alabama A&M": "Alabama A&M",
+ "Alabama State": "Alabama St.",
+ "Appalachian State": "Appalachian St.",
+ "Arizona State": "Arizona St.",
+ "Arkansas Little Rock": "Little Rock",
+ "Arkansas State": "Arkansas St.",
+ "Arkansas-Pine Bluff": "Arkansas Pine Bluff",
+ "Ball State": "Ball St.",
+ "Bethune-Cookman": "Bethune Cookman",
+ "Boise State": "Boise St.",
+ "Bowling Green": "Bowling Green",
+ "Cal Poly": "Cal Poly",
+ "Cal Poly SLO": "Cal Poly",
+ "Cal Riverside": "UC Riverside",
+    "Cal State Bakersfield": "Cal St. Bakersfield",
+    "Cal State Fullerton": "Cal St. Fullerton",
+    "Cal State Northridge": "CSUN",
+ "California Baptist": "Cal Baptist",
+ "Coll Of Charleston": "Charleston",
+ "CS Bakersfield": "Cal St. Bakersfield",
+ "CS Fullerton": "Cal St. Fullerton",
+ "CS Northridge": "CSUN",
+ "East Tenn State": "East Tennessee St.",
+ "Idaho State": "Idaho St.",
+ "IPFW": "Purdue Fort Wayne",
+ "Central Connecticut": "Central Connecticut St.",
+ "Central Florida": "UCF",
+ "Coastal Carolina": "Coastal Carolina",
+ "Colorado State": "Colorado St.",
+ "Delaware State": "Delaware St.",
+ "Eastern Illinois": "Eastern Illinois",
+ "Eastern Kentucky": "Eastern Kentucky",
+ "Eastern Michigan": "Eastern Michigan",
+ "Eastern Washington": "Eastern Washington",
+ "Florida A&M": "Florida A&M",
+ "Florida Atlantic": "Florida Atlantic",
+ "Florida Gulf Coast": "Florida Gulf Coast",
+ "Florida International": "FIU",
+ "Florida State": "Florida St.",
+ "Fresno State": "Fresno St.",
+ "Georgia Southern": "Georgia Southern",
+ "Georgia State": "Georgia St.",
+ "Georgia Tech": "Georgia Tech",
+ "Grambling State": "Grambling",
+ "Grand Canyon": "Grand Canyon",
+ "Illinois State": "Illinois St.",
+ "Indiana State": "Indiana St.",
+ "Iowa State": "Iowa St.",
+ "Jackson State": "Jackson St.",
+ "Jacksonville State": "Jacksonville St.",
+ "Kansas State": "Kansas St.",
+ "Kent State": "Kent St.",
+ "Long Beach State": "Long Beach St.",
+ "Louisiana State": "Louisiana St.",
+ "Miami (FL)": "Miami FL",
+ "Miami Florida": "Miami FL",
+ "Miami (OH)": "Miami OH",
+ "Miami Ohio": "Miami OH",
+ "Michigan State": "Michigan St.",
+ "Middle Tennessee": "Middle Tennessee",
+ "Middle Tenn St": "Middle Tennessee",
+ "Mississippi State": "Mississippi St.",
+ "Missouri State": "Missouri St.",
+ "Montana State": "Montana St.",
+ "Morehead State": "Morehead St.",
+ "Morgan State": "Morgan St.",
+ "Murray State": "Murray St.",
+ "New Mexico State": "New Mexico St.",
+ "Norfolk State": "Norfolk St.",
+ "North Carolina A&T": "North Carolina A&T",
+ "North Carolina Central": "N.C. Central",
+ "NC State": "N.C. State",
+ "North Carolina State": "N.C. State",
+ "No. Colorado": "Northern Colorado",
+ "North Dakota State": "North Dakota St.",
+ "Northern Arizona": "Northern Arizona",
+ "Northern Colorado": "Northern Colorado",
+ "Northern Illinois": "Northern Illinois",
+ "Northern Iowa": "Northern Iowa",
+ "Northern Kentucky": "Northern Kentucky",
+ "Northwestern State": "Northwestern St.",
+ "Ohio State": "Ohio St.",
+ "Oklahoma State": "Oklahoma St.",
+ "Old Dominion": "Old Dominion",
+ "Oral Roberts": "Oral Roberts",
+ "Oregon State": "Oregon St.",
+ "Penn State": "Penn St.",
+ "Portland State": "Portland St.",
+ "Prairie View A&M": "Prairie View",
+ "Sacramento State": "Sacramento St.",
+ "Saint Marys CA": "Saint Mary's",
+ "Sam Houston": "Sam Houston St.",
+ "Sam Houston State": "Sam Houston St.",
+ "San Diego": "San Diego",
+    "SE Missouri State": "Southeast Missouri St.",
+ "SIU Edwardsville": "SIUE",
+ "San Diego State": "San Diego St.",
+ "San Francisco": "San Francisco",
+ "San Jose State": "San Jose St.",
+ "South Alabama": "South Alabama",
+ "South Carolina": "South Carolina",
+ "South Carolina State": "South Carolina St.",
+ "South Dakota": "South Dakota",
+ "South Dakota State": "South Dakota St.",
+ "South Florida": "South Florida",
+ "Southeast Missouri State": "Southeast Missouri St.",
+ "Southeastern Louisiana": "SE Louisiana",
+ "Southern Illinois": "Southern Illinois",
+ "Southern Miss": "Southern Miss",
+ "St. Bonaventure": "St. Bonaventure",
+ "St. John's": "St. John's",
+ "St. Josephs": "Saint Joseph's",
+ "Stephen F. Austin": "Stephen F. Austin",
+ "Tarleton State": "Tarleton St.",
+ "Stony Brook": "Stony Brook",
+ "Tennessee State": "Tennessee St.",
+ "Tennessee Tech": "Tennessee Tech",
+ "Texas A&M": "Texas A&M",
+ "Texas A&M-Corpus Christi": "Texas A&M Corpus Chris",
+ "Texas State": "Texas St.",
+ "Texas Tech": "Texas Tech",
+ "UC Davis": "UC Davis",
+ "UC Irvine": "UC Irvine",
+ "UC Riverside": "UC Riverside",
+ "UC San Diego": "UC San Diego",
+ "UC Santa Barbara": "UC Santa Barbara",
+ "UConn": "Connecticut",
+ "UL": "Louisiana",
+ "UMass Lowell": "UMass Lowell",
+ "UNC Asheville": "UNC Asheville",
+ "UNC Greensboro": "UNC Greensboro",
+ "UNC Wilmington": "UNC Wilmington",
+ "USC": "USC",
+ "UT Arlington": "UT Arlington",
+ "UT Rio Grande Valley": "UT Rio Grande Valley",
+ "UT San Antonio": "UT San Antonio",
+ "Utah State": "Utah St.",
+ "Utah Valley": "Utah Valley",
+ "VCU": "VCU",
+ "Virginia Tech": "Virginia Tech",
+ "Washington State": "Washington St.",
+ "Weber State": "Weber St.",
+ "Western Carolina": "Western Carolina",
+ "Western Illinois": "Western Illinois",
+ "Western Kentucky": "Western Kentucky",
+ "Western Michigan": "Western Michigan",
+ "Wichita State": "Wichita St.",
+ "William & Mary": "William & Mary",
+ "Winthrop": "Winthrop",
+ "Wright State": "Wright St.",
+ "Wyoming": "Wyoming",
+}
+
+
+def load_team_mapping() -> pd.DataFrame:
+ """Load the canonical team mapping table."""
+ mapping_path = Path("data/processed/team_mapping.parquet")
+ if not mapping_path.exists():
+ raise FileNotFoundError(f"Team mapping not found: {mapping_path}")
+
+ df = read_parquet_df(str(mapping_path))
+ logger.info(f"Loaded team mapping with {len(df)} teams")
+ return df
+
+
+def load_overtime_teams() -> pd.DataFrame:
+ """Load Overtime.ag team names from all available data."""
+ overtime_dir = Path("data/overtime")
+ if not overtime_dir.exists():
+ raise FileNotFoundError(f"Overtime directory not found: {overtime_dir}")
+
+ # Find all parquet files
+ parquet_files = list(overtime_dir.glob("*.parquet"))
+ if not parquet_files:
+ raise FileNotFoundError("No Overtime parquet files found")
+
+ logger.info(f"Loading from {len(parquet_files)} Overtime files")
+
+ # Collect all unique team names across all files
+ all_teams = set()
+ for file in parquet_files:
+ df = read_parquet_df(str(file))
+ home_teams = df["home_team"].unique()
+ away_teams = df["away_team"].unique()
+ all_teams.update(home_teams)
+ all_teams.update(away_teams)
+
+ all_teams_sorted = sorted(all_teams)
+ logger.info(f"Found {len(all_teams_sorted)} unique teams across all Overtime data")
+ return pd.DataFrame({"overtime_name": all_teams_sorted})
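Collecting unique names across files with a running set, as the loader above does, looks like this on synthetic stand-in frames:

```python
import pandas as pd

# Two stand-in "files" worth of Overtime rows (synthetic data)
frames = [
    pd.DataFrame({"home_team": ["Duke", "UCLA"], "away_team": ["Kansas", "Duke"]}),
    pd.DataFrame({"home_team": ["Kansas"], "away_team": ["Gonzaga"]}),
]

all_teams: set[str] = set()
for df in frames:
    # Union in both sides of every matchup; the set deduplicates across files
    all_teams.update(df["home_team"].unique())
    all_teams.update(df["away_team"].unique())

print(sorted(all_teams))  # ['Duke', 'Gonzaga', 'Kansas', 'UCLA']
```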
+
+
+def map_overtime_to_canonical(mapping_df: pd.DataFrame, overtime_df: pd.DataFrame) -> pd.DataFrame:
+ """Map Overtime teams to canonical team mapping."""
+ kenpom_names = mapping_df["kenpom_name"].tolist()
+ matches = []
+ unmatched = []
+
+ for overtime_name in overtime_df["overtime_name"]:
+ # Try manual mapping first
+ if overtime_name in MANUAL_OVERTIME_MAPPINGS:
+ kenpom_match = MANUAL_OVERTIME_MAPPINGS[overtime_name]
+ match_type = "manual"
+ # Try exact match
+ elif overtime_name in kenpom_names:
+ kenpom_match = overtime_name
+ match_type = "exact"
+ # Try fuzzy match as last resort
+ else:
+ best_score = 0
+ best_match = None
+ for kenpom_name in kenpom_names:
+ score = fuzz.ratio(overtime_name.lower(), kenpom_name.lower())
+ if score > best_score:
+ best_score = score
+ best_match = kenpom_name
+
+ if best_score >= 90: # High threshold for auto-matching
+ kenpom_match = best_match
+ match_type = "fuzzy"
+ else:
+ kenpom_match = None
+ match_type = None
+
+ if kenpom_match:
+ matches.append(
+ {
+ "kenpom_name": kenpom_match,
+ "overtime_name": overtime_name,
+ "match_type": match_type,
+ }
+ )
+ else:
+ unmatched.append(overtime_name)
+
+ logger.info(f"Matched {len(matches)} Overtime teams")
+ logger.info(f" Manual matches: {sum(1 for m in matches if m['match_type'] == 'manual')}")
+ logger.info(f" Exact matches: {sum(1 for m in matches if m['match_type'] == 'exact')}")
+ logger.info(f" Fuzzy matches: {sum(1 for m in matches if m['match_type'] == 'fuzzy')}")
+
+ if unmatched:
+ logger.warning(f"Unmatched Overtime teams ({len(unmatched)}): {unmatched}")
+
+ # Update mapping DataFrame
+ matches_df = pd.DataFrame(matches)
+ mapping_df = mapping_df.merge(
+ matches_df[["kenpom_name", "overtime_name"]],
+ on="kenpom_name",
+ how="left",
+ suffixes=("_old", ""),
+ )
+
+ # Drop old columns
+ cols_to_drop = [c for c in mapping_df.columns if c.endswith("_old")]
+ mapping_df = mapping_df.drop(columns=cols_to_drop)
+
+ return mapping_df
+
+
+def save_team_mapping(mapping_df: pd.DataFrame, output_path: Path) -> None:
+ """Save updated team mapping table."""
+ write_parquet(mapping_df, str(output_path), index=False)
+ logger.info(f"Saved updated team mapping to {output_path}")
+
+
+def main() -> None:
+ """Map Overtime.ag teams to canonical mapping."""
+ logger.info("Starting Overtime.ag team mapping...")
+
+ # Load data
+ mapping_df = load_team_mapping()
+ overtime_df = load_overtime_teams()
+
+ # Map Overtime to canonical
+ mapping_df = map_overtime_to_canonical(mapping_df, overtime_df)
+
+ # Summary
+ overtime_mapped = mapping_df["overtime_name"].notna().sum()
+ logger.info("\nOvertime.ag Mapping Summary:")
+ logger.info(f" Teams with Overtime data: {overtime_mapped} / {len(mapping_df)}")
+ logger.info(f" Coverage: {overtime_mapped / len(mapping_df) * 100:.1f}%")
+
+ # Save updated mapping
+ output_path = Path("data/processed/team_mapping.parquet")
+ save_team_mapping(mapping_df, output_path)
+
+    logger.info("\nNext step: Collect Odds API data and run python scripts/processing/map_odds_api_teams.py")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/migrate_orphaned_scores.py b/scripts/processing/migrate_orphaned_scores.py
new file mode 100644
index 000000000..51905387e
--- /dev/null
+++ b/scripts/processing/migrate_orphaned_scores.py
@@ -0,0 +1,206 @@
+"""Migrate orphaned scores to canonical event IDs.
+
+After deduplication, some scores may still point to deleted event_ids.
+This script finds those orphaned scores and migrates them to the canonical
+event_id by matching on (home_team, away_team, game_date).
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+
+import pandas as pd
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+_MAPPING_CACHE: pd.DataFrame | None = None
+
+
+def normalize_team_name(name: str) -> str:
+    """Normalize team name for matching.
+
+    The mapping table is loaded once and cached at module level, since this
+    function is called for every orphaned score.
+    """
+    global _MAPPING_CACHE
+    mapping_path = Path("data/staging/mappings/team_mapping.parquet")
+    if _MAPPING_CACHE is None and mapping_path.exists():
+        try:
+            _MAPPING_CACHE = read_parquet_df(str(mapping_path))
+        except Exception:
+            # Fall back to pass-through normalization, but surface the failure
+            logger.warning("Failed to load team mapping from %s", mapping_path, exc_info=True)
+    if _MAPPING_CACHE is not None:
+        for col in ["odds_api_name", "espn_name", "kenpom_name"]:
+            if col in _MAPPING_CACHE.columns:
+                match = _MAPPING_CACHE[_MAPPING_CACHE[col] == name]
+                if len(match) > 0:
+                    return str(match.iloc[0]["odds_api_name"])
+    return name.strip()
+
+
+def find_matching_event(
+ home_team: str, away_team: str, game_date: str, db: OddsAPIDatabase
+) -> str | None:
+ """Find matching event in events table.
+
+ Args:
+ home_team: Home team name
+ away_team: Away team name
+ game_date: Game date (YYYY-MM-DD)
+ db: Database connection
+
+ Returns:
+ Event ID if found, None otherwise
+ """
+ # Normalize team names
+ home_norm = normalize_team_name(home_team)
+ away_norm = normalize_team_name(away_team)
+
+ # Try exact match first
+ query = """
+ SELECT event_id
+ FROM events
+ WHERE home_team = ? AND away_team = ? AND DATE(commence_time) = ?
+ LIMIT 1
+ """
+ result = db.conn.execute(query, (home_norm, away_norm, game_date)).fetchone()
+
+ if result:
+ return result[0]
+
+ # Try with original names
+ result = db.conn.execute(query, (home_team, away_team, game_date)).fetchone()
+
+ if result:
+ return result[0]
+
+ # Try swapping home/away (rare but possible)
+ result = db.conn.execute(query, (away_norm, home_norm, game_date)).fetchone()
+
+ if result:
+ logger.warning(f"Found match with swapped home/away: {home_team} @ {away_team}")
+ return result[0]
+
+ return None
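The date comparison above relies on SQLite's `DATE()` accepting ISO-8601 timestamps, so a `commence_time` like `2026-01-28T23:00:00Z` collapses to `2026-01-28`. A self-contained sketch with a made-up event (the table shape matches the script's query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, home_team TEXT, away_team TEXT, commence_time TEXT)"
)
conn.execute("INSERT INTO events VALUES ('ev1', 'Duke', 'Kansas', '2026-01-28T23:00:00Z')")

row = conn.execute(
    """
    SELECT event_id FROM events
    WHERE home_team = ? AND away_team = ? AND DATE(commence_time) = ?
    LIMIT 1
    """,
    ("Duke", "Kansas", "2026-01-28"),
).fetchone()
print(row)  # ('ev1',)
```

One caveat: `DATE()` reads the stored timestamp literally, so a late-night UTC tip-off can fall on a different calendar date than the local game date, which is a plausible source of the near-miss matches this script has to tolerate.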
+
+
+def migrate_orphaned_scores(db: OddsAPIDatabase, dry_run: bool = False) -> None:
+ """Migrate orphaned scores to canonical event IDs.
+
+ Args:
+ db: Database connection
+ dry_run: If True, only show what would be done
+ """
+ logger.info("Finding orphaned scores...")
+
+ # Find scores that don't have a matching event
+ query = """
+ SELECT
+ s.event_id,
+ s.home_score,
+ s.away_score
+ FROM scores s
+ LEFT JOIN events e ON s.event_id = e.event_id
+ WHERE e.event_id IS NULL
+ """
+
+ orphaned_scores = pd.read_sql_query(query, db.conn)
+ logger.info(f"Found {len(orphaned_scores)} orphaned scores")
+
+ if len(orphaned_scores) == 0:
+ logger.info("No orphaned scores to migrate!")
+ return
+
+ # Get event details for orphaned scores from ESPN scores table
+ migrated = 0
+ not_found = 0
+
+ for _, score_row in orphaned_scores.iterrows():
+ old_event_id = score_row["event_id"]
+
+ # Try to get event details from ESPN scores
+ espn_query = """
+ SELECT
+ home_team,
+ away_team,
+ game_date
+ FROM espn_scores
+ WHERE espn_event_id = ?
+ """
+ espn_result = db.conn.execute(espn_query, (old_event_id,)).fetchone()
+
+ if not espn_result:
+ # Try extracting from scores table metadata (if available)
+ # For now, skip if we can't find event details
+ logger.warning(f"Could not find ESPN data for orphaned score: {old_event_id}")
+ not_found += 1
+ continue
+
+ home_team, away_team, game_date = espn_result
+
+ # Find matching canonical event
+ canonical_id = find_matching_event(home_team, away_team, game_date, db)
+
+ if canonical_id:
+ logger.info(f"Found match: {old_event_id} -> {canonical_id}")
+
+ if dry_run:
+ logger.info(f" [DRY RUN] Would migrate score to {canonical_id}")
+ else:
+ # Check if canonical already has score
+ check_query = "SELECT COUNT(*) FROM scores WHERE event_id = ?"
+ has_score = db.conn.execute(check_query, (canonical_id,)).fetchone()[0] > 0
+
+ if not has_score:
+ # Migrate score
+ update_query = """
+ UPDATE scores
+ SET event_id = ?
+ WHERE event_id = ?
+ """
+ db.conn.execute(update_query, (canonical_id, old_event_id))
+ logger.info(f" Migrated score to {canonical_id}")
+ migrated += 1
+ else:
+ # Delete duplicate
+ delete_query = "DELETE FROM scores WHERE event_id = ?"
+ db.conn.execute(delete_query, (old_event_id,))
+ logger.info(" Deleted duplicate score (canonical already has one)")
+
+ db.conn.commit()
+ else:
+ logger.warning(
+ f"Could not find matching event for: {home_team} @ {away_team} on {game_date}"
+ )
+ not_found += 1
+
+ logger.info("\n=== SUMMARY ===")
+ logger.info(f"Orphaned scores found: {len(orphaned_scores)}")
+ logger.info(f"Scores migrated: {migrated}")
+ logger.info(f"Not found: {not_found}")
+
+ if not dry_run and migrated > 0:
+ logger.info("[OK] Score migration complete!")
+
+
+def main() -> None:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(description="Migrate orphaned scores")
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be done without making changes",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=str,
+ default="data/odds_api/odds_api.sqlite3",
+ help="Path to SQLite database",
+ )
+
+ args = parser.parse_args()
+
+ db = OddsAPIDatabase(args.db_path)
+ migrate_orphaned_scores(db, dry_run=args.dry_run)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/processing/verify_team_mapping.py b/scripts/processing/verify_team_mapping.py
new file mode 100644
index 000000000..c3e974622
--- /dev/null
+++ b/scripts/processing/verify_team_mapping.py
@@ -0,0 +1,40 @@
+"""Verify team name mapping quality."""
+
+from __future__ import annotations
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+
+df = read_parquet_df("data/staging/mappings/team_mapping.parquet")
+
+print("=== Verification of Known Teams ===")
+test_teams = [
+ "Duke",
+ "Kentucky",
+ "Kansas",
+ "North Carolina",
+ "Gonzaga",
+ "Michigan",
+ "Alabama",
+ "UCLA",
+]
+
+for team in test_teams:
+ matches = df[df["kenpom_name"] == team]
+ if len(matches) > 0:
+ row = matches.iloc[0]
+ print(f"\n{team}:")
+ print(f" Odds API: {row['odds_api_name']} (score: {row['odds_api_match_score']})")
+        if isinstance(row["espn_name"], str) and row["espn_name"]:  # missing values are NaN, which is truthy
+ print(f" ESPN: {row['espn_name']} (score: {row['espn_match_score']})")
+ else:
+ print(f"\n{team}: NOT FOUND")
+
+print("\n\n=== Mapping Statistics ===")
+print(f"Total teams: {len(df)}")
+odds_matched = (df["odds_api_name"].notna() & (df["odds_api_name"] != "")).sum()
+espn_matched = (df["espn_name"].notna() & (df["espn_name"] != "")).sum()
+high_quality = ((df["odds_api_match_score"] >= 90) | (df["espn_match_score"] >= 90)).sum()
+
+print(f"Matched to Odds API: {odds_matched} ({odds_matched / len(df):.1%})")
+print(f"Matched to ESPN: {espn_matched} ({espn_matched / len(df):.1%})")
+print(f"High quality matches (score>=90): {high_quality} ({high_quality / len(df):.1%})")
diff --git a/scripts/training/.gitkeep b/scripts/training/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/scripts/training/calibrate_win_probability.py b/scripts/training/calibrate_win_probability.py
new file mode 100644
index 000000000..54111ebd0
--- /dev/null
+++ b/scripts/training/calibrate_win_probability.py
@@ -0,0 +1,241 @@
+#!/usr/bin/env python3
+"""Calibrate win probability formula by comparing to KenPom FanMatch predictions.
+
+Uses KenPom's published win probabilities (HomeWP) to find the optimal constant
+for converting efficiency margin to win probability.
+
+Usage:
+    uv run python scripts/training/calibrate_win_probability.py
+    uv run python scripts/training/calibrate_win_probability.py --dates 2026-01-18 2026-01-25 2026-01-28
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import pyarrow.parquet as pq
+
+from sports_betting_edge.utils.team_matching import match_to_kenpom
+
+
+def kenpom_win_probability(team_rating: float, opp_rating: float, constant: float) -> float:
+ """Calculate win probability from efficiency ratings with adjustable constant.
+
+ Args:
+ team_rating: Team's AdjEM
+ opp_rating: Opponent's AdjEM
+ constant: Scaling constant for conversion
+
+ Returns:
+ Win probability (0-1)
+ """
+ diff = team_rating - opp_rating
+ return 1 / (1 + 10 ** (-diff / constant))
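As a quick numeric sanity check of the logistic form above: equal ratings give exactly 0.5, and the constant controls how much a given AdjEM edge is worth (the sample constants below are illustrative; fitting the real one is the whole point of this script):

```python
def kenpom_win_probability(team_rating: float, opp_rating: float, constant: float) -> float:
    # Same formula as the script: logistic in base 10 over the rating gap
    diff = team_rating - opp_rating
    return 1 / (1 + 10 ** (-diff / constant))


print(kenpom_win_probability(20.0, 20.0, 10.0))               # 0.5 (even matchup)
print(round(kenpom_win_probability(25.0, 15.0, 10.0), 4))     # 0.9091 (+10 edge, c=10)
# A larger constant flattens the curve: the same edge is worth less
print(round(kenpom_win_probability(25.0, 15.0, 25.0), 4))     # 0.7153 (+10 edge, c=25)
```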
+
+
+def load_fanmatch_games(fanmatch_path: Path) -> list[dict]:
+ """Load FanMatch games from parquet."""
+ table = pq.read_table(fanmatch_path)
+ return table.to_pylist()
+
+
+def load_kenpom_ratings(ratings_path: Path) -> dict[str, float]:
+ """Load KenPom ratings (AdjEM only)."""
+ table = pq.read_table(ratings_path)
+ ratings = {}
+ for i in range(len(table)):
+ team = table["TeamName"][i].as_py()
+ ratings[team] = table["AdjEM"][i].as_py()
+ return ratings
+
+
+def calculate_errors(
+ games: list[dict], ratings: dict[str, float], constant: float
+) -> tuple[float, int]:
+ """Calculate mean absolute error for a given constant.
+
+ Args:
+ games: List of FanMatch game dicts
+ ratings: Dict of team ratings (AdjEM)
+ constant: Constant to test
+
+ Returns:
+ Tuple of (mean_absolute_error, num_games_matched)
+ """
+ errors = []
+
+ for game in games:
+ visitor = game["Visitor"]
+ home = game["Home"]
+
+ # Match to ratings
+ visitor_matched = match_to_kenpom(visitor, list(ratings.keys()))
+ home_matched = match_to_kenpom(home, list(ratings.keys()))
+
+ if not visitor_matched or not home_matched:
+ continue
+
+ visitor_rating = ratings[visitor_matched]
+ home_rating = ratings[home_matched]
+
+ # KenPom's published win probability (HomeWP is 0-100)
+ kenpom_home_wp = game["HomeWP"] / 100.0
+
+ # Our calculated win probability with this constant
+ # Note: KenPom includes HCA in their ratings, so we don't add it here
+ calculated_home_wp = kenpom_win_probability(home_rating, visitor_rating, constant)
+
+ # Calculate error
+ error = abs(kenpom_home_wp - calculated_home_wp)
+ errors.append(error)
+
+ if not errors:
+ return float("inf"), 0
+
+ return sum(errors) / len(errors), len(errors)
+
+
+def test_constants(
+ games: list[dict], ratings: dict[str, float], constants: list[float]
+) -> list[tuple[float, float, int]]:
+ """Test multiple constants and return results.
+
+ Args:
+ games: List of FanMatch games
+ ratings: Dict of team ratings
+ constants: List of constants to test
+
+ Returns:
+ List of (constant, mae, num_games) sorted by MAE
+ """
+ results = []
+
+ for constant in constants:
+ mae, num_games = calculate_errors(games, ratings, constant)
+ results.append((constant, mae, num_games))
+
+ results.sort(key=lambda x: x[1])
+ return results
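The grid search above reduces to: score every candidate constant by MAE against the published probabilities, then take the minimum. On synthetic data generated with a known constant, the grid recovers it exactly (all names here are local to this sketch):

```python
def wp(diff: float, c: float) -> float:
    return 1 / (1 + 10 ** (-diff / c))


# Generate "published" win probabilities from a known constant
true_c = 18.0
diffs = [-12.0, -5.0, 0.0, 4.0, 9.0, 15.0]
published = [wp(d, true_c) for d in diffs]


def mae(c: float) -> float:
    errs = [abs(p - wp(d, c)) for d, p in zip(diffs, published)]
    return sum(errs) / len(errs)


candidates = [10.0 + 0.5 * i for i in range(41)]  # 10.0 .. 30.0 in 0.5 steps
best = min(candidates, key=mae)
print(best)  # 18.0
```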
+
+
+def main() -> int:
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="Calibrate win probability formula using KenPom FanMatch data"
+ )
+ parser.add_argument(
+ "--dates",
+ nargs="+",
+ default=["2026-01-28"],
+ help="Dates to use for calibration (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--kenpom-dir",
+ type=Path,
+ default=Path("./data/kenpom"),
+ help="KenPom data directory",
+ )
+ parser.add_argument(
+ "--ratings-date",
+ type=str,
+ default="2026-01-31",
+ help="Date of ratings file to use (YYYY-MM-DD)",
+ )
+ parser.add_argument(
+ "--min-constant",
+ type=float,
+ default=10.0,
+ help="Minimum constant to test",
+ )
+ parser.add_argument(
+ "--max-constant",
+ type=float,
+ default=30.0,
+ help="Maximum constant to test",
+ )
+ parser.add_argument(
+ "--step",
+ type=float,
+ default=0.5,
+ help="Step size for testing constants",
+ )
+
+ args = parser.parse_args()
+
+ try:
+ # Load ratings
+ ratings_path = args.kenpom_dir / "ratings" / f"{args.ratings_date}.parquet"
+ print(f"[OK] Loading ratings from {ratings_path}")
+ ratings = load_kenpom_ratings(ratings_path)
+ print(f"[OK] Loaded {len(ratings)} team ratings")
+
+ # Load FanMatch games
+ all_games = []
+ for date_str in args.dates:
+ fanmatch_path = args.kenpom_dir / "fanmatch" / f"{date_str}.parquet"
+ if not fanmatch_path.exists():
+ print(f"[WARNING] FanMatch data not found for {date_str}, skipping")
+ continue
+
+ games = load_fanmatch_games(fanmatch_path)
+ all_games.extend(games)
+ print(f"[OK] Loaded {len(games)} games from {date_str}")
+
+ if not all_games:
+ print("[ERROR] No games loaded")
+ return 1
+
+ print(f"\n[OK] Total games: {len(all_games)}")
+
+ # Test constants
+ print(
+ f"\n[OK] Testing constants from {args.min_constant} to "
+ f"{args.max_constant} (step={args.step})..."
+ )
+ constants = [
+ args.min_constant + i * args.step
+ for i in range(int((args.max_constant - args.min_constant) / args.step) + 1)
+ ]
+
+ results = test_constants(all_games, ratings, constants)
+
+ # Display results
+ print("\n=== Top 10 Constants by Mean Absolute Error ===\n")
+ print(f"{'Constant':<12} {'MAE':<12} {'Games Matched':<15}")
+ print("-" * 40)
+
+ for constant, mae, num_games in results[:10]:
+ print(f"{constant:<12.1f} {mae:<12.6f} {num_games:<15}")
+
+ # Best constant
+ best_constant, best_mae, best_games = results[0]
+
+ print("\n=== Optimal Constant ===")
+ print(f"Constant: {best_constant:.1f}")
+ print(f"Mean Absolute Error: {best_mae:.6f} ({best_mae * 100:.4f}%)")
+ print(f"Games Matched: {best_games}/{len(all_games)}")
+
+ print(f"\n[OK] Update kenpom_win_probability() with constant={best_constant}")
+ print(
+ f" win_prob = 1 / (1 + 10 ** (-diff / {best_constant:.1f})) "
+ f"# Calibrated from KenPom FanMatch"
+ )
+
+ return 0
+
+ except FileNotFoundError as e:
+ print(f"[ERROR] {e}")
+ return 1
+ except Exception as e:
+ print(f"[ERROR] Calibration failed: {e}")
+ import traceback
+
+ traceback.print_exc()
+ return 1
+
+
+if __name__ == "__main__":
+ import sys
+
+ sys.exit(main())
diff --git a/scripts/training/run_walk_forward_backtest.py b/scripts/training/run_walk_forward_backtest.py
new file mode 100644
index 000000000..01214bddc
--- /dev/null
+++ b/scripts/training/run_walk_forward_backtest.py
@@ -0,0 +1,205 @@
+"""Run walk-forward backtest for score regression models.
+
+Retrains score models on expanding/rolling historical windows,
+measures true out-of-sample performance, and stores results.
+
+Usage:
+ uv run python scripts/training/run_walk_forward_backtest.py
+ uv run python scripts/training/run_walk_forward_backtest.py --start 2025-12-01 --end 2026-02-09
+ uv run python scripts/training/run_walk_forward_backtest.py --step 7 --window expanding
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import uuid
+from datetime import date, datetime
+from pathlib import Path
+from typing import Any
+
+from sports_betting_edge.adapters.odds_api_db import OddsAPIDatabase
+from sports_betting_edge.config.settings import settings
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+from sports_betting_edge.services.walk_forward_validation import (
+ WalkForwardValidator,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Entry point for walk-forward backtest."""
+ parser = argparse.ArgumentParser(
+ description="Walk-forward backtest for score regression models"
+ )
+ parser.add_argument(
+ "--start",
+ type=str,
+ default="2025-12-01",
+ help="Backtest start date (default: 2025-12-01)",
+ )
+ parser.add_argument(
+ "--end",
+ type=str,
+ default=None,
+ help="Backtest end date (default: today)",
+ )
+ parser.add_argument(
+ "--train-days",
+ type=int,
+ default=30,
+ help="Training window in days (default: 30)",
+ )
+ parser.add_argument(
+ "--step",
+ type=int,
+ default=7,
+ help="Step size in days (default: 7)",
+ )
+ parser.add_argument(
+ "--test-days",
+ type=int,
+ default=7,
+ help="Test window in days (default: 7)",
+ )
+ parser.add_argument(
+ "--window",
+ choices=["expanding", "rolling"],
+ default="expanding",
+ help="Window type (default: expanding)",
+ )
+ parser.add_argument(
+ "--min-train",
+ type=int,
+ default=100,
+ help="Minimum training samples (default: 100)",
+ )
+ parser.add_argument(
+ "--min-test",
+ type=int,
+ default=10,
+ help="Minimum test samples (default: 10)",
+ )
+ parser.add_argument(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season (default: 2026)",
+ )
+ parser.add_argument(
+ "--staging-path",
+ type=Path,
+ default=Path(str(settings.staging_dir)),
+ help="Staging data directory",
+ )
+ parser.add_argument(
+ "--db-path",
+ type=Path,
+ default=Path("data/odds_api/odds_api.sqlite3"),
+ help="Database path for storing results",
+ )
+ parser.add_argument(
+ "--store-results",
+ action="store_true",
+ help="Store results in database",
+ )
+ parser.add_argument(
+ "--output",
+ type=Path,
+ default=None,
+ help="Save results CSV to this path",
+ )
+ args = parser.parse_args()
+
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ )
+
+ end_date = args.end or date.today().isoformat()
+
+ logger.info("=" * 80)
+ logger.info("Walk-Forward Backtest: Score Regression Models")
+ logger.info("=" * 80)
+ logger.info("Period: %s to %s", args.start, end_date)
+ logger.info(
+ "Window: %s (%d day train, %d day test, %d day step)",
+ args.window,
+ args.train_days,
+ args.test_days,
+ args.step,
+ )
+
+ # Initialize
+ engineer = FeatureEngineer(staging_path=str(args.staging_path))
+
+ validator = WalkForwardValidator(
+ train_window_days=args.train_days,
+ test_window_days=args.test_days,
+ step_days=args.step,
+ window_type=args.window,
+ min_train_samples=args.min_train,
+ min_test_samples=args.min_test,
+ )
+
+ # Run validation
+ results_df = validator.validate_score_models(
+ engineer=engineer,
+ start_date=args.start,
+ end_date=end_date,
+ season=args.season,
+ )
+
+ if results_df.empty:
+ logger.warning("No results produced - check date range and data")
+ return
+
+ # Store results in database
+ if args.store_results:
+ backtest_id = f"score_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
+ db = OddsAPIDatabase(str(args.db_path))
+
+ for _, row in results_df.iterrows():
+ skip = {
+ "split_id",
+ "train_start",
+ "train_end",
+ "test_start",
+ "test_end",
+ "train_samples",
+ "test_samples",
+ }
+ metrics: dict[str, Any] = {
+ str(k): (float(v) if isinstance(v, int | float) else str(v))
+ for k, v in row.items()
+ if str(k) not in skip
+ }
+ db.store_backtest_result(
+ backtest_id=backtest_id,
+ split_id=int(row["split_id"]),
+ model_type="score_regression",
+ train_start=str(row["train_start"]),
+ train_end=str(row["train_end"]),
+ test_start=str(row["test_start"]),
+ test_end=str(row["test_end"]),
+ train_samples=int(row["train_samples"]),
+ test_samples=int(row["test_samples"]),
+ metrics=metrics,
+ )
+
+ logger.info("[OK] Stored results with backtest_id=%s", backtest_id)
+
+ # Save CSV
+ if args.output:
+ args.output.parent.mkdir(parents=True, exist_ok=True)
+ results_df.to_csv(args.output, index=False)
+ logger.info("[OK] Results saved to %s", args.output)
+
+ logger.info("\n" + "=" * 80)
+ logger.info("Backtest Complete")
+ logger.info("=" * 80)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_advanced_ensemble.py b/scripts/training/train_advanced_ensemble.py
new file mode 100644
index 000000000..bc3176155
--- /dev/null
+++ b/scripts/training/train_advanced_ensemble.py
@@ -0,0 +1,543 @@
+#!/usr/bin/env python3
+"""Train advanced ensemble models for score prediction.
+
+State-of-the-art ensemble combining:
+- XGBoost, LightGBM, CatBoost regressors
+- Stacked ensemble with Ridge meta-learner
+- Optuna hyperparameter optimization
+- Advanced feature engineering
+- Cross-validation with proper evaluation
+"""
+
+from __future__ import annotations
+
+import logging
+import pickle
+from datetime import datetime
+from pathlib import Path
+
+import click
+import numpy as np
+import optuna
+import pandas as pd
+from sklearn.linear_model import Ridge
+from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+)
+logger = logging.getLogger(__name__)
+
+
+def build_advanced_features(
+ staging_path: Path,
+ start_date: str,
+ end_date: str,
+) -> pd.DataFrame:
+ """Build advanced features including rolling stats and interactions.
+
+ Args:
+ staging_path: Path to staging data directory
+ start_date: Start date (YYYY-MM-DD)
+ end_date: End date (YYYY-MM-DD)
+
+ Returns:
+ DataFrame with advanced features and targets
+ """
+ from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+ engineer = FeatureEngineer(staging_path=str(staging_path))
+
+ # Load raw merged data
+ logger.info(f"Loading data from {start_date} to {end_date}...")
+ start_dt = datetime.fromisoformat(start_date).date()
+ end_dt = datetime.fromisoformat(end_date).date()
+
+ merged = engineer.load_staging_data(
+ start_dt,
+ end_dt,
+ season=2026, # NOTE: hardcoded; the CLI does not yet expose a --season option
+ require_line_features=False,
+ use_home_away=True,
+ )
+
+ logger.info(f"Loaded {len(merged)} games")
+
+ # Base features
+ X = pd.DataFrame()
+
+ # Home team features
+ X["home_adj_em"] = merged["adj_em_home"]
+ X["home_pythag"] = merged["pythag_home"]
+ X["home_adj_o"] = merged["adj_o_home"]
+ X["home_adj_d"] = merged["adj_d_home"]
+ X["home_adj_t"] = merged["adj_t_home"]
+ X["home_luck"] = merged["luck_home"]
+ X["home_sos"] = merged["sos_home"]
+ X["home_efg_pct"] = merged["efg_pct_home"]
+ X["home_to_pct"] = merged["to_pct_home"]
+ X["home_or_pct"] = merged["or_pct_home"]
+ X["home_ft_rate"] = merged["ft_rate_home"]
+
+ # Away team features
+ X["away_adj_em"] = merged["adj_em_away"]
+ X["away_pythag"] = merged["pythag_away"]
+ X["away_adj_o"] = merged["adj_o_away"]
+ X["away_adj_d"] = merged["adj_d_away"]
+ X["away_adj_t"] = merged["adj_t_away"]
+ X["away_luck"] = merged["luck_away"]
+ X["away_sos"] = merged["sos_away"]
+ X["away_efg_pct"] = merged["efg_pct_away"]
+ X["away_to_pct"] = merged["to_pct_away"]
+ X["away_or_pct"] = merged["or_pct_away"]
+ X["away_ft_rate"] = merged["ft_rate_away"]
+
+ # ========== ADVANCED FEATURES ==========
+
+ # 1. Efficiency Matchups (offense vs defense)
+ X["home_off_vs_def"] = X["home_adj_o"] - X["away_adj_d"]
+ X["away_off_vs_def"] = X["away_adj_o"] - X["home_adj_d"]
+
+ # 2. Tempo-adjusted scoring potential
+ X["home_scoring_potential"] = (X["home_adj_o"] * X["away_adj_d"] / 100) * (
+ X["home_adj_t"] / 100
+ )
+ X["away_scoring_potential"] = (X["away_adj_o"] * X["home_adj_d"] / 100) * (
+ X["away_adj_t"] / 100
+ )
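+ # Illustrative example (made-up numbers, not from data): with adj_o=110,
+ # adj_d=100 and adj_t=68, scoring potential = (110 * 100 / 100) * (68 / 100)
+ # = 74.8 expected points.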
+
+ # 3. Expected total and pace
+ X["expected_total"] = X["home_scoring_potential"] + X["away_scoring_potential"]
+ X["avg_tempo"] = (X["home_adj_t"] + X["away_adj_t"]) / 2
+ X["tempo_advantage"] = X["home_adj_t"] - X["away_adj_t"]
+
+ # 4. Four Factors differentials
+ X["efg_diff"] = X["home_efg_pct"] - X["away_efg_pct"]
+ X["to_diff"] = X["home_to_pct"] - X["away_to_pct"]
+ X["or_diff"] = X["home_or_pct"] - X["away_or_pct"]
+ X["ft_diff"] = X["home_ft_rate"] - X["away_ft_rate"]
+
+ # 5. Composite strength metrics
+ X["home_overall_strength"] = (
+ X["home_adj_em"] * 0.4
+ + X["home_pythag"] * 0.3
+ + X["home_adj_o"] * 0.15
+ + X["home_adj_d"] * 0.15
+ )
+ X["away_overall_strength"] = (
+ X["away_adj_em"] * 0.4
+ + X["away_pythag"] * 0.3
+ + X["away_adj_o"] * 0.15
+ + X["away_adj_d"] * 0.15
+ )
+ X["strength_diff"] = X["home_overall_strength"] - X["away_overall_strength"]
+
+ # 6. Interaction features (most predictive combinations)
+ X["home_o_x_tempo"] = X["home_adj_o"] * X["home_adj_t"]
+ X["away_o_x_tempo"] = X["away_adj_o"] * X["away_adj_t"]
+ X["efficiency_product"] = X["home_adj_em"] * X["away_adj_em"]
+ X["pythag_product"] = X["home_pythag"] * X["away_pythag"]
+
+ # 7. Luck and overperformance factors
+ X["luck_diff"] = X["home_luck"] - X["away_luck"]
+ X["total_luck"] = X["home_luck"] + X["away_luck"]
+
+ # 8. Schedule strength impact
+ X["sos_diff"] = X["home_sos"] - X["away_sos"]
+
+ # 9. Add line features if available
+ if "opening_total" in merged.columns:
+ X["opening_total"] = merged["opening_total"]
+ X["closing_total"] = merged["closing_total"]
+ X["total_movement"] = merged["closing_total"] - merged["opening_total"]
+ X["expected_vs_line"] = X["expected_total"] - X["closing_total"]
+
+ # Add targets
+ X["home_score"] = merged["home_score"]
+ X["away_score"] = merged["away_score"]
+ X["margin"] = merged["home_score"] - merged["away_score"]
+ X["total_score"] = merged["home_score"] + merged["away_score"]
+
+ # Drop missing
+ X = X.dropna(subset=["home_score", "away_score"])
+
+ logger.info(f"Final dataset: {len(X)} games, {len(X.columns) - 4} features")
+
+ return X
+
+
+def optimize_xgboost(trial: optuna.Trial) -> dict:
+ """Suggest XGBoost hyperparameters."""
+ return {
+ "max_depth": trial.suggest_int("xgb_max_depth", 3, 10),
+ "learning_rate": trial.suggest_float("xgb_learning_rate", 0.01, 0.3, log=True),
+ "min_child_weight": trial.suggest_int("xgb_min_child_weight", 1, 10),
+ "subsample": trial.suggest_float("xgb_subsample", 0.6, 1.0),
+ "colsample_bytree": trial.suggest_float("xgb_colsample_bytree", 0.6, 1.0),
+ "gamma": trial.suggest_float("xgb_gamma", 0.0, 5.0),
+ "reg_alpha": trial.suggest_float("xgb_reg_alpha", 0.0, 10.0),
+ "reg_lambda": trial.suggest_float("xgb_reg_lambda", 0.0, 10.0),
+ "n_estimators": trial.suggest_int("xgb_n_estimators", 50, 500),
+ }
+
+
+def optimize_lightgbm(trial: optuna.Trial) -> dict:
+ """Suggest LightGBM hyperparameters."""
+ return {
+ "num_leaves": trial.suggest_int("lgb_num_leaves", 10, 100),
+ "learning_rate": trial.suggest_float("lgb_learning_rate", 0.01, 0.3, log=True),
+ "min_child_samples": trial.suggest_int("lgb_min_child_samples", 5, 50),
+ "subsample": trial.suggest_float("lgb_subsample", 0.6, 1.0),
+ "colsample_bytree": trial.suggest_float("lgb_colsample_bytree", 0.6, 1.0),
+ "reg_alpha": trial.suggest_float("lgb_reg_alpha", 0.0, 10.0),
+ "reg_lambda": trial.suggest_float("lgb_reg_lambda", 0.0, 10.0),
+ "n_estimators": trial.suggest_int("lgb_n_estimators", 50, 500),
+ }
+
+
+def optimize_catboost(trial: optuna.Trial) -> dict:
+ """Suggest CatBoost hyperparameters."""
+ return {
+ "depth": trial.suggest_int("cat_depth", 3, 10),
+ "learning_rate": trial.suggest_float("cat_learning_rate", 0.01, 0.3, log=True),
+ "l2_leaf_reg": trial.suggest_float("cat_l2_leaf_reg", 0.1, 10.0),
+ "iterations": trial.suggest_int("cat_iterations", 50, 500),
+ }
+
+
+def train_base_models(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ n_trials: int = 50,
+) -> dict:
+ """Train and optimize base models.
+
+ Args:
+ X_train: Training features
+ y_train: Training target
+ X_val: Validation features
+ y_val: Validation target
+ n_trials: Number of Optuna trials
+
+ Returns:
+ Dictionary of trained models
+ """
+ import lightgbm as lgb
+ import xgboost as xgb
+ from catboost import CatBoostRegressor
+
+ models = {}
+
+ # ========== XGBoost ==========
+ logger.info("Training XGBoost with Optuna...")
+
+ def xgb_objective(trial):
+ params = optimize_xgboost(trial)
+ model = xgb.XGBRegressor(
+ **params,
+ objective="reg:squarederror",
+ random_state=42,
+ n_jobs=-1,
+ )
+ model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
+ y_pred = model.predict(X_val)
+ return np.sqrt(mean_squared_error(y_val, y_pred))
+
+ xgb_study = optuna.create_study(direction="minimize", study_name="xgboost")
+ xgb_study.optimize(xgb_objective, n_trials=n_trials, show_progress_bar=True)
+
+ # Train final XGBoost
+ best_xgb_params = {k.replace("xgb_", ""): v for k, v in xgb_study.best_params.items()}
+ models["xgboost"] = xgb.XGBRegressor(
+ **best_xgb_params,
+ objective="reg:squarederror",
+ random_state=42,
+ n_jobs=-1,
+ )
+ models["xgboost"].fit(X_train, y_train)
+ logger.info(f"XGBoost Best RMSE: {xgb_study.best_value:.3f}")
+
+ # ========== LightGBM ==========
+ logger.info("Training LightGBM with Optuna...")
+
+ def lgb_objective(trial):
+ params = optimize_lightgbm(trial)
+ model = lgb.LGBMRegressor(**params, random_state=42, n_jobs=-1, verbose=-1)
+ model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
+ y_pred = model.predict(X_val)
+ return np.sqrt(mean_squared_error(y_val, y_pred))
+
+ lgb_study = optuna.create_study(direction="minimize", study_name="lightgbm")
+ lgb_study.optimize(lgb_objective, n_trials=n_trials, show_progress_bar=True)
+
+ # Train final LightGBM
+ best_lgb_params = {k.replace("lgb_", ""): v for k, v in lgb_study.best_params.items()}
+ models["lightgbm"] = lgb.LGBMRegressor(
+ **best_lgb_params, random_state=42, n_jobs=-1, verbose=-1
+ )
+ models["lightgbm"].fit(X_train, y_train)
+ logger.info(f"LightGBM Best RMSE: {lgb_study.best_value:.3f}")
+
+ # ========== CatBoost ==========
+ logger.info("Training CatBoost with Optuna...")
+
+ def cat_objective(trial):
+ params = optimize_catboost(trial)
+ model = CatBoostRegressor(**params, random_state=42, verbose=False, thread_count=-1)
+ model.fit(X_train, y_train, eval_set=(X_val, y_val))
+ y_pred = model.predict(X_val)
+ return np.sqrt(mean_squared_error(y_val, y_pred))
+
+ cat_study = optuna.create_study(direction="minimize", study_name="catboost")
+ cat_study.optimize(cat_objective, n_trials=n_trials, show_progress_bar=True)
+
+ # Train final CatBoost
+ best_cat_params = {k.replace("cat_", ""): v for k, v in cat_study.best_params.items()}
+ models["catboost"] = CatBoostRegressor(
+ **best_cat_params, random_state=42, verbose=False, thread_count=-1
+ )
+ models["catboost"].fit(X_train, y_train)
+ logger.info(f"CatBoost Best RMSE: {cat_study.best_value:.3f}")
+
+ return models
+
+
+def build_stacked_ensemble(
+ base_models: dict,
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+) -> Ridge:
+ """Build stacked ensemble with Ridge meta-learner.
+
+ Args:
+ base_models: Dictionary of trained base models
+ X_train: Training features
+ y_train: Training target
+ X_val: Validation features
+ y_val: Validation target
+
+ Returns:
+ Trained meta-learner
+ """
+ logger.info("Building stacked ensemble...")
+
+ # Generate meta-features (predictions from base models)
+ # NOTE: meta_train holds in-sample predictions on X_train; out-of-fold
+ # predictions (e.g. sklearn's cross_val_predict) would reduce leakage
+ # into the meta-learner.
+ meta_train = np.column_stack([model.predict(X_train) for model in base_models.values()])
+ meta_val = np.column_stack([model.predict(X_val) for model in base_models.values()])
+
+ # Train meta-learner
+ meta_learner = Ridge(alpha=1.0)
+ meta_learner.fit(meta_train, y_train)
+
+ # Evaluate
+ train_pred = meta_learner.predict(meta_train)
+ val_pred = meta_learner.predict(meta_val)
+
+ train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
+ val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
+
+ logger.info(f"Stacked Train RMSE: {train_rmse:.3f}")
+ logger.info(f"Stacked Val RMSE: {val_rmse:.3f}")
+
+ # Show base model weights
+ weights = meta_learner.coef_
+ logger.info("\nMeta-learner weights:")
+ for name, weight in zip(base_models.keys(), weights, strict=False):
+ logger.info(f" {name}: {weight:.3f}")
+
+ return meta_learner
+
+
+def evaluate_ensemble(
+ base_models: dict,
+ meta_learner: Ridge,
+ X: pd.DataFrame,
+ y: pd.Series,
+ name: str,
+) -> dict:
+ """Evaluate ensemble performance.
+
+ Args:
+ base_models: Dictionary of trained base models
+ meta_learner: Trained meta-learner
+ X: Features
+ y: True values
+ name: Dataset name for logging
+
+ Returns:
+ Dictionary of metrics
+ """
+ # Base model predictions
+ base_preds = np.column_stack([model.predict(X) for model in base_models.values()])
+
+ # Ensemble prediction
+ y_pred = meta_learner.predict(base_preds)
+
+ mae = mean_absolute_error(y, y_pred)
+ rmse = np.sqrt(mean_squared_error(y, y_pred))
+ r2 = r2_score(y, y_pred)
+
+ logger.info(f"\n{name} Performance:")
+ logger.info(f" MAE: {mae:.2f} points")
+ logger.info(f" RMSE: {rmse:.2f} points")
+ logger.info(f" R²: {r2:.4f}")
+
+ return {"mae": mae, "rmse": rmse, "r2": r2}
+
+
+@click.command()
+@click.option(
+ "--start-date",
+ required=True,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="Start date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--end-date",
+ required=True,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="End date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--staging-path",
+ default="data/staging",
+ type=click.Path(path_type=Path),
+ help="Path to staging data directory",
+)
+@click.option(
+ "--output-dir",
+ default="models",
+ type=click.Path(path_type=Path),
+ help="Output directory for models",
+)
+@click.option(
+ "--n-trials",
+ default=50,
+ type=int,
+ help="Number of Optuna trials per model",
+)
+def main(
+ start_date: datetime,
+ end_date: datetime,
+ staging_path: Path,
+ output_dir: Path,
+ n_trials: int,
+) -> None:
+ """Train advanced ensemble score prediction models."""
+ logger.info("=" * 80)
+ logger.info("ADVANCED ENSEMBLE SCORE PREDICTION")
+ logger.info("=" * 80)
+
+ start_str = start_date.strftime("%Y-%m-%d")
+ end_str = end_date.strftime("%Y-%m-%d")
+
+ # Build advanced features
+ df = build_advanced_features(staging_path, start_str, end_str)
+
+ # Prepare data
+ target_cols = ["home_score", "away_score", "margin", "total_score"]
+ feature_cols = [col for col in df.columns if col not in target_cols]
+
+ X = df[feature_cols]
+ y_home = df["home_score"]
+ y_away = df["away_score"]
+
+ # Split data (80/20 positional split; chronological when rows are date-ordered)
+ train_size = int(0.8 * len(X))
+ X_train, X_val = X[:train_size], X[train_size:]
+ y_train_home, y_val_home = y_home[:train_size], y_home[train_size:]
+ y_train_away, y_val_away = y_away[:train_size], y_away[train_size:]
+
+ logger.info("\nDataset Info:")
+ logger.info(f" Features: {len(feature_cols)}")
+ logger.info(f" Train: {len(X_train)} games")
+ logger.info(f" Val: {len(X_val)} games")
+
+ # ========== TRAIN HOME SCORE ENSEMBLE ==========
+ logger.info("\n" + "=" * 80)
+ logger.info("TRAINING HOME SCORE ENSEMBLE")
+ logger.info("=" * 80)
+
+ home_base_models = train_base_models(X_train, y_train_home, X_val, y_val_home, n_trials)
+ home_meta = build_stacked_ensemble(home_base_models, X_train, y_train_home, X_val, y_val_home)
+
+ # ========== TRAIN AWAY SCORE ENSEMBLE ==========
+ logger.info("\n" + "=" * 80)
+ logger.info("TRAINING AWAY SCORE ENSEMBLE")
+ logger.info("=" * 80)
+
+ away_base_models = train_base_models(X_train, y_train_away, X_val, y_val_away, n_trials)
+ away_meta = build_stacked_ensemble(away_base_models, X_train, y_train_away, X_val, y_val_away)
+
+ # ========== EVALUATE ==========
+ # NOTE: the validation set was also used for Optuna tuning, so these
+ # metrics are optimistic; a separate held-out test window would give an
+ # unbiased estimate.
+ logger.info("\n" + "=" * 80)
+ logger.info("FINAL EVALUATION")
+ logger.info("=" * 80)
+
+ home_metrics = evaluate_ensemble(
+ home_base_models, home_meta, X_val, y_val_home, "Home Score Ensemble"
+ )
+ away_metrics = evaluate_ensemble(
+ away_base_models, away_meta, X_val, y_val_away, "Away Score Ensemble"
+ )
+
+ # Derived metrics
+ home_pred = home_meta.predict(
+ np.column_stack([m.predict(X_val) for m in home_base_models.values()])
+ )
+ away_pred = away_meta.predict(
+ np.column_stack([m.predict(X_val) for m in away_base_models.values()])
+ )
+
+ margin_pred = home_pred - away_pred
+ total_pred = home_pred + away_pred
+ margin_true = y_val_home - y_val_away
+ total_true = y_val_home + y_val_away
+
+ logger.info("\nDerived Predictions:")
+ logger.info(f" Margin MAE: {mean_absolute_error(margin_true, margin_pred):.2f} pts")
+ logger.info(f" Margin RMSE: {np.sqrt(mean_squared_error(margin_true, margin_pred)):.2f} pts")
+ logger.info(f" Total MAE: {mean_absolute_error(total_true, total_pred):.2f} pts")
+ logger.info(f" Total RMSE: {np.sqrt(mean_squared_error(total_true, total_pred)):.2f} pts")
+
+ # ========== SAVE MODELS ==========
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ ensemble_home = {
+ "base_models": home_base_models,
+ "meta_learner": home_meta,
+ "feature_names": feature_cols,
+ "metrics": home_metrics,
+ }
+
+ ensemble_away = {
+ "base_models": away_base_models,
+ "meta_learner": away_meta,
+ "feature_names": feature_cols,
+ "metrics": away_metrics,
+ }
+
+ home_path = output_dir / "ensemble_home_2026.pkl"
+ away_path = output_dir / "ensemble_away_2026.pkl"
+
+ with open(home_path, "wb") as f:
+ pickle.dump(ensemble_home, f)
+ with open(away_path, "wb") as f:
+ pickle.dump(ensemble_away, f)
+
+ logger.info("\n[OK] Saved ensemble models:")
+ logger.info(f" Home: {home_path}")
+ logger.info(f" Away: {away_path}")
+
+ logger.info("\n" + "=" * 80)
+ logger.info("TRAINING COMPLETE")
+ logger.info("=" * 80)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_score_models.py b/scripts/training/train_score_models.py
new file mode 100644
index 000000000..60742c0d7
--- /dev/null
+++ b/scripts/training/train_score_models.py
@@ -0,0 +1,578 @@
+#!/usr/bin/env python3
+"""Train regression models to predict game scores and margins.
+
+This script trains XGBoost regression models to predict:
+1. Home team score
+2. Away team score
+3. Margin (home - away)
+4. Total (home + away)
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import pickle
+from datetime import datetime
+from pathlib import Path
+
+import click
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.config.settings import settings
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+logger = logging.getLogger(__name__)
+
+# D1 average constants for expected points formula
+DI_AVG_EFF = 109.15 # D1 avg offensive/defensive efficiency (per 100 poss)
+DI_AVG_TEMPO = 67.34 # D1 avg possessions per game
+DEFAULT_HCA_PTS = 3.2 # Fallback HCA when per-team data unavailable
+
+
+def build_score_features(
+ staging_path: Path,
+ start_date: str,
+ end_date: str,
+) -> pd.DataFrame:
+ """Build features for score prediction.
+
+ Args:
+ staging_path: Path to staging data directory
+ start_date: Start date (YYYY-MM-DD)
+ end_date: End date (YYYY-MM-DD)
+
+ Returns:
+ DataFrame with features and score targets
+ """
+ from datetime import datetime as dt
+
+ engineer = FeatureEngineer(staging_path=str(staging_path))
+
+ # Load raw merged data with scores
+ logger.info(f"Building dataset from {start_date} to {end_date}...")
+ start_dt = dt.fromisoformat(start_date).date()
+ end_dt = dt.fromisoformat(end_date).date()
+
+ merged = engineer.load_staging_data(
+ start_dt,
+ end_dt,
+ season=2026, # NOTE: hardcoded; the --season CLI option only controls the date range
+ require_line_features=False,
+ use_home_away=True,
+ )
+
+ logger.info(f"Loaded {len(merged)} games")
+
+ # Build features (same as totals model)
+ X = pd.DataFrame()
+
+ # Home team features
+ # Removed: pythag, adj_em, adj_o, adj_d (all captured by expected_pts formula)
+ # expected_pts = f(adj_o, opp_adj_d, tempo) encodes the efficiency matchup
+ X["home_adj_t"] = merged["adj_t_home"]
+ X["home_luck"] = merged["luck_home"]
+ X["home_sos"] = merged["sos_home"]
+ X["home_height"] = merged["height_eff_home"]
+ X["home_efg_pct"] = merged["efg_pct_home"]
+ X["home_to_pct"] = merged["to_pct_home"]
+ X["home_or_pct"] = merged["or_pct_home"]
+ X["home_ft_rate"] = merged["ft_rate_home"]
+
+ # Away team features
+ # Removed: pythag, adj_em, adj_o, adj_d (all captured by expected_pts formula)
+ X["away_adj_t"] = merged["adj_t_away"]
+ X["away_luck"] = merged["luck_away"]
+ X["away_sos"] = merged["sos_away"]
+ X["away_height"] = merged["height_eff_away"]
+ X["away_efg_pct"] = merged["efg_pct_away"]
+ X["away_to_pct"] = merged["to_pct_away"]
+ X["away_or_pct"] = merged["or_pct_away"]
+ X["away_ft_rate"] = merged["ft_rate_away"]
+
+ # Combined features
+ # NOTE: total_offense, avg_defense removed (r>0.85 with adj_o/adj_d)
+ X["avg_tempo"] = (X["home_adj_t"] + X["away_adj_t"]) / 2
+ X["avg_luck"] = (X["home_luck"] + X["away_luck"]) / 2
+ X["height_diff"] = X["home_height"] - X["away_height"]
+
+ # Expected pts computed from raw merged data (adj_o/adj_d not in X)
+ game_tempo = (merged["adj_t_home"] * merged["adj_t_away"]) / DI_AVG_TEMPO
+ # Per-team HCA (points component) or league average fallback
+ home_hca = (
+ merged["hca_pts_home"].fillna(DEFAULT_HCA_PTS)
+ if "hca_pts_home" in merged.columns
+ else DEFAULT_HCA_PTS
+ )
+ X["home_expected_pts"] = (merged["adj_o_home"] * merged["adj_d_away"] / DI_AVG_EFF) * (
+ game_tempo / 100
+ ) + (home_hca / 2)
+ X["away_expected_pts"] = (merged["adj_o_away"] * merged["adj_d_home"] / DI_AVG_EFF) * (
+ game_tempo / 100
+ ) - (home_hca / 2)
+ X["expected_total"] = X["home_expected_pts"] + X["away_expected_pts"]
+ if "hca_home" in merged.columns:
+ X["home_hca"] = merged["hca_home"].fillna(0.0)
+
+ # Differential features - REDUCED to avoid double-counting talent gap
+ # Removed: adj_em_diff, pythag_diff, adj_o_diff, adj_d_diff (r>0.85 with expected_pts)
+ # Keep tempo diff (independent signal) and Four Factors diffs (shooting quality, not efficiency)
+ X["adj_t_diff"] = X["home_adj_t"] - X["away_adj_t"]
+
+ # Four Factors differentials (shooting quality - different signal from efficiency)
+ X["efg_pct_diff"] = X["home_efg_pct"] - X["away_efg_pct"]
+ X["to_pct_diff"] = X["home_to_pct"] - X["away_to_pct"]
+ X["or_pct_diff"] = X["home_or_pct"] - X["away_or_pct"]
+ X["ft_rate_diff"] = X["home_ft_rate"] - X["away_ft_rate"]
+
+ # Contextual differentials
+ X["sos_diff"] = X["home_sos"] - X["away_sos"]
+ X["luck_diff"] = X["home_luck"] - X["away_luck"]
+
+ # Add line features if available
+ if "opening_total" in merged.columns:
+ X["opening_total"] = merged["opening_total"]
+ X["closing_total"] = merged["closing_total"]
+ X["total_movement"] = merged["closing_total"] - merged["opening_total"]
+
+ # KenPom FanMatch features (optional - XGBoost handles NaN)
+ if "kp_predicted_margin" in merged.columns:
+ X["kp_predicted_margin"] = merged["kp_predicted_margin"]
+ X["kp_predicted_total"] = merged["kp_predicted_total"]
+ X["kp_home_wp"] = merged["kp_home_wp"]
+
+ # Market disagreement features
+ if "favorite_team" in merged.columns and "closing_spread" in merged.columns:
+ is_home_fav = merged["home_team"] == merged["favorite_team"]
+ market_home_margin = (
+ merged["closing_spread"].abs().where(is_home_fav, -merged["closing_spread"].abs())
+ )
+ X["kp_market_margin_diff"] = merged["kp_predicted_margin"] - market_home_margin
+ if "closing_total" in merged.columns:
+ X["kp_market_total_diff"] = merged["kp_predicted_total"] - merged["closing_total"]
+
+ # Rest & situational features
+ X["home_rest_days"] = merged["home_rest_days"]
+ X["away_rest_days"] = merged["away_rest_days"]
+ X["home_back_to_back"] = merged["home_back_to_back"]
+ X["away_back_to_back"] = merged["away_back_to_back"]
+ X["home_short_rest"] = merged["home_short_rest"]
+ X["away_short_rest"] = merged["away_short_rest"]
+ X["away_road_streak"] = merged["away_road_streak"]
+ X["away_days_on_road"] = merged["away_days_on_road"]
+ X["rest_advantage"] = X["home_rest_days"] - X["away_rest_days"]
+ X["total_back_to_back"] = (X["home_back_to_back"] | X["away_back_to_back"]).astype(int)
+ X["total_short_rest"] = (X["home_short_rest"] | X["away_short_rest"]).astype(int)
+
+ # Add score targets
+ X["home_score"] = merged["home_score"]
+ X["away_score"] = merged["away_score"]
+ X["margin"] = merged["home_score"] - merged["away_score"]
+ X["total_score"] = merged["home_score"] + merged["away_score"]
+
+ # Drop rows with missing scores
+ X = X.dropna(subset=["home_score", "away_score"])
+
+ logger.info(f"Final dataset: {len(X)} games with complete scores")
+
+ return X
+
+
+def train_score_models(
+ X_train: pd.DataFrame,
+ X_val: pd.DataFrame,
+ y_train_home: pd.Series,
+ y_val_home: pd.Series,
+ y_train_away: pd.Series,
+ y_val_away: pd.Series,
+) -> tuple[xgb.XGBRegressor, xgb.XGBRegressor]:
+ """Train regression models for home and away scores.
+
+ Args:
+ X_train: Training features
+ X_val: Validation features
+ y_train_home: Training home scores
+ y_val_home: Validation home scores
+ y_train_away: Training away scores
+ y_val_away: Validation away scores
+
+ Returns:
+ Tuple of (home_model, away_model)
+ """
+ # Regularized parameters to prevent extreme mismatch overprediction
+ # Key changes from v1: shallower trees, higher min samples per leaf,
+ # L1/L2 regularization, and early stopping
+ params = {
+ "objective": "reg:squarederror",
+ "learning_rate": 0.05,
+ "max_depth": 4,
+ "min_child_weight": 10,
+ "subsample": 0.7,
+ "colsample_bytree": 0.6,
+ "reg_alpha": 1.0,
+ "reg_lambda": 5.0,
+ "gamma": 1.0,
+ "n_estimators": 300,
+ "early_stopping_rounds": 20,
+ "random_state": 42,
+ "n_jobs": -1,
+ }
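+ # NOTE: passing early_stopping_rounds via the constructor requires
+ # xgboost >= 1.6; older releases expect it as a fit() keyword instead.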
+
+ # Train home score model
+ logger.info("Training home score model...")
+ home_model = xgb.XGBRegressor(**params)
+ home_model.fit(
+ X_train,
+ y_train_home,
+ eval_set=[(X_val, y_val_home)],
+ verbose=False,
+ )
+ logger.info(f" Home model stopped at {home_model.best_iteration} rounds")
+
+ # Train away score model
+ logger.info("Training away score model...")
+ away_model = xgb.XGBRegressor(**params)
+ away_model.fit(
+ X_train,
+ y_train_away,
+ eval_set=[(X_val, y_val_away)],
+ verbose=False,
+ )
+ logger.info(f" Away model stopped at {away_model.best_iteration} rounds")
+
+ return home_model, away_model
+
+
+def evaluate_model(
+ model: xgb.XGBRegressor,
+ X: pd.DataFrame,
+ y: pd.Series,
+ name: str,
+) -> dict[str, float]:
+ """Evaluate regression model performance.
+
+ Args:
+ model: Trained model
+ X: Features
+ y: True values
+ name: Model name for logging
+
+ Returns:
+ Dictionary of metrics
+ """
+ import numpy as np
+
+ y_pred = model.predict(X)
+
+ mae = mean_absolute_error(y, y_pred)
+ mse = mean_squared_error(y, y_pred)
+ rmse = np.sqrt(mse)
+ r2 = r2_score(y, y_pred)
+
+ logger.info(f"\n{name} Performance:")
+ logger.info(f" MAE: {mae:.2f} points")
+ logger.info(f" RMSE: {rmse:.2f} points")
+ logger.info(f" R²: {r2:.4f}")
+
+ return {"mae": mae, "rmse": rmse, "r2": r2}
+
+
+def diagnose_bias(
+ home_model: xgb.XGBRegressor,
+ away_model: xgb.XGBRegressor,
+ X_val: pd.DataFrame,
+ y_val_home: pd.Series,
+ y_val_away: pd.Series,
+) -> dict[str, float]:
+ """Diagnose systematic bias in score predictions.
+
+ Args:
+ home_model: Trained home score model
+ away_model: Trained away score model
+ X_val: Validation features
+ y_val_home: Actual home scores
+ y_val_away: Actual away scores
+
+ Returns:
+ Dictionary of bias metrics
+ """
+ import numpy as np
+
+ home_pred = home_model.predict(X_val)
+ away_pred = away_model.predict(X_val)
+
+ # Per-component bias (positive = overprediction)
+ home_bias = float(np.mean(home_pred - y_val_home))
+ away_bias = float(np.mean(away_pred - y_val_away))
+
+ # Derived metrics
+ total_pred = home_pred + away_pred
+ total_actual = np.asarray(y_val_home) + np.asarray(y_val_away)
+ total_bias = float(np.mean(total_pred - total_actual))
+ total_mae = float(np.mean(np.abs(total_pred - total_actual)))
+
+ margin_pred = home_pred - away_pred
+ margin_actual = np.asarray(y_val_home) - np.asarray(y_val_away)
+ margin_bias = float(np.mean(margin_pred - margin_actual))
+
+ # Actual averages
+ actual_total_mean = float(np.mean(total_actual))
+ pred_total_mean = float(np.mean(total_pred))
+
+ # Market comparison (if closing_total available)
+ market_total_mae = None
+ market_total_bias = None
+ if "closing_total" in X_val.columns:
+ closing = X_val["closing_total"].values
+ valid = ~np.isnan(closing)
+ if valid.sum() > 10:
+ market_total_mae = float(np.mean(np.abs(closing[valid] - total_actual[valid])))
+ market_total_bias = float(np.mean(closing[valid] - total_actual[valid]))
+
+ logger.info("\n" + "=" * 80)
+ logger.info("BIAS DIAGNOSTICS")
+ logger.info("=" * 80)
+ logger.info(f" Home score bias: {home_bias:+.2f} pts")
+ logger.info(f" Away score bias: {away_bias:+.2f} pts")
+ logger.info(f" Margin bias: {margin_bias:+.2f} pts")
+ logger.info(f" Total bias: {total_bias:+.2f} pts")
+ logger.info(f" Actual total mean: {actual_total_mean:.1f}")
+ logger.info(f" Pred total mean: {pred_total_mean:.1f}")
+ logger.info(f" Total MAE: {total_mae:.2f} pts")
+
+ if market_total_mae is not None:
+ logger.info(f" Market total MAE: {market_total_mae:.2f} pts")
+ logger.info(f" Market total bias: {market_total_bias:+.2f} pts")
+
+ # Warn if bias is large
+ if abs(total_bias) > 2.0:
+ logger.warning(f"[WARNING] Total bias of {total_bias:+.2f} exceeds +/- 2.0 threshold!")
+
+ metrics = {
+ "home_bias": home_bias,
+ "away_bias": away_bias,
+ "margin_bias": margin_bias,
+ "total_bias": total_bias,
+ "actual_total_mean": actual_total_mean,
+ "pred_total_mean": pred_total_mean,
+ "total_mae": total_mae,
+ }
+ if market_total_mae is not None and market_total_bias is not None:
+ metrics["market_total_mae"] = market_total_mae
+ metrics["market_total_bias"] = market_total_bias
+
+ return metrics
+
+
+def _season_to_dates(season: int) -> tuple[str, str]:
+ """Convert a season year to a (start, end) date range.
+
+ Args:
+ season: Season year (e.g. 2026 for the season starting November 2025)
+
+ Returns:
+ Tuple of (start_date, end_date) as YYYY-MM-DD strings. The end date is
+ today, so training always uses every game collected so far.
+ """
+ start = f"{season - 1}-11-04"
+ end = datetime.now().strftime("%Y-%m-%d")
+ return start, end
+
+
+@click.command()
+@click.option(
+ "--start-date",
+ required=False,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="Start date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--end-date",
+ required=False,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="End date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--season",
+ required=False,
+ type=int,
+ help="Season year (e.g. 2026). Auto-computes start/end dates.",
+)
+@click.option(
+ "--staging-path",
+ default=str(settings.staging_dir),
+ type=click.Path(path_type=Path),
+ help="Path to staging data directory",
+)
+@click.option(
+ "--output-dir",
+ default=str(settings.models_dir),
+ type=click.Path(path_type=Path),
+ help="Output directory for models",
+)
+def main(
+ start_date: datetime | None,
+ end_date: datetime | None,
+ season: int | None,
+ staging_path: Path,
+ output_dir: Path,
+) -> None:
+ """Train score prediction models."""
+ # Configure logging
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ )
+
+ logger.info("=" * 80)
+ logger.info("Score Prediction Model Training")
+ logger.info("=" * 80)
+
+ # Resolve dates from --season or --start-date/--end-date
+ if season is not None:
+ start_str, end_str = _season_to_dates(season)
+ logger.info(f"Season {season} -> {start_str} to {end_str}")
+ elif start_date is not None and end_date is not None:
+ start_str = start_date.strftime("%Y-%m-%d")
+ end_str = end_date.strftime("%Y-%m-%d")
+ else:
+ raise click.UsageError("Provide either --season or both --start-date and --end-date")
+
+ # Build dataset
+ df = build_score_features(staging_path, start_str, end_str)
+
+ # Select features (exclude target columns)
+ target_cols = [
+ "home_score",
+ "away_score",
+ "margin",
+ "total_score",
+ "went_over", # The target from build_totals_dataset
+ ]
+ feature_cols = [col for col in df.columns if col not in target_cols]
+
+ X = df[feature_cols]
+ y_home = df["home_score"]
+ y_away = df["away_score"]
+
+ logger.info(f"Features: {len(feature_cols)} columns")
+ logger.info(f"Samples: {len(X)} games")
+
+    # Split data (same random_state in both calls keeps home/away splits row-aligned)
+ X_train, X_val, y_train_home, y_val_home = train_test_split(
+ X,
+ y_home,
+ test_size=0.2,
+ random_state=42,
+ )
+ _, _, y_train_away, y_val_away = train_test_split(
+ X,
+ y_away,
+ test_size=0.2,
+ random_state=42,
+ )
+
+ logger.info(f"Train: {len(X_train)} games")
+ logger.info(f"Val: {len(X_val)} games")
+
+ # Train models
+ home_model, away_model = train_score_models(
+ X_train,
+ X_val,
+ y_train_home,
+ y_val_home,
+ y_train_away,
+ y_val_away,
+ )
+
+ # Evaluate models
+ logger.info("\n" + "=" * 80)
+ logger.info("Model Evaluation")
+ logger.info("=" * 80)
+
+ home_metrics = evaluate_model(home_model, X_val, y_val_home, "Home Score")
+ away_metrics = evaluate_model(away_model, X_val, y_val_away, "Away Score")
+
+ # Compute derived metrics (margin and total)
+    import numpy as np  # local import: np is only needed for these diagnostics
+
+ home_pred = home_model.predict(X_val)
+ away_pred = away_model.predict(X_val)
+ margin_pred = home_pred - away_pred
+ total_pred = home_pred + away_pred
+
+ margin_true = y_val_home - y_val_away
+ total_true = y_val_home + y_val_away
+
+ margin_mae = mean_absolute_error(margin_true, margin_pred)
+ margin_mse = mean_squared_error(margin_true, margin_pred)
+ margin_rmse = np.sqrt(margin_mse)
+ total_mae = mean_absolute_error(total_true, total_pred)
+ total_mse = mean_squared_error(total_true, total_pred)
+ total_rmse = np.sqrt(total_mse)
+
+ logger.info("\nDerived Predictions:")
+ logger.info(f" Margin MAE: {margin_mae:.2f} points")
+ logger.info(f" Margin RMSE: {margin_rmse:.2f} points")
+ logger.info(f" Total MAE: {total_mae:.2f} points")
+ logger.info(f" Total RMSE: {total_rmse:.2f} points")
+
+ # Bias diagnostics
+ bias_metrics = diagnose_bias(home_model, away_model, X_val, y_val_home, y_val_away)
+
+ # Save models
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ home_path = output_dir / "home_score_2026.pkl"
+ away_path = output_dir / "away_score_2026.pkl"
+
+ with open(home_path, "wb") as f:
+ pickle.dump(home_model, f)
+ with open(away_path, "wb") as f:
+ pickle.dump(away_model, f)
+
+ logger.info("\n[OK] Saved models:")
+ logger.info(f" Home: {home_path}")
+ logger.info(f" Away: {away_path}")
+
+ # Save feature names
+ feature_path = output_dir / "score_features.txt"
+ feature_path.write_text("\n".join(feature_cols))
+ logger.info(f" Features: {feature_path}")
+
+ # Save enhanced metadata
+ metadata = {
+ "trained_at": datetime.now().isoformat(),
+ "date_range": {"start": start_str, "end": end_str},
+ "samples": {
+ "total": len(X),
+ "train": len(X_train),
+ "val": len(X_val),
+ },
+ "features": len(feature_cols),
+ "evaluation": {
+ "home_mae": float(home_metrics["mae"]),
+ "home_rmse": float(home_metrics["rmse"]),
+ "away_mae": float(away_metrics["mae"]),
+ "away_rmse": float(away_metrics["rmse"]),
+ "margin_mae": float(margin_mae),
+ "margin_rmse": float(margin_rmse),
+ "total_mae": float(total_mae),
+ "total_rmse": float(total_rmse),
+ },
+ "bias": bias_metrics,
+ }
+ metadata_path = output_dir / "score_model_metadata.json"
+ with open(metadata_path, "w") as f:
+ json.dump(metadata, f, indent=2)
+ logger.info(f" Metadata: {metadata_path}")
+
+ logger.info("\n" + "=" * 80)
+ logger.info("Training Complete")
+ logger.info("=" * 80)
+
+
+if __name__ == "__main__":
+ main()
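The derived margin/total evaluation in `main` can be illustrated without trained models; the numbers below are made up, and plain Python stands in for the script's NumPy/sklearn calls:

```python
# Hypothetical home/away score predictions and truths for three games
home_pred = [78.0, 71.5, 84.0]
away_pred = [70.0, 74.0, 79.5]
home_true = [80, 70, 85]
away_true = [72, 75, 78]

# Margin is derived from the two score models, not modeled directly
margin_pred = [h - a for h, a in zip(home_pred, away_pred)]  # [8.0, -2.5, 4.5]
margin_true = [h - a for h, a in zip(home_true, away_true)]  # [8, -5, 7]

n = len(margin_true)
margin_mae = sum(abs(t - p) for t, p in zip(margin_true, margin_pred)) / n
margin_rmse = (sum((t - p) ** 2 for t, p in zip(margin_true, margin_pred)) / n) ** 0.5
```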
diff --git a/scripts/training/train_seed_ensemble.py b/scripts/training/train_seed_ensemble.py
new file mode 100644
index 000000000..f63e12b4a
--- /dev/null
+++ b/scripts/training/train_seed_ensemble.py
@@ -0,0 +1,332 @@
+"""Train ensemble of XGBoost models with different random seeds.
+
+This script implements the Day 5-7 ensemble strategy from IMPROVEMENT_SUMMARY.md:
+- Train 5 XGBoost models with different random seeds (42, 123, 456, 789, 1024)
+- Weight predictions by validation AUC
+- Compare ensemble performance vs single best model
+
+Expected improvement: +2-4% AUC through reduced variance
+
+Usage:
+ python scripts/training/train_seed_ensemble.py --model-type spreads
+ python scripts/training/train_seed_ensemble.py --model-type totals --n-seeds 7
+"""
+
+from __future__ import annotations
+
+import logging
+import pickle
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Literal
+
+import click
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+class SeedEnsemble:
+ """Ensemble of models trained with different random seeds."""
+
+ def __init__(self, models: list[Any], weights: list[float]) -> None:
+ """Initialize seed ensemble.
+
+ Args:
+ models: List of trained models
+ weights: List of weights (validation AUCs)
+ """
+ self.models = models
+ self.weights = np.array(weights)
+ # Normalize weights to sum to 1
+ self.weights = self.weights / self.weights.sum()
+
+ def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
+ """Predict probabilities using weighted average.
+
+ Args:
+ X: Feature matrix
+
+ Returns:
+ Array of probabilities (positive class)
+ """
+ predictions = np.zeros(len(X))
+
+ for model, weight in zip(self.models, self.weights, strict=False):
+ pred = model.predict_proba(X)[:, 1]
+ predictions += weight * pred
+
+ return predictions
+
+ def predict(self, X: pd.DataFrame) -> np.ndarray:
+ """Predict class labels.
+
+ Args:
+ X: Feature matrix
+
+ Returns:
+ Array of class labels
+ """
+ proba = self.predict_proba(X)
+ return (proba >= 0.5).astype(int)
+
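A minimal numeric sketch of what `SeedEnsemble` does with its AUC weights (values are illustrative):

```python
# Validation AUCs serve as raw weights; normalization makes them sum to 1
val_aucs = [0.55, 0.60, 0.65]
weights = [a / sum(val_aucs) for a in val_aucs]  # ~[0.306, 0.333, 0.361]

# Per-model positive-class probabilities for a single game
probs = [0.48, 0.55, 0.62]

# Weighted average: better-scoring seeds contribute more, none dominates
ensemble_prob = sum(w * p for w, p in zip(weights, probs))
label = int(ensemble_prob >= 0.5)
```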
+
+def train_single_seed_model(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ seed: int,
+ base_params: dict[str, Any],
+) -> tuple[Any, float]:
+ """Train a single XGBoost model with given seed.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+ X_val: Validation features
+ y_val: Validation labels
+ seed: Random seed
+ base_params: Base model parameters
+
+ Returns:
+ Tuple of (trained model, validation AUC)
+ """
+ logger.info(f"Training model with seed={seed}...")
+
+ # Create params with seed
+ params = base_params.copy()
+ params["random_state"] = seed
+
+ # Train model
+ model = xgb.XGBClassifier(**params)
+ model.fit(
+ X_train,
+ y_train,
+ eval_set=[(X_val, y_val)],
+ verbose=False,
+ )
+
+ # Evaluate
+ y_val_pred = model.predict_proba(X_val)[:, 1]
+ val_auc = roc_auc_score(y_val, y_val_pred)
+
+ logger.info(f" Seed {seed} validation AUC: {val_auc:.4f}")
+
+ return model, val_auc
+
+
+@click.command()
+@click.option(
+ "--model-type",
+ type=click.Choice(["spreads", "totals"]),
+ required=True,
+ help="Type of model to train",
+)
+@click.option(
+ "--n-seeds",
+ type=int,
+ default=5,
+ help="Number of models with different seeds",
+)
+@click.option(
+ "--start-date",
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ default="2025-11-04",
+ help="Start date for training data",
+)
+@click.option(
+ "--end-date",
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ default="2026-02-05",
+ help="End date for training data",
+)
+@click.option(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+)
+@click.option(
+ "--output-dir",
+ type=click.Path(path_type=Path),
+ default="data/models",
+ help="Output directory for models",
+)
+def main(
+ model_type: Literal["spreads", "totals"],
+ n_seeds: int,
+    start_date: datetime,
+    end_date: datetime,
+ season: int,
+ output_dir: Path,
+) -> None:
+ """Train ensemble of XGBoost models with different random seeds."""
+ logger.info(f"[OK] === Training {model_type.upper()} Seed Ensemble ===\n")
+ logger.info("Configuration:")
+ logger.info(f" Model type: {model_type}")
+ logger.info(f" Number of seeds: {n_seeds}")
+ logger.info(f" Date range: {start_date.date()} to {end_date.date()}")
+ logger.info(f" Season: {season}\n")
+
+ # Create output directory
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ # Load data
+ logger.info("Loading training data...")
+ engineer = FeatureEngineer(staging_path="data/staging/")
+
+ if model_type == "spreads":
+ X, y = engineer.build_spreads_dataset(
+ start_date=start_date.date(),
+ end_date=end_date.date(),
+ season=season,
+ )
+ else: # totals
+ X, y = engineer.build_totals_dataset(
+ start_date=start_date.date(),
+ end_date=end_date.date(),
+ season=season,
+ )
+
+ logger.info(f"Dataset: {len(X)} samples, {len(X.columns)} features")
+ logger.info(f"Positive rate: {y.mean():.2%}\n")
+
+ if len(X) == 0:
+ logger.error("No training data found. Exiting.")
+ return
+
+ # Train/val split
+ X_train, X_val, y_train, y_val = train_test_split(
+ X, y, test_size=0.2, random_state=42, stratify=y
+ )
+ logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}\n")
+
+ # Load best hyperparameters from previous tuning
+ # These are from the optimized models (spreads_2026_optimized.pkl, totals_2026_optimized.pkl)
+ if model_type == "spreads":
+ # From Trial #5 in ENSEMBLE_TRAINING_SUMMARY.md
+ base_params = {
+ "n_estimators": 500,
+ "max_depth": 10,
+ "learning_rate": 0.2442,
+ "min_child_weight": 9,
+ "gamma": 2.9895,
+ "reg_alpha": 4.6094,
+ "reg_lambda": 0.4425,
+ "subsample": 0.5980,
+ "colsample_bytree": 0.5226,
+ "colsample_bylevel": 0.7966,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "early_stopping_rounds": 20,
+ }
+ else: # totals
+ # From Trial #46 in ENSEMBLE_TRAINING_SUMMARY.md
+ base_params = {
+ "n_estimators": 400,
+ "max_depth": 11,
+ "learning_rate": 0.045,
+ "min_child_weight": 5,
+ "gamma": 4.2,
+ "reg_alpha": 4.3,
+ "reg_lambda": 0.7,
+ "subsample": 0.88,
+ "colsample_bytree": 0.94,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "early_stopping_rounds": 20,
+ }
+
+    # Fixed seed list for reproducibility (truncated to n_seeds)
+ seeds = [42, 123, 456, 789, 1024, 2048, 4096, 8192, 16384][:n_seeds]
+ logger.info(f"Training {n_seeds} models with seeds: {seeds}\n")
+
+ # Train models
+ models = []
+ val_aucs = []
+
+ for seed in seeds:
+ model, val_auc = train_single_seed_model(X_train, y_train, X_val, y_val, seed, base_params)
+ models.append(model)
+ val_aucs.append(val_auc)
+
+ # Create ensemble
+ logger.info("\n[OK] === Ensemble Results ===\n")
+ ensemble = SeedEnsemble(models=models, weights=val_aucs)
+
+ # Evaluate individual models
+ logger.info("Individual model performance:")
+ for i, (seed, auc) in enumerate(zip(seeds, val_aucs, strict=False)):
+ weight_pct = ensemble.weights[i] * 100
+ logger.info(f" Seed {seed:5d}: AUC={auc:.4f}, Weight={weight_pct:.1f}%")
+
+ # Evaluate ensemble
+ logger.info("\nEnsemble performance:")
+ ensemble_pred = ensemble.predict_proba(X_val)
+ ensemble_auc = roc_auc_score(y_val, ensemble_pred)
+ ensemble_acc = accuracy_score(y_val, (ensemble_pred >= 0.5).astype(int))
+ ensemble_logloss = log_loss(y_val, ensemble_pred)
+
+ logger.info(f" Validation AUC: {ensemble_auc:.4f}")
+ logger.info(f" Validation Accuracy: {ensemble_acc:.2%}")
+ logger.info(f" Validation Log Loss: {ensemble_logloss:.4f}")
+
+ # Compare to best single model
+ best_single_auc = max(val_aucs)
+ best_single_idx = val_aucs.index(best_single_auc)
+ improvement = ensemble_auc - best_single_auc
+ improvement_pct = improvement / best_single_auc * 100
+
+ logger.info(f"\nComparison to best single model (seed {seeds[best_single_idx]}):")
+ logger.info(f" Best single AUC: {best_single_auc:.4f}")
+ logger.info(f" Ensemble AUC: {ensemble_auc:.4f}")
+ logger.info(f" Improvement: {improvement:+.4f} ({improvement_pct:+.2f}%)")
+
+ # Save ensemble
+ output_path = output_dir / f"{model_type}_2026_seed_ensemble.pkl"
+ with open(output_path, "wb") as f:
+ pickle.dump(ensemble, f)
+
+ logger.info(f"\n[OK] Saved ensemble to {output_path}")
+
+ # Save metadata
+ metadata = {
+ "model_type": model_type,
+ "n_seeds": n_seeds,
+ "seeds": seeds,
+ "val_aucs": val_aucs,
+ "ensemble_auc": ensemble_auc,
+ "ensemble_accuracy": ensemble_acc,
+ "ensemble_logloss": ensemble_logloss,
+ "best_single_auc": best_single_auc,
+ "improvement": improvement,
+ "improvement_pct": improvement_pct,
+ "weights": ensemble.weights.tolist(),
+ "date_range": {
+ "start": str(start_date.date()),
+ "end": str(end_date.date()),
+ },
+ "season": season,
+ "n_samples": len(X),
+ "n_features": len(X.columns),
+ }
+
+ metadata_path = output_dir / f"{model_type}_2026_seed_ensemble_metadata.json"
+ import json
+
+ with open(metadata_path, "w") as f:
+        json.dump(metadata, f, indent=2)
+
+ logger.info(f"[OK] Saved metadata to {metadata_path}\n")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_seed_ensemble_legacy.py b/scripts/training/train_seed_ensemble_legacy.py
new file mode 100644
index 000000000..5a3c0d4ac
--- /dev/null
+++ b/scripts/training/train_seed_ensemble_legacy.py
@@ -0,0 +1,303 @@
+"""Train ensemble of XGBoost models with different random seeds using legacy data.
+
+Uses complete_dataset.parquet which has scores + line features for 436 games.
+
+Usage:
+ python scripts/training/train_seed_ensemble_legacy.py --model-type spreads
+ python scripts/training/train_seed_ensemble_legacy.py --model-type totals
+"""
+
+from __future__ import annotations
+
+import logging
+import pickle
+from pathlib import Path
+from typing import Any, Literal
+
+import click
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.adapters.filesystem import read_parquet_df
+from sports_betting_edge.config.logging import configure_logging
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+class SeedEnsemble:
+ """Ensemble of models trained with different random seeds."""
+
+ def __init__(self, models: list[Any], weights: list[float]) -> None:
+ """Initialize seed ensemble.
+
+ Args:
+ models: List of trained models
+ weights: List of weights (validation AUCs)
+ """
+ self.models = models
+ self.weights = np.array(weights)
+ # Normalize weights to sum to 1
+ self.weights = self.weights / self.weights.sum()
+
+ def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
+ """Predict probabilities using weighted average.
+
+ Args:
+ X: Feature matrix
+
+ Returns:
+ Array of probabilities (positive class)
+ """
+ predictions = np.zeros(len(X))
+
+ for model, weight in zip(self.models, self.weights, strict=False):
+ pred = model.predict_proba(X)[:, 1]
+ predictions += weight * pred
+
+ return predictions
+
+ def predict(self, X: pd.DataFrame) -> np.ndarray:
+ """Predict class labels.
+
+ Args:
+ X: Feature matrix
+
+ Returns:
+ Array of class labels
+ """
+ proba = self.predict_proba(X)
+ return (proba >= 0.5).astype(int)
+
+
+def train_single_seed_model(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ seed: int,
+ base_params: dict[str, Any],
+) -> tuple[Any, float]:
+ """Train a single XGBoost model with given seed.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+ X_val: Validation features
+ y_val: Validation labels
+ seed: Random seed
+ base_params: Base model parameters
+
+ Returns:
+ Tuple of (trained model, validation AUC)
+ """
+ logger.info(f"Training model with seed={seed}...")
+
+ # Create params with seed
+ params = base_params.copy()
+ params["random_state"] = seed
+
+ # Train model
+ model = xgb.XGBClassifier(**params)
+ model.fit(
+ X_train,
+ y_train,
+ eval_set=[(X_val, y_val)],
+ verbose=False,
+ )
+
+ # Evaluate
+ y_val_pred = model.predict_proba(X_val)[:, 1]
+ val_auc = roc_auc_score(y_val, y_val_pred)
+
+ logger.info(f" Seed {seed} validation AUC: {val_auc:.4f}")
+
+ return model, val_auc
+
+
+@click.command()
+@click.option(
+ "--model-type",
+ type=click.Choice(["spreads", "totals"]),
+ required=True,
+ help="Type of model to train",
+)
+@click.option(
+ "--n-seeds",
+ type=int,
+ default=5,
+ help="Number of models with different seeds",
+)
+@click.option(
+ "--output-dir",
+ type=click.Path(path_type=Path),
+ default="data/models",
+ help="Output directory for models",
+)
+def main(
+ model_type: Literal["spreads", "totals"],
+ n_seeds: int,
+ output_dir: Path,
+) -> None:
+ """Train ensemble of XGBoost models with different random seeds."""
+ logger.info(f"[OK] === Training {model_type.upper()} Seed Ensemble ===\n")
+ logger.info("Configuration:")
+ logger.info(f" Model type: {model_type}")
+ logger.info(f" Number of seeds: {n_seeds}\n")
+
+ # Create output directory
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ # Load legacy complete dataset
+ logger.info("Loading complete_dataset.parquet...")
+ df = read_parquet_df("data/staging/complete_dataset.parquet")
+ logger.info(f"Loaded {len(df)} games from {df['game_date'].min()} to {df['game_date'].max()}\n")
+
+ # Define features based on what's available in complete_dataset
+ # Basic features: line features + game outcomes
+ if model_type == "spreads":
+ feature_cols = [
+ "consensus_opening_spread_magnitude",
+ "consensus_closing_spread_magnitude",
+ "opening_spread_range",
+ "closing_spread_range",
+ "num_books_spread",
+ "spread_magnitude_movement",
+ "opening_home_implied_prob",
+ "closing_home_implied_prob",
+ "home_is_favorite",
+ ]
+ label_col = "home_covered_spread"
+ else: # totals
+ feature_cols = [
+ "consensus_opening_total",
+ "consensus_closing_total",
+ "opening_total_range",
+ "closing_total_range",
+ "num_books_total",
+ "total_movement",
+ ]
+ label_col = "went_over"
+
+ # Filter to games with required features
+ required_cols = feature_cols + [label_col]
+ df_clean = df.dropna(subset=required_cols)
+
+ logger.info(f"Dataset after filtering for {model_type} features:")
+ logger.info(f" {len(df_clean)} games ({len(df_clean) / len(df) * 100:.1f}% coverage)")
+ logger.info(f" Features: {len(feature_cols)}")
+ logger.info(f" Positive rate: {df_clean[label_col].mean():.2%}\n")
+
+ if len(df_clean) == 0:
+ logger.error("No training data found. Exiting.")
+ return
+
+ X = df_clean[feature_cols]
+ y = df_clean[label_col]
+
+ # Train/val split
+ X_train, X_val, y_train, y_val = train_test_split(
+ X, y, test_size=0.2, random_state=42, stratify=y
+ )
+ logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}\n")
+
+ # Use simple parameters (no hyperparameter tuning)
+ base_params = {
+ "n_estimators": 300,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 5,
+ "gamma": 1.0,
+ "reg_alpha": 1.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "early_stopping_rounds": 20,
+ }
+
+ # Generate seeds
+ seeds = [42, 123, 456, 789, 1024, 2048, 4096, 8192, 16384][:n_seeds]
+ logger.info(f"Training {n_seeds} models with seeds: {seeds}\n")
+
+ # Train models
+ models = []
+ val_aucs = []
+
+ for seed in seeds:
+ model, val_auc = train_single_seed_model(X_train, y_train, X_val, y_val, seed, base_params)
+ models.append(model)
+ val_aucs.append(val_auc)
+
+ # Create ensemble
+ logger.info("\n[OK] === Ensemble Results ===\n")
+ ensemble = SeedEnsemble(models=models, weights=val_aucs)
+
+ # Evaluate individual models
+ logger.info("Individual model performance:")
+ for i, (seed, auc) in enumerate(zip(seeds, val_aucs, strict=False)):
+ weight_pct = ensemble.weights[i] * 100
+ logger.info(f" Seed {seed:5d}: AUC={auc:.4f}, Weight={weight_pct:.1f}%")
+
+ # Evaluate ensemble
+ logger.info("\nEnsemble performance:")
+ ensemble_pred = ensemble.predict_proba(X_val)
+ ensemble_auc = roc_auc_score(y_val, ensemble_pred)
+ ensemble_acc = accuracy_score(y_val, (ensemble_pred >= 0.5).astype(int))
+ ensemble_logloss = log_loss(y_val, ensemble_pred)
+
+ logger.info(f" Validation AUC: {ensemble_auc:.4f}")
+ logger.info(f" Validation Accuracy: {ensemble_acc:.2%}")
+ logger.info(f" Validation Log Loss: {ensemble_logloss:.4f}")
+
+ # Compare to best single model
+ best_single_auc = max(val_aucs)
+ best_single_idx = val_aucs.index(best_single_auc)
+ improvement = ensemble_auc - best_single_auc
+ improvement_pct = improvement / best_single_auc * 100
+
+ logger.info(f"\nComparison to best single model (seed {seeds[best_single_idx]}):")
+ logger.info(f" Best single AUC: {best_single_auc:.4f}")
+ logger.info(f" Ensemble AUC: {ensemble_auc:.4f}")
+ logger.info(f" Improvement: {improvement:+.4f} ({improvement_pct:+.2f}%)")
+
+ # Save ensemble
+ output_path = output_dir / f"{model_type}_2026_seed_ensemble_legacy.pkl"
+ with open(output_path, "wb") as f:
+ pickle.dump(ensemble, f)
+
+ logger.info(f"\n[OK] Saved ensemble to {output_path}")
+
+ # Save metadata
+ metadata = {
+ "model_type": model_type,
+ "n_seeds": n_seeds,
+ "seeds": seeds,
+ "val_aucs": val_aucs,
+ "ensemble_auc": ensemble_auc,
+ "ensemble_accuracy": ensemble_acc,
+ "ensemble_logloss": ensemble_logloss,
+ "best_single_auc": best_single_auc,
+ "improvement": improvement,
+ "improvement_pct": improvement_pct,
+ "weights": ensemble.weights.tolist(),
+ "n_samples": len(df_clean),
+ "n_features": len(feature_cols),
+ "features": feature_cols,
+ }
+
+ metadata_path = output_dir / f"{model_type}_2026_seed_ensemble_legacy_metadata.json"
+ import json
+
+ with open(metadata_path, "w") as f:
+        json.dump(metadata, f, indent=2)
+
+ logger.info(f"[OK] Saved metadata to {metadata_path}\n")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_spreads_ensemble.py b/scripts/training/train_spreads_ensemble.py
new file mode 100644
index 000000000..6bcf60f43
--- /dev/null
+++ b/scripts/training/train_spreads_ensemble.py
@@ -0,0 +1,163 @@
+"""Train ensemble of diverse models for spreads prediction.
+
+This script trains an ensemble combining XGBoost, LightGBM, and Random Forest
+models to improve prediction performance through model diversity.
+
+Usage:
+ python scripts/training/train_spreads_ensemble.py
+
+The script will:
+1. Load training data from staging layer
+2. Train three diverse models (XGBoost, LightGBM, Random Forest)
+3. Compare ensemble strategies (simple, weighted, stacking)
+4. Save the best performing ensemble
+
+Expected improvement: +3-5% AUC over single best model
+"""
+
+from __future__ import annotations
+
+import logging
+import pickle
+from datetime import date
+from pathlib import Path
+
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.services.ensemble import EnsembleTrainer
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Train spreads ensemble model."""
+ logger.info("[OK] === Training Spreads Ensemble ===\n")
+
+ # Configuration
+ START_DATE = date(2025, 11, 4)
+ END_DATE = date(2026, 2, 5)
+ SEASON = 2026
+ OUTPUT_DIR = Path("data/models")
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+ # Load data
+ logger.info(f"Loading training data from {START_DATE} to {END_DATE}...")
+ engineer = FeatureEngineer(staging_path="data/staging/")
+ X, y = engineer.build_spreads_dataset(
+ start_date=START_DATE,
+ end_date=END_DATE,
+ season=SEASON,
+ )
+
+ logger.info(f"Dataset: {len(X)} samples, {len(X.columns)} features")
+ logger.info(f"Cover rate: {y.mean():.2%}\n")
+
+ if len(X) == 0:
+ logger.error("No training data found. Exiting.")
+ return
+
+ # Train/val split
+ X_train, X_val, y_train, y_val = train_test_split(
+ X, y, test_size=0.2, random_state=42, stratify=y
+ )
+ logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}\n")
+
+ # Model parameters (use optimized parameters from hyperparameter tuning)
+ xgb_params = {
+ "n_estimators": 500,
+ "max_depth": 10,
+ "learning_rate": 0.2442,
+ "min_child_weight": 9,
+ "gamma": 2.9895,
+ "reg_alpha": 4.6094,
+ "reg_lambda": 0.4425,
+ "subsample": 0.5980,
+ "colsample_bytree": 0.5226,
+ "colsample_bylevel": 0.7966,
+ }
+
+ lgb_params = {
+ "n_estimators": 500,
+ "max_depth": 10,
+ "learning_rate": 0.05,
+ "num_leaves": 31,
+ "min_child_samples": 20,
+ "reg_alpha": 4.0,
+ "reg_lambda": 2.0,
+ "subsample": 0.6,
+ "colsample_bytree": 0.6,
+ }
+
+ rf_params = {
+ "n_estimators": 500,
+ "max_depth": 15,
+ "min_samples_split": 30,
+ "min_samples_leaf": 15,
+ "max_features": "sqrt",
+ }
+
+ model_params = {
+ "xgboost": xgb_params,
+ "lightgbm": lgb_params,
+ "rf": rf_params,
+ }
+
+ # Train ensembles with different strategies
+ trainer = EnsembleTrainer()
+
+ strategies = ["simple", "weighted", "stacking"]
+ ensembles = {}
+
+ for strategy in strategies:
+ logger.info(f"\n[OK] === Training {strategy.upper()} ensemble ===\n")
+ ensemble = trainer.train(
+ X_train,
+ y_train,
+ X_val,
+ y_val,
+ models=["xgboost", "lightgbm", "rf"],
+ strategy=strategy,
+ model_params=model_params,
+ )
+ ensembles[strategy] = ensemble
+
+ # Find best ensemble
+ logger.info("\n[OK] === Ensemble Comparison ===\n")
+ best_strategy = None
+ best_auc = 0.0
+
+ from sklearn.metrics import roc_auc_score
+
+ for strategy, ensemble in ensembles.items():
+ y_val_pred = ensemble.predict_proba(X_val)
+ auc = roc_auc_score(y_val, y_val_pred)
+ logger.info(f"{strategy.capitalize()} ensemble AUC: {auc:.4f}")
+
+ if auc > best_auc:
+ best_auc = auc
+ best_strategy = strategy
+
+ logger.info(f"\n[OK] Best ensemble: {best_strategy} (AUC: {best_auc:.4f})")
+
+ # Save best ensemble
+ output_path = OUTPUT_DIR / "spreads_2026_ensemble.pkl"
+ with open(output_path, "wb") as f:
+ pickle.dump(ensembles[best_strategy], f)
+
+ logger.info(f"[OK] Saved best ensemble to {output_path}\n")
+
+ # Save all ensembles for comparison
+ for strategy, ensemble in ensembles.items():
+ output_path = OUTPUT_DIR / f"spreads_2026_ensemble_{strategy}.pkl"
+ with open(output_path, "wb") as f:
+ pickle.dump(ensemble, f)
+ logger.info(f"[OK] Saved {strategy} ensemble to {output_path}")
+
+ logger.info("\n[OK] === Ensemble Training Complete ===")
+
+
+if __name__ == "__main__":
+ main()
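The strategy comparison at the end of `main` reduces to taking the max-AUC entry; a sketch with invented AUCs:

```python
# Hypothetical validation AUCs per ensemble strategy
strategy_aucs = {"simple": 0.612, "weighted": 0.618, "stacking": 0.607}

best_strategy = max(strategy_aucs, key=strategy_aucs.get)
best_auc = strategy_aucs[best_strategy]

# Relative gain of the winner over the runner-up
runner_up = sorted(strategy_aucs.values())[-2]
gain_pct = (best_auc - runner_up) / runner_up * 100
```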
diff --git a/scripts/training/train_spreads_model.py b/scripts/training/train_spreads_model.py
new file mode 100644
index 000000000..02f0ee2d9
--- /dev/null
+++ b/scripts/training/train_spreads_model.py
@@ -0,0 +1,550 @@
+"""Comprehensive spreads model training pipeline.
+
+Trains XGBoost model for NCAA Men's Basketball spreads prediction with:
+- Line movement features from streaming odds data
+- Bayesian hyperparameter optimization (Optuna)
+- Probability calibration for Kelly criterion
+- Walk-forward validation with temporal stability checks
+- SHAP-based feature selection
+- Comprehensive metrics tracking
+
+Usage:
+    python scripts/training/train_spreads_model.py --start-date 2025-12-28 --end-date 2026-02-01
+    python scripts/training/train_spreads_model.py --tune --n-trials 100
+    python scripts/training/train_spreads_model.py --feature-selection --top-k 30
+    python scripts/training/train_spreads_model.py --save-model models/spreads_2026.pkl
+
+Example:
+ # Full pipeline with all features
+    python scripts/training/train_spreads_model.py \\
+ --start-date 2025-12-28 \\
+ --end-date 2026-02-01 \\
+ --tune \\
+ --n-trials 50 \\
+ --feature-selection \\
+ --top-k 30 \\
+ --save-model models/spreads_final.pkl \\
+ --output-dir data/outputs/results/spreads_training
+"""
+
+import logging
+import pickle
+from datetime import date, datetime
+from pathlib import Path
+from typing import Any
+
+import click
+import pandas as pd
+import xgboost as xgb
+
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+from sports_betting_edge.services.feature_selection import SHAPFeatureSelector
+from sports_betting_edge.services.hyperparameter_tuning import (
+ XGBoostHyperparameterTuner,
+)
+from sports_betting_edge.services.model_calibration import (
+ calibrate_model,
+ compare_calibration,
+ evaluate_calibration,
+)
+from sports_betting_edge.services.walk_forward_validation import (
+ WalkForwardValidator,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def load_training_data(
+ engineer: FeatureEngineer,
+ start_date: date,
+ end_date: date,
+ season: int = 2026,
+) -> tuple[pd.DataFrame, pd.Series]:
+ """Load and prepare training data.
+
+ Args:
+ engineer: Feature engineer
+ start_date: Start date for training data
+ end_date: End date for training data
+ season: KenPom season year
+
+ Returns:
+ Tuple of (X, y) features and labels
+ """
+ logger.info(f"Loading training data from {start_date} to {end_date}...")
+
+ # Build dataset using staging layer
+ X, y = engineer.build_spreads_dataset(
+ start_date=start_date,
+ end_date=end_date,
+ season=season,
+ )
+
+ if len(X) == 0:
+ logger.warning("No training data found")
+ return X, y
+
+ logger.info(f"Loaded {len(X)} samples with {len(X.columns)} features")
+ logger.info(f"Cover rate: {y.mean():.2%}")
+
+ return X, y
+
+
+def tune_hyperparameters(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ n_trials: int = 100,
+ output_dir: Path | None = None,
+) -> dict[str, Any]:
+ """Run hyperparameter optimization.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+ X_val: Validation features
+ y_val: Validation labels
+ n_trials: Number of Optuna trials
+ output_dir: Optional directory to save study results
+
+ Returns:
+ Best hyperparameters
+ """
+ logger.info(f"Starting hyperparameter tuning with {n_trials} trials...")
+
+ tuner = XGBoostHyperparameterTuner(
+ X_train=X_train,
+ y_train=y_train,
+ X_val=X_val,
+ y_val=y_val,
+ n_trials=n_trials,
+ study_name="spreads_tuning",
+ )
+
+ best_params = tuner.optimize()
+
+ # Save study if output directory provided
+ if output_dir:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ tuner.save_study(output_dir / "optuna_study.pkl")
+ tuner.plot_optimization_history(output_dir / "optimization_history.html")
+ tuner.plot_param_importances(output_dir / "param_importances.html")
+
+ # Save tuning summary
+ summary_df = tuner.get_tuning_summary()
+ summary_df.to_csv(output_dir / "tuning_summary.csv", index=False)
+ logger.info(f"Saved tuning results to {output_dir}")
+
+ return best_params
+
+
+def train_and_calibrate(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ params: dict[str, Any],
+ use_calibration: bool = True,
+) -> tuple[Any, pd.DataFrame | None]:
+ """Train model and apply calibration.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+ X_val: Validation features
+ y_val: Validation labels
+ params: XGBoost parameters
+ use_calibration: Whether to calibrate probabilities
+
+ Returns:
+ Tuple of (model, calibration_comparison_df)
+ """
+ logger.info("Training XGBoost model...")
+
+ # Add fixed parameters
+ model_params = {
+ **params,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "random_state": 42,
+ "early_stopping_rounds": 20,
+ }
+
+ model = xgb.XGBClassifier(**model_params)
+ model.fit(
+ X_train,
+ y_train,
+ eval_set=[(X_val, y_val)],
+ verbose=False,
+ )
+
+ logger.info(f"Best iteration: {model.best_iteration}")
+
+ if use_calibration:
+ logger.info("Calibrating probabilities...")
+
+ # Get uncalibrated predictions
+ y_val_proba_uncal = model.predict_proba(X_val)[:, 1]
+
+        # Calibrate on the validation split (note: evaluating the calibrated
+        # model on this same split makes the comparison below optimistic)
+        calibrated_model = calibrate_model(model, X_val, y_val, method="isotonic")
+
+ # Get calibrated predictions
+ y_val_proba_cal = calibrated_model.predict_proba(X_val)[:, 1]
+
+ # Compare
+ comparison_df = compare_calibration(y_val, y_val_proba_uncal, y_val_proba_cal)
+
+ logger.info("Calibration complete")
+ return calibrated_model, comparison_df
+
+ return model, None
+
+
+def select_features(
+ model: Any,
+ X_train: pd.DataFrame,
+ X_val: pd.DataFrame,
+ method: str = "importance",
+ top_k: int = 30,
+ output_dir: Path | None = None,
+) -> list[str]:
+ """Perform SHAP-based feature selection.
+
+ Args:
+ model: Trained model
+ X_train: Training features
+ X_val: Validation features
+ method: Selection method ("importance", "cumulative", "correlation")
+ top_k: Number of features to select
+ output_dir: Optional directory to save SHAP results
+
+ Returns:
+ List of selected feature names
+ """
+ logger.info("Running SHAP-based feature selection...")
+
+ selector = SHAPFeatureSelector(
+ model=model,
+ X_train=X_train,
+ X_val=X_val,
+ background_samples=100,
+ )
+
+ # Generate reports
+ if output_dir:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ selector.generate_summary_report(output_dir / "shap_summary.json")
+ selector.plot_feature_importance(
+ top_k=top_k, output_path=output_dir / "feature_importance.png"
+ )
+ selector.plot_shap_summary(output_path=output_dir / "shap_summary.png")
+
+ # Select features
+ selected_features = selector.select_features(
+ method=method,
+ top_k=top_k if method == "importance" else None, # type: ignore[arg-type]
+ )
+
+ logger.info(f"Selected {len(selected_features)} features")
+ return selected_features
+
+
+def run_walk_forward_validation(
+ engineer: FeatureEngineer,
+ params: dict[str, Any],
+ start_date: date,
+ end_date: date,
+ output_dir: Path | None = None,
+) -> tuple[pd.DataFrame, WalkForwardValidator]:
+ """Run walk-forward validation.
+
+ Args:
+ engineer: Feature engineer
+ params: Model parameters
+ start_date: Start date for validation
+ end_date: End date for validation
+ output_dir: Optional directory to save results
+
+ Returns:
+ Tuple of (DataFrame with validation results, WalkForwardValidator instance)
+ """
+ logger.info("Running walk-forward validation...")
+
+ validator = WalkForwardValidator(
+ train_window_days=30,
+ test_window_days=7,
+ step_days=7,
+ window_type="rolling",
+ )
+
+ results_df = validator.validate_spreads(
+ engineer=engineer,
+ model_params=params,
+ start_date=start_date,
+ end_date=end_date,
+ use_calibration=True,
+ )
+
+ if output_dir:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ results_df.to_csv(output_dir / "walkforward_results.csv", index=False)
+ logger.info(f"Saved walk-forward results to {output_dir}")
+
+ return results_df, validator
+
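+# Illustrative note (assumed semantics of WalkForwardValidator, descriptive
+# only): with train_window_days=30, test_window_days=7, step_days=7 and a
+# "rolling" window, folds are expected to look like
+#   fold 0: train [d0, d0+30), test [d0+30, d0+37)
+#   fold 1: train [d0+7, d0+37), test [d0+37, d0+44)
+# each shifted forward by step_days until the test window passes end_date.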
+
+@click.command()
+@click.option(
+ "--start-date",
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ required=True,
+ help="Start date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--end-date",
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ required=True,
+ help="End date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--tune/--no-tune",
+ default=False,
+ help="Run hyperparameter tuning",
+)
+@click.option(
+ "--n-trials",
+ type=int,
+ default=50,
+ help="Number of Optuna trials (default: 50)",
+)
+@click.option(
+ "--feature-selection/--no-feature-selection",
+ default=False,
+ help="Run SHAP feature selection",
+)
+@click.option(
+ "--top-k",
+ type=int,
+ default=30,
+ help="Number of features to select (default: 30)",
+)
+@click.option(
+ "--walkforward/--no-walkforward",
+ default=True,
+ help="Run walk-forward validation (default: True)",
+)
+@click.option(
+ "--save-model",
+ type=click.Path(),
+ help="Path to save final trained model",
+)
+@click.option(
+ "--output-dir",
+ type=click.Path(),
+ default="data/outputs/results/spreads_training",
+ help="Output directory for results",
+)
+@click.option(
+ "--staging-path",
+ type=click.Path(exists=True),
+ default="data/staging",
+ help="Path to staging data directory",
+)
+@click.option(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+)
+def main(
+ start_date: datetime,
+ end_date: datetime,
+ tune: bool,
+ n_trials: int,
+ feature_selection: bool,
+ top_k: int,
+ walkforward: bool,
+ save_model: str | None,
+ output_dir: str,
+ staging_path: str,
+ season: int,
+) -> None:
+ """Comprehensive spreads model training pipeline."""
+ configure_logging()
+
+ logger.info("=== Spreads Model Training Pipeline ===")
+ logger.info(f"Training period: {start_date.date()} to {end_date.date()}")
+
+ output_path = Path(output_dir)
+ output_path.mkdir(parents=True, exist_ok=True)
+
+ # Initialize feature engineer with staging layer
+ engineer = FeatureEngineer(staging_path=staging_path)
+
+ # Load training data
+ X_full, y_full = load_training_data(
+ engineer=engineer,
+ start_date=start_date.date(),
+ end_date=end_date.date(),
+ season=season,
+ )
+
+    if len(X_full) == 0:
+        logger.error("No training data found for the requested date range. Exiting.")
+        return
+
+    # Train/val split (80/20)
+ split_idx = int(len(X_full) * 0.8)
+ X_train = X_full.iloc[:split_idx]
+ y_train = y_full.iloc[:split_idx]
+ X_val = X_full.iloc[split_idx:]
+ y_val = y_full.iloc[split_idx:]
+
+ logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}")
+
+ # Hyperparameter tuning
+ if tune:
+ best_params = tune_hyperparameters(
+ X_train, y_train, X_val, y_val, n_trials, output_path / "tuning"
+ )
+ else:
+ # Use default parameters
+ logger.info("Using default parameters (no tuning)")
+ best_params = {
+ "n_estimators": 200,
+ "max_depth": 6,
+ "learning_rate": 0.1,
+ "min_child_weight": 1,
+ "gamma": 0.0,
+ "reg_alpha": 0.0,
+ "reg_lambda": 1.0,
+ "subsample": 1.0,
+ "colsample_bytree": 1.0,
+ "colsample_bylevel": 1.0,
+ }
+
+ # Feature selection (must run before calibration - SHAP doesn't support calibrated models)
+ if feature_selection:
+ # Train uncalibrated model for feature selection
+ logger.info("Training uncalibrated model for feature selection...")
+ uncal_model, _ = train_and_calibrate(
+ X_train, y_train, X_val, y_val, best_params, use_calibration=False
+ )
+
+ selected_features = select_features(
+ uncal_model,
+ X_train,
+ X_val,
+ method="importance",
+ top_k=top_k,
+ output_dir=output_path / "feature_selection",
+ )
+
+ # Retrain with selected features and calibrate
+ logger.info("Retraining with selected features...")
+ X_train_selected = X_train[selected_features]
+ X_val_selected = X_val[selected_features]
+
+ model, calibration_df = train_and_calibrate(
+ X_train_selected,
+ y_train,
+ X_val_selected,
+ y_val,
+ best_params,
+ use_calibration=True,
+ )
+
+ # Save selected features list
+ with open(output_path / "selected_features.txt", "w") as f:
+ for feature in selected_features:
+ f.write(f"{feature}\n")
+ else:
+ # Train and calibrate with all features
+ model, calibration_df = train_and_calibrate(
+ X_train, y_train, X_val, y_val, best_params, use_calibration=True
+ )
+
+ if calibration_df is not None:
+ calibration_df.to_csv(output_path / "calibration_comparison.csv", index=False)
+
+ # Walk-forward validation
+ if walkforward:
+ wf_results, validator = run_walk_forward_validation(
+ engineer=engineer,
+ params=best_params,
+ start_date=start_date.date(),
+ end_date=end_date.date(),
+ output_dir=output_path / "walkforward",
+ )
+
+ if len(wf_results) > 0:
+ logger.info("\n=== Walk-Forward Validation Summary ===")
+ logger.info(f"Mean Test AUC: {wf_results['test_auc'].mean():.4f}")
+ logger.info(f"Std Test AUC: {wf_results['test_auc'].std():.4f}")
+ else:
+ date_range_days = (end_date.date() - start_date.date()).days
+ min_days = validator.train_window_days + validator.test_window_days + 1
+ logger.warning(
+ "\n=== Walk-Forward Validation Skipped ===\n"
+ "Date range too short for configured window sizes:\n"
+ f" - Date range: {start_date.date()} to {end_date.date()} "
+ f"({date_range_days} days)\n"
+ f" - Required: train_window ({validator.train_window_days} days) + "
+ f"test_window ({validator.test_window_days} days) + "
+ f"1 day gap = {min_days} days minimum\n"
+ "Either extend the date range or reduce window sizes."
+ )
+
+ # Evaluate final model
+ logger.info("\n=== Final Model Evaluation ===")
+ y_val_proba = model.predict_proba(X_val)[:, 1]
+ from sklearn.metrics import log_loss, roc_auc_score
+
+ val_auc = roc_auc_score(y_val, y_val_proba)
+ val_logloss = log_loss(y_val, y_val_proba)
+
+ logger.info(f"Validation AUC: {val_auc:.4f}")
+ logger.info(f"Validation LogLoss: {val_logloss:.4f}")
+
+ # Evaluate calibration
+ cal_metrics = evaluate_calibration(y_val, y_val_proba)
+ logger.info(f"Brier Score: {cal_metrics['brier_score']:.4f}")
+ logger.info(f"ECE: {cal_metrics['expected_calibration_error']:.4f}")
+
+ # Save final model
+ if save_model:
+ model_path = Path(save_model)
+ model_path.parent.mkdir(parents=True, exist_ok=True)
+
+ with open(model_path, "wb") as f:
+ pickle.dump(model, f)
+
+ logger.info(f"Saved final model to {model_path}")
+
+ # Save metadata
+ metadata = {
+ "training_period": {
+ "start": start_date.date().isoformat(),
+ "end": end_date.date().isoformat(),
+ },
+ "n_samples": len(X_full),
+ "n_features": len(X_train.columns),
+ "hyperparameters": best_params,
+ "validation_metrics": {
+ "auc": val_auc,
+ "logloss": val_logloss,
+ "brier_score": cal_metrics["brier_score"],
+ "ece": cal_metrics["expected_calibration_error"],
+ },
+ }
+
+ import json
+
+ with open(model_path.parent / "model_metadata.json", "w") as f:
+ json.dump(metadata, f, indent=2)
+
+ logger.info("\n=== Training Pipeline Complete ===")
+ logger.info(f"Results saved to: {output_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_totals_ensemble.py b/scripts/training/train_totals_ensemble.py
new file mode 100644
index 000000000..ca22b11e6
--- /dev/null
+++ b/scripts/training/train_totals_ensemble.py
@@ -0,0 +1,156 @@
+"""Train ensemble of diverse models for totals prediction.
+
+This script trains an ensemble combining XGBoost, LightGBM, and Random Forest
+models to improve prediction performance through model diversity.
+
+Usage:
+ python scripts/training/train_totals_ensemble.py
+"""
+
+from __future__ import annotations
+
+import logging
+import pickle
+from datetime import date
+from pathlib import Path
+
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.services.ensemble import EnsembleTrainer
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+configure_logging()
+logger = logging.getLogger(__name__)
+
+
+def main() -> None:
+ """Train totals ensemble model."""
+    logger.info("=== Training Totals Ensemble ===\n")
+
+ # Configuration
+ START_DATE = date(2025, 11, 4)
+ END_DATE = date(2026, 2, 5)
+ SEASON = 2026
+ OUTPUT_DIR = Path("data/models")
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+ # Load data
+ logger.info(f"Loading training data from {START_DATE} to {END_DATE}...")
+ engineer = FeatureEngineer(staging_path="data/staging/")
+ X, y = engineer.build_totals_dataset(
+ start_date=START_DATE,
+ end_date=END_DATE,
+ season=SEASON,
+ )
+
+    if len(X) == 0:
+        logger.error("No training data found. Exiting.")
+        return
+
+    logger.info(f"Dataset: {len(X)} samples, {len(X.columns)} features")
+    logger.info(f"Over rate: {y.mean():.2%}\n")
+
+    # Train/val split (random, stratified; note this differs from the temporal
+    # split in train_totals_model.py, so validation games may precede training games)
+    X_train, X_val, y_train, y_val = train_test_split(
+        X, y, test_size=0.2, random_state=42, stratify=y
+    )
+ logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}\n")
+
+ # Model parameters (use optimized parameters from hyperparameter tuning)
+ # XGBoost params from tuning (Trial #46)
+ xgb_params = {
+ "n_estimators": 400,
+ "max_depth": 11,
+ "learning_rate": 0.045,
+ "min_child_weight": 5,
+ "gamma": 4.2,
+ "reg_alpha": 4.3,
+ "reg_lambda": 0.7,
+ "subsample": 0.88,
+ "colsample_bytree": 0.94,
+ "colsample_bylevel": 0.74,
+ }
+
+ lgb_params = {
+ "n_estimators": 400,
+ "max_depth": 11,
+ "learning_rate": 0.045,
+ "num_leaves": 31,
+ "min_child_samples": 20,
+ "reg_alpha": 4.0,
+ "reg_lambda": 1.0,
+ "subsample": 0.88,
+ "colsample_bytree": 0.94,
+ }
+
+ rf_params = {
+ "n_estimators": 400,
+ "max_depth": 15,
+ "min_samples_split": 30,
+ "min_samples_leaf": 15,
+ "max_features": "sqrt",
+ }
+
+ model_params = {
+ "xgboost": xgb_params,
+ "lightgbm": lgb_params,
+ "rf": rf_params,
+ }
+
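+    # Assumed semantics (descriptive summary, not verified against
+    # EnsembleTrainer): "simple" averages base-model probabilities, e.g.
+    #     proba = sum(m.predict_proba(X)[:, 1] for m in base_models) / len(base_models)
+    # "weighted" weights them by validation performance, and "stacking" fits a
+    # meta-learner on the base models' validation predictions.
+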
+ # Train ensembles with different strategies
+ trainer = EnsembleTrainer()
+
+ strategies = ["simple", "weighted", "stacking"]
+ ensembles = {}
+
+ for strategy in strategies:
+        logger.info(f"\n=== Training {strategy.upper()} ensemble ===\n")
+ ensemble = trainer.train(
+ X_train,
+ y_train,
+ X_val,
+ y_val,
+ models=["xgboost", "lightgbm", "rf"],
+ strategy=strategy,
+ model_params=model_params,
+ )
+ ensembles[strategy] = ensemble
+
+ # Find best ensemble
+    logger.info("\n=== Ensemble Comparison ===\n")
+ best_strategy = None
+ best_auc = 0.0
+
+ from sklearn.metrics import roc_auc_score
+
+ for strategy, ensemble in ensembles.items():
+ y_val_pred = ensemble.predict_proba(X_val)
+ auc = roc_auc_score(y_val, y_val_pred)
+ logger.info(f"{strategy.capitalize()} ensemble AUC: {auc:.4f}")
+
+ if auc > best_auc:
+ best_auc = auc
+ best_strategy = strategy
+
+    logger.info(f"\nBest ensemble: {best_strategy} (AUC: {best_auc:.4f})")
+
+ # Save best ensemble
+ output_path = OUTPUT_DIR / "totals_2026_ensemble.pkl"
+ with open(output_path, "wb") as f:
+ pickle.dump(ensembles[best_strategy], f)
+
+    logger.info(f"Saved best ensemble to {output_path}\n")
+
+ # Save all ensembles for comparison
+ for strategy, ensemble in ensembles.items():
+ output_path = OUTPUT_DIR / f"totals_2026_ensemble_{strategy}.pkl"
+ with open(output_path, "wb") as f:
+ pickle.dump(ensemble, f)
+        logger.info(f"Saved {strategy} ensemble to {output_path}")
+
+    logger.info("\n=== Ensemble Training Complete ===")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_totals_model.py b/scripts/training/train_totals_model.py
new file mode 100644
index 000000000..b77781971
--- /dev/null
+++ b/scripts/training/train_totals_model.py
@@ -0,0 +1,559 @@
+"""Comprehensive totals model training pipeline.
+
+Trains XGBoost model for NCAA Men's Basketball totals (over/under) prediction with:
+- Line movement features from streaming odds data
+- Bayesian hyperparameter optimization (Optuna)
+- Probability calibration for Kelly criterion
+- Walk-forward validation with temporal stability checks
+- SHAP-based feature selection
+- Comprehensive metrics tracking
+
+Usage:
+    python scripts/training/train_totals_model.py --start-date 2025-12-28 --end-date 2026-02-01
+    python scripts/training/train_totals_model.py --tune --n-trials 100
+    python scripts/training/train_totals_model.py --feature-selection --top-k 20
+    python scripts/training/train_totals_model.py --save-model models/totals_2026.pkl
+
+Example:
+ # Full pipeline with all features
+    python scripts/training/train_totals_model.py \
+ --start-date 2025-12-28 \
+ --end-date 2026-02-01 \
+ --tune \
+ --n-trials 50 \
+ --feature-selection \
+ --top-k 20 \
+ --save-model models/totals_final.pkl \
+ --output-dir data/outputs/results/totals_training
+"""
+
+import logging
+import pickle
+from datetime import date, datetime
+from pathlib import Path
+from typing import Any
+
+import click
+import pandas as pd
+import xgboost as xgb
+
+from sports_betting_edge.config.logging import configure_logging
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+from sports_betting_edge.services.feature_selection import SHAPFeatureSelector
+from sports_betting_edge.services.hyperparameter_tuning import (
+ XGBoostHyperparameterTuner,
+)
+from sports_betting_edge.services.model_calibration import (
+ calibrate_model,
+ compare_calibration,
+ evaluate_calibration,
+)
+from sports_betting_edge.services.walk_forward_validation import (
+ WalkForwardValidator,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def load_training_data(
+ engineer: FeatureEngineer,
+ start_date: date,
+ end_date: date,
+ season: int = 2026,
+) -> tuple[pd.DataFrame, pd.Series]:
+ """Load and prepare training data.
+
+ Args:
+ engineer: Feature engineer
+ start_date: Start date for training data
+ end_date: End date for training data
+ season: KenPom season year
+
+ Returns:
+ Tuple of (X, y) features and labels
+ """
+ logger.info(f"Loading training data from {start_date} to {end_date}...")
+
+ # Build dataset using staging layer
+ X, y = engineer.build_totals_dataset(
+ start_date=start_date,
+ end_date=end_date,
+ season=season,
+ )
+
+ if len(X) == 0:
+ logger.warning("No training data found")
+ return X, y
+
+ logger.info(f"Loaded {len(X)} samples with {len(X.columns)} features")
+ logger.info(f"Over rate: {y.mean():.2%}")
+
+ return X, y
+
+
+def tune_hyperparameters(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ n_trials: int = 100,
+ output_dir: Path | None = None,
+) -> dict[str, Any]:
+ """Run hyperparameter optimization.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+ X_val: Validation features
+ y_val: Validation labels
+ n_trials: Number of Optuna trials
+ output_dir: Optional directory to save study results
+
+ Returns:
+ Best hyperparameters
+ """
+ logger.info(f"Starting hyperparameter tuning with {n_trials} trials...")
+
+ tuner = XGBoostHyperparameterTuner(
+ X_train=X_train,
+ y_train=y_train,
+ X_val=X_val,
+ y_val=y_val,
+ n_trials=n_trials,
+ study_name="totals_tuning",
+ )
+
+ best_params = tuner.optimize()
+
+ # Save study if output directory provided
+ if output_dir:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ tuner.save_study(output_dir / "optuna_study.pkl")
+ tuner.plot_optimization_history(output_dir / "optimization_history.html")
+ tuner.plot_param_importances(output_dir / "param_importances.html")
+
+ # Save tuning summary
+ summary_df = tuner.get_tuning_summary()
+ summary_df.to_csv(output_dir / "tuning_summary.csv", index=False)
+ logger.info(f"Saved tuning results to {output_dir}")
+
+ return best_params
+
+
+def train_and_calibrate(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+ params: dict[str, Any],
+ use_calibration: bool = True,
+) -> tuple[Any, pd.DataFrame | None]:
+ """Train model and apply calibration.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+ X_val: Validation features
+ y_val: Validation labels
+ params: XGBoost parameters
+ use_calibration: Whether to calibrate probabilities
+
+ Returns:
+ Tuple of (model, calibration_comparison_df)
+ """
+ logger.info("Training XGBoost model...")
+
+ # Add fixed parameters
+ model_params = {
+ **params,
+ "objective": "binary:logistic",
+ "eval_metric": "logloss",
+ "random_state": 42,
+ "early_stopping_rounds": 20,
+ }
+
+ model = xgb.XGBClassifier(**model_params)
+ model.fit(
+ X_train,
+ y_train,
+ eval_set=[(X_val, y_val)],
+ verbose=False,
+ )
+
+ logger.info(f"Best iteration: {model.best_iteration}")
+
+ if use_calibration:
+ logger.info("Calibrating probabilities...")
+
+ # Get uncalibrated predictions
+ y_val_proba_uncal = model.predict_proba(X_val)[:, 1]
+
+ # Calibrate
+ calibrated_model = calibrate_model(model, X_val, y_val, method="isotonic")
+
+ # Get calibrated predictions
+ y_val_proba_cal = calibrated_model.predict_proba(X_val)[:, 1]
+
+ # Compare
+ comparison_df = compare_calibration(y_val, y_val_proba_uncal, y_val_proba_cal)
+
+ logger.info("Calibration complete")
+ return calibrated_model, comparison_df
+
+ return model, None
+
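+
+# Hedged sketch: the module docstring motivates calibration via the Kelly
+# criterion. This helper is illustrative only (not used by this pipeline); it
+# shows why a biased predict_proba translates into systematic mis-staking.
+def kelly_fraction(p_win: float, decimal_odds: float) -> float:
+    """Full-Kelly stake fraction f* = (b*p - q) / b, with b = decimal_odds - 1."""
+    b = decimal_odds - 1.0
+    return max(0.0, (p_win * b - (1.0 - p_win)) / b)
+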
+
+def select_features(
+ model: Any,
+ X_train: pd.DataFrame,
+ X_val: pd.DataFrame,
+ method: str = "importance",
+ top_k: int = 20,
+ output_dir: Path | None = None,
+) -> list[str]:
+ """Perform SHAP-based feature selection.
+
+ Args:
+ model: Trained model
+ X_train: Training features
+ X_val: Validation features
+ method: Selection method ("importance", "cumulative", "correlation")
+ top_k: Number of features to select
+ output_dir: Optional directory to save SHAP results
+
+ Returns:
+ List of selected feature names
+ """
+ logger.info("Running SHAP-based feature selection...")
+
+ selector = SHAPFeatureSelector(
+ model=model,
+ X_train=X_train,
+ X_val=X_val,
+ background_samples=100,
+ )
+
+ # Generate reports
+ if output_dir:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ selector.generate_summary_report(output_dir / "shap_summary.json")
+ selector.plot_feature_importance(
+ top_k=top_k, output_path=output_dir / "feature_importance.png"
+ )
+ selector.plot_shap_summary(output_path=output_dir / "shap_summary.png")
+
+ # Select features
+ selected_features = selector.select_features(
+ method=method,
+ top_k=top_k if method == "importance" else None, # type: ignore[arg-type]
+ )
+
+ logger.info(f"Selected {len(selected_features)} features")
+ return selected_features
+
+
+def run_walk_forward_validation(
+ engineer: FeatureEngineer,
+ params: dict[str, Any],
+ start_date: date,
+ end_date: date,
+ output_dir: Path | None = None,
+) -> tuple[pd.DataFrame, WalkForwardValidator]:
+ """Run walk-forward validation.
+
+ Args:
+ engineer: Feature engineer
+ params: Model parameters
+ start_date: Start date for validation
+ end_date: End date for validation
+ output_dir: Optional directory to save results
+
+ Returns:
+ Tuple of (DataFrame with validation results, WalkForwardValidator instance)
+ """
+ logger.info("Running walk-forward validation...")
+
+ validator = WalkForwardValidator(
+ train_window_days=30,
+ test_window_days=7,
+ step_days=7,
+ window_type="rolling",
+ )
+
+ results_df = validator.validate_totals(
+ engineer=engineer,
+ model_params=params,
+ start_date=start_date,
+ end_date=end_date,
+ use_calibration=True,
+ )
+
+ if output_dir:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ results_df.to_csv(output_dir / "walkforward_results.csv", index=False)
+ logger.info(f"Saved walk-forward results to {output_dir}")
+
+ return results_df, validator
+
+
+@click.command()
+@click.option(
+ "--start-date",
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ required=True,
+ help="Start date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--end-date",
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ required=True,
+ help="End date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--tune/--no-tune",
+ default=False,
+ help="Run hyperparameter tuning",
+)
+@click.option(
+ "--n-trials",
+ type=int,
+ default=50,
+ help="Number of Optuna trials (default: 50)",
+)
+@click.option(
+ "--feature-selection/--no-feature-selection",
+ default=False,
+ help="Run SHAP feature selection",
+)
+@click.option(
+ "--top-k",
+ type=int,
+ default=20,
+ help="Number of features to select (default: 20)",
+)
+@click.option(
+ "--walkforward/--no-walkforward",
+ default=True,
+ help="Run walk-forward validation (default: True)",
+)
+@click.option(
+ "--save-model",
+ type=click.Path(),
+ help="Path to save final trained model",
+)
+@click.option(
+ "--output-dir",
+ type=click.Path(),
+ default="data/outputs/results/totals_training",
+ help="Output directory for results",
+)
+@click.option(
+ "--staging-path",
+ type=click.Path(exists=True),
+ default="data/staging",
+ help="Path to staging data directory",
+)
+@click.option(
+ "--season",
+ type=int,
+ default=2026,
+ help="KenPom season year",
+)
+def main(
+ start_date: datetime,
+ end_date: datetime,
+ tune: bool,
+ n_trials: int,
+ feature_selection: bool,
+ top_k: int,
+ walkforward: bool,
+ save_model: str | None,
+ output_dir: str,
+ staging_path: str,
+ season: int,
+) -> None:
+ """Comprehensive totals model training pipeline."""
+ configure_logging()
+
+ logger.info("=== Totals Model Training Pipeline ===")
+ logger.info(f"Training period: {start_date.date()} to {end_date.date()}")
+
+ output_path = Path(output_dir)
+ output_path.mkdir(parents=True, exist_ok=True)
+
+ # Initialize feature engineer with staging layer
+ engineer = FeatureEngineer(staging_path=staging_path)
+
+ # Load training data
+ X_full, y_full = load_training_data(
+ engineer=engineer,
+ start_date=start_date.date(),
+ end_date=end_date.date(),
+ season=season,
+ )
+
+    if len(X_full) == 0:
+        logger.error("No training data found for the requested date range. Exiting.")
+        return
+
+    # Train/val split (80/20)
+ split_idx = int(len(X_full) * 0.8)
+ X_train = X_full.iloc[:split_idx]
+ y_train = y_full.iloc[:split_idx]
+ X_val = X_full.iloc[split_idx:]
+ y_val = y_full.iloc[split_idx:]
+
+ logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}")
+
+ # Hyperparameter tuning
+ if tune:
+ best_params = tune_hyperparameters(
+ X_train, y_train, X_val, y_val, n_trials, output_path / "tuning"
+ )
+ else:
+ # Use default parameters with regularization
+ logger.info("Using default parameters (no tuning)")
+
+ # Calculate class weight to handle imbalance
+        n_negative = int((y_train == 0).sum())
+        n_positive = int((y_train == 1).sum())
+        scale_pos_weight = n_negative / max(n_positive, 1)  # guard the all-UNDER edge case
+ logger.info(f"Class distribution: {n_positive} OVER, {n_negative} UNDER")
+ logger.info(f"Using scale_pos_weight: {scale_pos_weight:.2f}")
+
+ best_params = {
+ "n_estimators": 200,
+ "max_depth": 3,
+ "learning_rate": 0.05,
+ "min_child_weight": 5,
+ "gamma": 0.5,
+ "reg_alpha": 1.0,
+ "reg_lambda": 2.0,
+ "subsample": 0.8,
+ "colsample_bytree": 0.8,
+ "colsample_bylevel": 1.0,
+ "scale_pos_weight": scale_pos_weight,
+ }
+
+ # Feature selection (must run before calibration - SHAP doesn't support calibrated models)
+ if feature_selection:
+ # Train uncalibrated model for feature selection
+ logger.info("Training uncalibrated model for feature selection...")
+ uncal_model, _ = train_and_calibrate(
+ X_train, y_train, X_val, y_val, best_params, use_calibration=False
+ )
+
+ selected_features = select_features(
+ uncal_model,
+ X_train,
+ X_val,
+ method="importance",
+ top_k=top_k,
+ output_dir=output_path / "feature_selection",
+ )
+
+ # Retrain with selected features and calibrate
+ logger.info("Retraining with selected features...")
+ X_train_selected = X_train[selected_features]
+ X_val_selected = X_val[selected_features]
+
+ model, calibration_df = train_and_calibrate(
+ X_train_selected,
+ y_train,
+ X_val_selected,
+ y_val,
+ best_params,
+ use_calibration=True,
+ )
+
+ # Save selected features list
+ with open(output_path / "selected_features.txt", "w") as f:
+ for feature in selected_features:
+ f.write(f"{feature}\n")
+ else:
+ # Train and calibrate with all features
+ model, calibration_df = train_and_calibrate(
+ X_train, y_train, X_val, y_val, best_params, use_calibration=True
+ )
+
+ if calibration_df is not None:
+ calibration_df.to_csv(output_path / "calibration_comparison.csv", index=False)
+
+ # Walk-forward validation
+ if walkforward:
+ wf_results, validator = run_walk_forward_validation(
+ engineer=engineer,
+ params=best_params,
+ start_date=start_date.date(),
+ end_date=end_date.date(),
+ output_dir=output_path / "walkforward",
+ )
+
+ if len(wf_results) > 0:
+ logger.info("\n=== Walk-Forward Validation Summary ===")
+ logger.info(f"Mean Test AUC: {wf_results['test_auc'].mean():.4f}")
+ logger.info(f"Std Test AUC: {wf_results['test_auc'].std():.4f}")
+ else:
+ date_range_days = (end_date.date() - start_date.date()).days
+ min_days = validator.train_window_days + validator.test_window_days + 1
+ logger.warning(
+ "\n=== Walk-Forward Validation Skipped ===\n"
+ "Date range too short for configured window sizes:\n"
+ f" - Date range: {start_date.date()} to {end_date.date()} "
+ f"({date_range_days} days)\n"
+ f" - Required: train_window ({validator.train_window_days} days) + "
+ f"test_window ({validator.test_window_days} days) + "
+ f"1 day gap = {min_days} days minimum\n"
+ "Either extend the date range or reduce window sizes."
+ )
+
+ # Evaluate final model
+ logger.info("\n=== Final Model Evaluation ===")
+ y_val_proba = model.predict_proba(X_val)[:, 1]
+ from sklearn.metrics import log_loss, roc_auc_score
+
+ val_auc = roc_auc_score(y_val, y_val_proba)
+ val_logloss = log_loss(y_val, y_val_proba)
+
+ logger.info(f"Validation AUC: {val_auc:.4f}")
+ logger.info(f"Validation LogLoss: {val_logloss:.4f}")
+
+ # Evaluate calibration
+ cal_metrics = evaluate_calibration(y_val, y_val_proba)
+ logger.info(f"Brier Score: {cal_metrics['brier_score']:.4f}")
+ logger.info(f"ECE: {cal_metrics['expected_calibration_error']:.4f}")
+
+ # Save final model
+ if save_model:
+ model_path = Path(save_model)
+ model_path.parent.mkdir(parents=True, exist_ok=True)
+
+ with open(model_path, "wb") as f:
+ pickle.dump(model, f)
+
+ logger.info(f"Saved final model to {model_path}")
+
+ # Save metadata
+ metadata = {
+ "training_period": {
+ "start": start_date.date().isoformat(),
+ "end": end_date.date().isoformat(),
+ },
+ "n_samples": len(X_full),
+ "n_features": len(X_train.columns),
+ "hyperparameters": best_params,
+ "validation_metrics": {
+ "auc": val_auc,
+ "logloss": val_logloss,
+ "brier_score": cal_metrics["brier_score"],
+ "ece": cal_metrics["expected_calibration_error"],
+ },
+ }
+
+ import json
+
+ with open(model_path.parent / "model_metadata.json", "w") as f:
+ json.dump(metadata, f, indent=2)
+
+ logger.info("\n=== Training Pipeline Complete ===")
+ logger.info(f"Results saved to: {output_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/training/train_totals_regression.py b/scripts/training/train_totals_regression.py
new file mode 100644
index 000000000..d1a64b840
--- /dev/null
+++ b/scripts/training/train_totals_regression.py
@@ -0,0 +1,369 @@
+#!/usr/bin/env python3
+"""Train residual regression model for totals (over/under) prediction.
+
+Predicts: actual_total - closing_total (the market's error).
+Formula: predicted_total = closing_total + model_residual
+
+This anchors predictions to the market closing line (MAE ~14.1) and
+focuses model capacity on what the market misses: rest, KenPom
+disagreement, and matchup dynamics.
+
+Usage:
+ uv run python scripts/training/train_totals_regression.py \
+ --start-date 2025-12-01 --end-date 2026-02-05
+
+ # With holdout split for backtesting
+ uv run python scripts/training/train_totals_regression.py \
+ --start-date 2025-12-01 --end-date 2026-01-31 \
+ --holdout-start 2026-02-01 --holdout-end 2026-02-05
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import pickle
+from datetime import datetime
+from pathlib import Path
+
+import click
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+from sklearn.model_selection import train_test_split
+
+from sports_betting_edge.services.feature_engineering import FeatureEngineer
+
+logger = logging.getLogger(__name__)
+
+
+def train_residual_model(
+ X_train: pd.DataFrame,
+ y_train: pd.Series,
+ X_val: pd.DataFrame,
+ y_val: pd.Series,
+) -> xgb.XGBRegressor:
+ """Train XGBoost regressor for totals residual.
+
+ Conservative hyperparameters for small dataset (~400 samples).
+
+ Args:
+ X_train: Training features
+ y_train: Training residuals (actual - closing)
+ X_val: Validation features
+ y_val: Validation residuals
+
+ Returns:
+ Trained XGBoost model
+ """
+ params = {
+ "objective": "reg:squarederror",
+ "learning_rate": 0.05,
+ "max_depth": 4,
+ "min_child_weight": 10,
+ "subsample": 0.8,
+ "colsample_bytree": 0.7,
+ "reg_alpha": 2.0,
+ "reg_lambda": 3.0,
+ "n_estimators": 200,
+ "random_state": 42,
+ "n_jobs": -1,
+ }
+
+ model = xgb.XGBRegressor(**params)
+ model.fit(
+ X_train,
+ y_train,
+ eval_set=[(X_val, y_val)],
+ verbose=False,
+ )
+
+ return model
+
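+
+# Illustrative helper (hedged sketch, not wired into this script): applies the
+# framing from the module docstring, predicted_total = closing_total + residual.
+def predict_totals(
+    model: xgb.XGBRegressor,
+    X: pd.DataFrame,
+    closing_totals: pd.Series,
+) -> pd.Series:
+    """Turn predicted residuals back into absolute total predictions."""
+    residuals = model.predict(X)
+    return closing_totals + residuals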
+
+def evaluate_residual_model(
+ model: xgb.XGBRegressor,
+ X: pd.DataFrame,
+ y: pd.Series,
+ closing_totals: pd.Series,
+ actual_totals: pd.Series,
+ label: str,
+) -> dict[str, float]:
+ """Evaluate residual model vs market baseline.
+
+ Args:
+ model: Trained residual model
+ X: Features
+ y: True residuals (actual - closing)
+ closing_totals: Market closing totals
+ actual_totals: Actual game totals
+ label: Label for logging (e.g., "Validation")
+
+ Returns:
+ Dictionary of evaluation metrics
+ """
+ y_pred = model.predict(X)
+
+ # Residual model metrics
+ residual_mae = mean_absolute_error(y, y_pred)
+ residual_rmse = np.sqrt(mean_squared_error(y, y_pred))
+ residual_r2 = r2_score(y, y_pred)
+
+ # Market baseline: residual of 0 (just use closing line)
+ market_mae = mean_absolute_error(actual_totals, closing_totals)
+
+ # Model predicted total
+ predicted_totals = closing_totals + y_pred
+ model_total_mae = mean_absolute_error(actual_totals, predicted_totals)
+
+ # Directional accuracy: does the model predict O/U correctly?
+ # If residual > 0, model predicts over; if < 0, predicts under
+ model_direction = y_pred > 0
+ actual_direction = y > 0
+ directional_accuracy = (model_direction == actual_direction).mean()
+
+ # Std of the true residuals, reused as RESIDUAL_STDDEV for O/U probability calibration
+ residual_std = y.std()
+
+ logger.info(f"\n{'=' * 60}")
+ logger.info(f"{label} Results ({len(y)} games)")
+ logger.info(f"{'=' * 60}")
+ logger.info(f" Residual MAE: {residual_mae:.2f} pts")
+ logger.info(f" Residual RMSE: {residual_rmse:.2f} pts")
+ logger.info(f" Residual R2: {residual_r2:.4f}")
+ logger.info(f" Market Total MAE: {market_mae:.2f} pts (baseline)")
+ logger.info(f" Model Total MAE: {model_total_mae:.2f} pts")
+ improvement = market_mae - model_total_mae
+ logger.info(f" Improvement: {improvement:+.2f} pts vs market")
+ logger.info(f" Directional Acc: {directional_accuracy:.1%}")
+ logger.info(f" Residual Std: {residual_std:.2f} pts")
+
+ return {
+ "residual_mae": residual_mae,
+ "residual_rmse": residual_rmse,
+ "residual_r2": residual_r2,
+ "market_mae": market_mae,
+ "model_total_mae": model_total_mae,
+ "improvement": improvement,
+ "directional_accuracy": directional_accuracy,
+ "residual_std": residual_std,
+ }
+
+
+def log_feature_importance(
+ model: xgb.XGBRegressor,
+ feature_names: list[str],
+ top_n: int = 15,
+) -> None:
+ """Log top feature importances."""
+ importances = model.feature_importances_
+ sorted_idx = np.argsort(importances)[::-1]
+
+ logger.info(f"\nTop {top_n} Feature Importances:")
+ for rank, idx in enumerate(sorted_idx[:top_n], 1):
+ logger.info(f" {rank:2d}. {feature_names[idx]:30s} {importances[idx]:.4f}")
+
+
+@click.command()
+@click.option(
+ "--start-date",
+ required=True,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="Start date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--end-date",
+ required=True,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="End date for training data (YYYY-MM-DD)",
+)
+@click.option(
+ "--holdout-start",
+ default=None,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="Holdout start date for backtesting (YYYY-MM-DD)",
+)
+@click.option(
+ "--holdout-end",
+ default=None,
+ type=click.DateTime(formats=["%Y-%m-%d"]),
+ help="Holdout end date for backtesting (YYYY-MM-DD)",
+)
+@click.option(
+ "--staging-path",
+ default="data/staging",
+ type=click.Path(path_type=Path),
+ help="Path to staging data directory",
+)
+@click.option(
+ "--output-dir",
+ default="models",
+ type=click.Path(path_type=Path),
+ help="Output directory for models",
+)
+def main(
+ start_date: datetime,
+ end_date: datetime,
+ holdout_start: datetime | None,
+ holdout_end: datetime | None,
+ staging_path: Path,
+ output_dir: Path,
+) -> None:
+ """Train totals residual regression model."""
+ logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ )
+
+ logger.info("=" * 80)
+ logger.info("Totals Residual Regression Training")
+ logger.info("=" * 80)
+
+ start_str = start_date.strftime("%Y-%m-%d")
+ end_str = end_date.strftime("%Y-%m-%d")
+
+ # Build training dataset
+ engineer = FeatureEngineer(staging_path=str(staging_path))
+ X, y = engineer.build_totals_residual_dataset(
+ start_date=start_str,
+ end_date=end_str,
+ )
+
+ if len(X) == 0:
+ logger.error("No training data found. Check staging layer.")
+ return
+
+ feature_cols = list(X.columns)
+ logger.info(f"Features: {len(feature_cols)} columns")
+ logger.info(f"Samples: {len(X)} games")
+ logger.info(f"Residual stats: mean={y.mean():.2f}, std={y.std():.2f}")
+
+ # Load closing_total and actual_total for evaluation
+ events = pd.read_parquet(staging_path / "events.parquet")
+ line_features = pd.read_parquet(staging_path / "line_features.parquet")
+
+ # Re-merge to get closing_total aligned with X's index
+ events["game_date"] = pd.to_datetime(events["game_date"])
+ events_filtered = events[
+ (events["game_date"] >= pd.Timestamp(start_str))
+ & (events["game_date"] <= pd.Timestamp(end_str))
+ & events["home_score"].notna()
+ ]
+ merged_meta = events_filtered.merge(line_features, on="event_id", how="inner")
+ merged_meta = merged_meta[merged_meta["closing_total"].notna()]
+
+ # Align indices
+ closing_totals = merged_meta["closing_total"].reset_index(drop=True)
+ actual_totals = (merged_meta["home_score"] + merged_meta["away_score"]).reset_index(drop=True)
+ X = X.reset_index(drop=True)
+ y = y.reset_index(drop=True)
+
+ # Train/val split
+ X_train, X_val, y_train, y_val, ct_train, ct_val, at_train, at_val = train_test_split(
+ X,
+ y,
+ closing_totals,
+ actual_totals,
+ test_size=0.2,
+ random_state=42,
+ )
+
+ logger.info(f"Train: {len(X_train)} games")
+ logger.info(f"Val: {len(X_val)} games")
+
+ # Train model
+ model = train_residual_model(X_train, y_train, X_val, y_val)
+
+ # Evaluate on validation set
+ val_metrics = evaluate_residual_model(model, X_val, y_val, ct_val, at_val, "Validation")
+
+ # Feature importance
+ log_feature_importance(model, feature_cols)
+
+ # Holdout evaluation (if provided)
+ holdout_metrics: dict[str, float] = {}
+ if holdout_start and holdout_end:
+ ho_start_str = holdout_start.strftime("%Y-%m-%d")
+ ho_end_str = holdout_end.strftime("%Y-%m-%d")
+
+ logger.info(f"\nEvaluating holdout: {ho_start_str} to {ho_end_str}")
+
+ X_ho, y_ho = engineer.build_totals_residual_dataset(
+ start_date=ho_start_str,
+ end_date=ho_end_str,
+ )
+
+ if len(X_ho) > 0:
+ # Get closing/actual for holdout
+ ho_events = events[
+ (events["game_date"] >= pd.Timestamp(ho_start_str))
+ & (events["game_date"] <= pd.Timestamp(ho_end_str))
+ & events["home_score"].notna()
+ ]
+ ho_merged = ho_events.merge(line_features, on="event_id", how="inner")
+ ho_merged = ho_merged[ho_merged["closing_total"].notna()]
+
+ ho_closing = ho_merged["closing_total"].reset_index(drop=True)
+ ho_actual = (ho_merged["home_score"] + ho_merged["away_score"]).reset_index(drop=True)
+ X_ho = X_ho.reset_index(drop=True)
+ y_ho = y_ho.reset_index(drop=True)
+
+ holdout_metrics = evaluate_residual_model(
+ model, X_ho, y_ho, ho_closing, ho_actual, "Holdout"
+ )
+ else:
+ logger.warning("No holdout data found")
+
+ # Retrain on all data for production model
+ logger.info("\nRetraining on full dataset for production...")
+ production_model = train_residual_model(
+ X,
+ y,
+ X_val,
+ y_val,  # eval_set is for monitoring only; no early-stopping rounds are configured
+ )
+
+ # Save model and metadata
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ model_path = output_dir / "totals_residual_2026.pkl"
+ with open(model_path, "wb") as f:
+ pickle.dump(production_model, f)
+ logger.info(f"[OK] Saved model: {model_path}")
+
+ # Save feature names
+ features_path = output_dir / "totals_residual_features.txt"
+ features_path.write_text("\n".join(feature_cols))
+ logger.info(f"[OK] Saved features: {features_path}")
+
+ # Save model metadata
+ residual_std = float(y.std())
+ metadata = {
+ "model_type": "totals_residual_regression",
+ "training_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ "date_range": f"{start_str} to {end_str}",
+ "samples": len(X),
+ "features": len(feature_cols),
+ "residual_mean": float(y.mean()),
+ "residual_std": residual_std,
+ # float() casts avoid numpy scalars, which json.dump cannot serialize
+ "validation_metrics": {k: round(float(v), 4) for k, v in val_metrics.items()},
+ }
+ if holdout_metrics:
+ metadata["holdout_metrics"] = {k: round(float(v), 4) for k, v in holdout_metrics.items()}
+
+ metadata_path = output_dir / "totals_residual_metadata.json"
+ with open(metadata_path, "w") as f:
+ json.dump(metadata, f, indent=2)
+ logger.info(f"[OK] Saved metadata: {metadata_path}")
+
+ logger.info("\n" + "=" * 80)
+ logger.info("Training Complete")
+ logger.info(f" RESIDUAL_STDDEV = {residual_std:.2f}")
+ logger.info(" Use: predicted_total = closing_total + model.predict(X)")
+ logger.info(" Use: over_prob = norm.cdf(predicted_residual / RESIDUAL_STDDEV)")
+ logger.info("=" * 80)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/chrome-devtools/SKILL.md b/skills/chrome-devtools/SKILL.md
index 551c03be8..ea8af6433 100644
--- a/skills/chrome-devtools/SKILL.md
+++ b/skills/chrome-devtools/SKILL.md
@@ -1,44 +1,128 @@
---
name: chrome-devtools
-description: Uses Chrome DevTools via MCP for efficient debugging, troubleshooting and browser automation. Use when debugging web pages, automating browser interactions, analyzing performance, or inspecting network requests.
+description: Uses Chrome DevTools via MCP for debugging, browser automation, web scraping with Puppeteer, WebSocket inspection, performance analysis, and network debugging. Use when automating browsers, scraping pages, intercepting or analyzing WebSocket traffic, debugging web apps, or recording performance traces.
---
-## Core Concepts
+## Project overview
+
+**Chrome DevTools MCP** is an MCP server that controls a live Chrome instance via Puppeteer and Chrome DevTools Protocol (CDP). It gives AI assistants tools for navigation, interaction, snapshots, network inspection (including WebSocket), console messages, performance traces, and script evaluation. The server starts Chrome on first tool use; you can also connect to an existing Chrome with `--browser-url` or `--wsEndpoint`.
+
+- **Docs**: [Tool reference](../../docs/tool-reference.md), [Design principles](../../docs/design-principles.md), [Troubleshooting](../../docs/troubleshooting.md)
+- **Puppeteer**: Used under the hood for browser launch, pages, and CDP; you interact via MCP tools, not Puppeteer API directly.
+
+## Core concepts
**Browser lifecycle**: Browser starts automatically on first tool call using a persistent Chrome profile. Configure via CLI args in the MCP server configuration: `npx chrome-devtools-mcp@latest --help`.
**Page selection**: Tools operate on the currently selected page. Use `list_pages` to see available pages, then `select_page` to switch context.
-**Element interaction**: Use `take_snapshot` to get page structure with element `uid`s. Each element has a unique `uid` for interaction. If an element isn't found, take a fresh snapshot - the element may have been removed or the page changed.
+**Element interaction**: Use `take_snapshot` to get page structure with element `uid`s. Each element has a unique `uid` for interaction. If an element isn't found, take a fresh snapshot—the element may have been removed or the page changed.
+
+## Discovering scraping opportunities (Network-first)
+
+**Best first step**: Use the **Network** tools to see *how* the page gets its data. Many sites load content via **XHR** or **fetch**; if you can find the underlying API and its response is JSON (or otherwise structured), you can often call that API directly instead of parsing HTML.
+
+1. **Load and trigger**: `navigate_page` to the target URL; if needed, interact (click, search) so the data you want is loaded.
+2. **List API requests**: `list_network_requests` with **resourceTypes: `['xhr', 'fetch']`** to see only API-style requests. Scan for URLs that look like data endpoints (e.g. `api/`, `graphql`, query params).
+3. **Inspect one**: `get_network_request(reqid)` for a promising request. Check **response body**—if it’s JSON with the data you need, prefer **reusing that API** (same URL, method, headers, body) over DOM scraping.
+4. **Decide**: Structured response → use the API. No usable API, or one locked behind auth you cannot reproduce → scrape the DOM: snapshot + `evaluate_script` (see “Workflow: Web scraping” below).
+
+Full step-by-step and “API vs DOM” guidance: [network-for-scraping-discovery.md](./network-for-scraping-discovery.md).
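
Once `get_network_request` has shown you the endpoint's URL, method, and headers, you can replay it outside the browser. A minimal Python sketch; the endpoint and header values below are placeholders, to be replaced with what the inspected request actually used:

```python
import json
import urllib.request


def build_request(url: str, headers: dict[str, str]) -> urllib.request.Request:
    """Build a GET request mirroring one discovered with get_network_request."""
    return urllib.request.Request(url, headers=headers)


def fetch_json(req: urllib.request.Request) -> dict:
    """Execute the request and parse the JSON body."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder values; copy the real URL and headers from get_network_request.
req = build_request(
    "https://example.com/api/items?page=1",
    {"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
)
# data = fetch_json(req)  # run once the real endpoint is filled in
```

If the API needs cookies or auth headers, copy those from the inspected request too; missing them is the usual reason a replayed call returns HTML or a 401.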
+
+## Workflow: Web scraping (Puppeteer-style)
+
+Use the MCP tools as a “Puppeteer-like” scraping pipeline: navigate, wait, snapshot, then extract data with snapshots and/or page scripts.
+
+1. **Navigate**: `navigate_page` (type=url, url=…) or `new_page` (url=…).
+2. **Wait**: `wait_for` (text=…) to wait for specific content, or use a short delay and then snapshot.
+3. **Snapshot**: `take_snapshot` to get the accessibility tree and element `uid`s. Prefer snapshot over screenshot for automation (faster, text-based).
+4. **Extract**:
+ - **From tree**: Use the snapshot text and structure; click/fill by `uid` if you need to open modals or paginate.
+ - **From page**: Use `evaluate_script` to run JavaScript in the page and return JSON-serializable data (e.g. `() => document.querySelectorAll('h2').length`, or extract table rows, meta tags, or any DOM/data).
+5. **Pagination or multi-page**: Use `click` on “next” (by `uid`), then wait + snapshot again, or `navigate_page` to new URLs and repeat.
+
+**Tips**: Use `filePath` on `take_snapshot` for large pages. For data not in the a11y tree (e.g. attributes, computed styles), use `evaluate_script`. Iframes are not in the snapshot—only the main frame is represented.
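
The `evaluate_script` extraction in step 4 typically returns row-shaped JSON; pairing it with a header downstream gives you records. A tiny sketch (the field names and values are made up for illustration):

```python
def rows_to_records(header: list[str], rows: list[list[str]]) -> list[dict[str, str]]:
    """Pair each scraped row (e.g. returned by evaluate_script) with column names."""
    return [dict(zip(header, row)) for row in rows]


# Illustrative data only; real headers/rows come from your page script.
records = rows_to_records(
    ["team", "total"],
    [["DAL", "47.5"], ["NYG", "41.0"]],
)
# records[0] == {"team": "DAL", "total": "47.5"}
```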
+
+## Workflow: WebSocket inspection / “interception”
+
+The server does not inject code into the page; it uses DevTools network data. You can **list and inspect** WebSocket (and other) requests.
+
+1. **Navigate** to the page that opens WebSockets: `navigate_page` (url=…).
+2. **Trigger** the WebSocket (use the app or wait for it to connect).
+3. **List requests**: `list_network_requests` with `resourceTypes: ['websocket']` to see only WebSocket requests. Omit `resourceTypes` to see all (document, xhr, fetch, websocket, etc.).
+4. **Inspect one request**: `get_network_request` with the `reqid` from the list. Use `requestFilePath` / `responseFilePath` to save bodies to files (useful for large or binary payloads).
-## Workflow Patterns
+**Resource types** (for `list_network_requests`) include: `document`, `stylesheet`, `image`, `media`, `font`, `script`, `xhr`, `fetch`, `eventsource`, **`websocket`**, `manifest`, `ping`, `preflight`, `other`, etc. Use these to filter by kind of traffic.
-### Before interacting with a page
+**Note**: You see WebSocket *requests* (URL, timing, headers). Live message-by-message capture is limited to what DevTools exposes in the network list and request/response details.
+
+## General workflow (before interacting)
1. Navigate: `navigate_page` or `new_page`
-2. Wait: `wait_for` to ensure content is loaded if you know what you look for.
-3. Snapshot: `take_snapshot` to understand page structure
-4. Interact: Use element `uid`s from snapshot for `click`, `fill`, etc.
+2. Wait: `wait_for` if you know what text to wait for
+3. Snapshot: `take_snapshot` to get structure and `uid`s
+4. Interact: Use `uid`s from snapshot for `click`, `fill`, `fill_form`, `hover`, `drag`, `press_key`, `upload_file`
+5. Dialogs: Use `handle_dialog` (accept/dismiss, optional `promptText`) when alerts/confirms appear
+
+## Tool selection quick reference
+
+| Goal | Preferred tool(s) |
+|------|-------------------|
+| Automation / scraping | `take_snapshot`, `click`, `fill`, `evaluate_script` |
+| Visual check | `take_screenshot` (optionally `fullPage`, `uid` for element) |
+| Data not in a11y tree | `evaluate_script` (must return JSON-serializable values) |
+| List HTTP/WebSocket requests | `list_network_requests` (optional `resourceTypes: ['websocket']`) |
+| Inspect one request/response | `get_network_request` (reqid, optional file paths for bodies) |
+| Console errors/warnings | `list_console_messages` (optional `types`), `get_console_message` |
+| Performance / CWV | `performance_start_trace`, `performance_stop_trace`, `performance_analyze_insight` (includes CrUX field data) |
+| Emulation | `emulate` (viewport, userAgent, networkConditions, etc.), `resize_page` |
+
+See [Tool reference](../../docs/tool-reference.md) for full parameters.
+
+## Formatters (internal)
+
+Tool responses are shaped by internal formatters. You don’t call them directly; they affect what the agent sees:
+
+- **SnapshotFormatter**: Turns the a11y tree into the text snapshot with `uid`s and optional “selected in DevTools” hint. Use `verbose: true` on `take_snapshot` for more detail.
+- **NetworkFormatter**: Formats request/response (URL, status, headers, body). Large bodies can be truncated or written to `requestFilePath`/`responseFilePath`.
+- **ConsoleFormatter**: Formats console messages (level, text, stack, resolved arguments when detailed data is requested). Error objects logged via `console.log(new Error(...))` now include the full message, source-mapped stack trace (1-based line/column), and `Error.cause` chain (shown as nested "Caused by:" sections).
+- **IssueFormatter**: Formats DevTools "issues" (e.g. deprecations, violations) when included in responses.
-### Efficient data retrieval
+A full **Network & Console breakdown** (data flow, collectors, filter options, what you see in responses) is in [network-and-console-breakdown.md](./network-and-console-breakdown.md).
-- Use `filePath` parameter for large outputs (screenshots, snapshots, traces)
-- Use pagination (`pageIdx`, `pageSize`) and filtering (`types`) to minimize data
-- Set `includeSnapshot: false` on input actions unless you need updated page state
+## Performance traces and CrUX field data
-### Tool selection
+Performance traces now include **CrUX (Chrome User Experience Report)** real-user field metrics alongside lab data:
-- **Automation/interaction**: `take_snapshot` (text-based, faster, better for automation)
-- **Visual inspection**: `take_screenshot` (when user needs to see visual state)
-- **Additional details**: `evaluate_script` for data not in accessibility tree
+- **Metrics shown**: LCP (with breakdown: TTFB, load delay, load duration, render delay), INP, CLS
+- **Scope**: Data may be for the specific URL or the entire origin, indicated in the output
+- **Privacy**: URLs from traces are sent to Google's CrUX API to fetch field data
+- **Disable**: Start the server with `--no-performance-crux` to opt out of CrUX data
-### Parallel execution
+## Error debugging improvements
-You can send multiple tool calls in parallel, but maintain correct order: navigate → wait → snapshot → interact.
+Stack traces for uncaught errors and `console.log(Error)` are now **source-mapped** (showing original file paths and 1-based line/column numbers instead of minified bundles). Error objects also display their full **Error.cause** chain as nested "Caused by:" sections with their own stack traces.
+
+## Telemetry
+
+Google collects usage statistics (e.g. tool invocation success, latency, environment) to improve the server. Collection is **on by default**.
+
+- **Opt-out**: Start the server with `--no-usage-statistics` (or set `CHROME_DEVTOOLS_MCP_NO_USAGE_STATISTICS` or `CI`).
+- **Config example**: `"args": ["-y", "chrome-devtools-mcp@latest", "--no-usage-statistics"]`
+- Data is handled per [Google Privacy Policy](https://policies.google.com/privacy); independent of Chrome browser metrics.
+
+## Efficient usage
+
+- Use `filePath` for large outputs (screenshots, snapshots, traces, request/response bodies).
+- Use pagination (`pageIdx`, `pageSize`) and filters (`resourceTypes`, `types` for console) to limit data.
+- Set `includeSnapshot: false` on input actions (click, fill, etc.) unless you need an updated snapshot in the same response.
+- Run independent tool calls in parallel when order allows (e.g. multiple `get_network_request` by reqid); keep sequence for navigate → wait → snapshot → interact.
## Troubleshooting
-If `chrome-devtools-mcp` is insufficient, guide users to use Chrome DevTools UI:
+If the MCP tools are insufficient, suggest using Chrome DevTools directly:
- https://developer.chrome.com/docs/devtools
- https://developer.chrome.com/docs/devtools/ai-assistance
+
+For connection issues, headless vs headed, or remote debugging, see [Troubleshooting](../../docs/troubleshooting.md).
diff --git a/skills/chrome-devtools/network-and-console-breakdown.md b/skills/chrome-devtools/network-and-console-breakdown.md
new file mode 100644
index 000000000..a4ae5408a
--- /dev/null
+++ b/skills/chrome-devtools/network-and-console-breakdown.md
@@ -0,0 +1,177 @@
+# Network & Console in Chrome DevTools MCP – Breakdown
+
+This document explains how the **Network** and **Console** panels in Chrome DevTools map to this MCP server: data sources, tools, formatters, and what you see in responses.
+
+---
+
+## 1. Network (DevTools → MCP)
+
+### What DevTools exposes
+
+The Chrome DevTools **Network** panel shows every request made by the page: document, XHR/fetch, scripts, styles, images, **WebSocket**, etc. Each request has URL, method, status, timing, headers, and request/response bodies. The MCP server uses the same underlying data (Puppeteer’s `HTTPRequest` from the `request` event).
+
+### Data flow in this project
+
+```
+Page "request" event (Puppeteer)
+ → NetworkCollector (PageCollector)
+ → stored per page, per navigation (last 3 navigations)
+ → McpContext.getNetworkRequests() / getNetworkRequestById()
+ → McpResponse: NetworkFormatter (summary or detailed)
+ → tool response text + structuredContent
+```
+
+- **NetworkCollector** (`PageCollector.ts`): Subscribes to each page’s `request` event. On main-frame navigation it splits storage so “current” requests are the ones since the last navigation. Preserves up to 3 navigations when `includePreservedRequests` is true.
+- **Stable ID**: Each request gets a numeric `reqid` (stable for the session) so you can refer to it in `get_network_request(reqid)`.
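
The collector behavior above (stable ids, per-navigation buckets, last-3 retention) can be modeled in a few lines of Python. This is an illustrative sketch of the bookkeeping, not the server's actual TypeScript implementation:

```python
class RequestStore:
    """Toy model of the collector: stable ids, storage split per navigation."""

    MAX_NAVIGATIONS = 3  # only the last 3 navigations are preserved

    def __init__(self) -> None:
        self._next_id = 1
        self._navigations: list[list[tuple[int, str]]] = [[]]  # current is last

    def on_request(self, url: str) -> int:
        reqid = self._next_id  # stable for the whole session
        self._next_id += 1
        self._navigations[-1].append((reqid, url))
        return reqid

    def on_navigation(self) -> None:
        # Main-frame navigation: start a new bucket, drop buckets beyond the limit.
        self._navigations.append([])
        del self._navigations[: -self.MAX_NAVIGATIONS]

    def list_requests(self, include_preserved: bool = False) -> list[tuple[int, str]]:
        if include_preserved:
            return [r for nav in self._navigations for r in nav]
        return list(self._navigations[-1])
```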
+
+### MCP tools
+
+| Tool | Purpose |
+|------|--------|
+| **list_network_requests** | List requests for the selected page (current navigation, or last 3 if `includePreservedRequests: true`). Optional: `pageSize`, `pageIdx`, **`resourceTypes`**, `includePreservedRequests`. |
+| **get_network_request** | Get one request by `reqid` (or the request currently selected in DevTools UI if no reqid). Optional: `requestFilePath`, `responseFilePath` to save bodies to files. |
+
+### Resource types (filtering)
+
+Use **`resourceTypes`** in `list_network_requests` to filter. Allowed values (same as DevTools):
+
+| Type | Typical use |
+|------|-------------|
+| `document` | Main document / navigations |
+| `stylesheet` | CSS |
+| `script` | JS |
+| `image`, `media`, `font` | Assets |
+| `xhr`, `fetch` | API / fetch() |
+| **`websocket`** | WebSocket connections |
+| `eventsource` | Server-Sent Events |
+| `manifest`, `ping`, `preflight`, `other`, etc. | Other |
+
+Example: only WebSockets → `resourceTypes: ['websocket']`.
+
+### What you see in the response
+
+- **List (summary)**: For each request, one line like:
+ `reqid=<id> <method> <url> [success - 200]` (or `[failed - ...]`, `[pending]`). Optionally `[selected in the DevTools Network panel]` if it matches the DevTools selection.
+- **Single request (detailed)** from `get_network_request`: Request URL, status, request headers, request body (or “saved to …”), response headers, response body (or “saved to …”), failure text if any, redirect chain. Large bodies are truncated in-line or written to the path you passed.
+
+### Formatter (internal)
+
+**NetworkFormatter** (`formatters/NetworkFormatter.ts`):
+
+- **Summary**: `toString()` → one line (reqid, method, URL, status).
+- **Detailed**: `toStringDetailed()` → full headers and bodies (or file path when saved). Used when you call `get_network_request`.
+- **Bodies**: Truncated to 10k chars in-line; use `requestFilePath` / `responseFilePath` for large or binary data.
+
+---
+
+## 2. Console (DevTools → MCP)
+
+### What DevTools exposes
+
+The Chrome DevTools **Console** panel shows:
+
+- **Console messages**: `console.log`, `console.error`, `console.warn`, etc., plus type (log, debug, info, error, warn, dir, table, trace, …).
+- **Uncaught exceptions**: Runtime errors (from CDP `Runtime.exceptionThrown`).
+- **Issues**: Aggregated DevTools “issues” (e.g. deprecations, violations) from the Audits/Issues system (CDP `Audits.issueAdded`).
+
+The MCP server collects all of these and exposes them as a single list with a stable numeric **msgid** per entry.
+
+### Data flow in this project
+
+```
+Page / CDP events
+ → ConsoleCollector (PageCollector)
+ → PageEventSubscriber: console, uncaughtError, issue
+ → stored per page, per navigation (last 3 navigations)
+ → McpContext.getConsoleData() / getConsoleMessageById()
+ → McpResponse: ConsoleFormatter or IssueFormatter
+ → tool response text + structuredContent
+```
+
+- **ConsoleCollector** (`PageCollector.ts`): Extends `PageCollector`; each page gets a **PageEventSubscriber** that:
+ - Listens to the page’s **console** event (Puppeteer) for `console.*` messages.
+ - Listens to CDP **Runtime.exceptionThrown** and emits **uncaughtError** (wrapped as `Error`-like).
+ - Enables **Audits** and listens to **Audits.issueAdded**, then uses DevTools’ **IssueAggregator** to emit **issue** (AggregatedIssue).
+- Storage is split on main-frame navigation; you can ask for messages from the last 3 navigations with `includePreservedMessages: true`.
+
+### MCP tools
+
+| Tool | Purpose |
+|------|--------|
+| **list_console_messages** | List console messages (and issues/uncaught errors) for the selected page. Optional: `pageSize`, `pageIdx`, **`types`**, `includePreservedMessages`. |
+| **get_console_message** | Get one message by **msgid** with full detail (resolved arguments, stack trace for console messages; issue description and affected resources for issues). |
+
+### Message types (filtering)
+
+Use **`types`** in `list_console_messages` to filter. Allowed values:
+
+- **Console**: `log`, `debug`, `info`, `error`, `warn`, `dir`, `dirxml`, `table`, `trace`, `clear`, `startGroup`, `startGroupCollapsed`, `endGroup`, `assert`, `profile`, `profileEnd`, `count`, `timeEnd`, `verbose`.
+- **Special**: `issue` (DevTools aggregated issues), and uncaught errors are treated as type `error`.
+
+Example: only errors and issues → `types: ['error', 'issue']`.
+
+### What you see in the response
+
+- **List (summary)**: For each message, one line:
+ `msgid=<id> [<type>] <text> (N args)`
+ For issues: `msgid=<id> [issue] <title> (count: N)`.
+- **Single message (detailed)** from `get_console_message`:
+ - **Console message**: ID, type, message text, **Arguments** (resolved values), **Stack trace** (with file:line when available).
+ - **Issue**: ID, description (markdown), “Learn more” links, **Affected resources** (e.g. request reqid, element uid).
+ - **Uncaught error**: ID, message, **source-mapped stack trace** (1-based line/column), and **Error.cause chain** (nested "Caused by:" sections, each with its own message and stack).
+
+### Error object formatting (v0.16.0+)
+
+When an Error object is logged via `console.log(new Error(...))` or thrown as an uncaught exception, the ConsoleFormatter now extracts:
+
+- **Message**: The error message string.
+- **Source-mapped stack trace**: Stack frames are resolved through source maps, showing original file paths and 1-based line/column numbers instead of minified/compiled references.
+- **Error.cause chain**: If the error has a `.cause`, it's shown as a "Caused by:" section with its own message and stack. Chains are followed recursively.
+
+Example detailed output for an uncaught error with a cause chain:
+```
+Message: error> Uncaught Error: foo failed
+### Stack trace
+at Iife (main.js:18:11)
+at (main.js:14:1)
+Caused by: Error: bar failed
+at foo (main.js:10:11)
+...
+Caused by: Error: b00m!
+at bar (main.js:3:9)
+...
+```
+
+### Formatters (internal)
+
+- **ConsoleFormatter** (`formatters/ConsoleFormatter.ts`):
+ - **Summary**: `toString()` → `msgid=X [type] text (N args)`.
+ - **Detailed**: `toStringDetailed()` → ID, message, Arguments, Stack trace. For detailed, it can resolve `args` via `jsonValue()` and resolve stack via DevTools (when available). Error-subtype arguments are expanded with message, stack, and cause chain.
+- **IssueFormatter** (`formatters/IssueFormatter.ts`):
+ - **Summary**: `toString()` → `msgid=X [issue] title (count: N)`.
+ - **Detailed**: `toStringDetailed()` → ID, description, links, affected resources (request id, element uid, etc.).
+
+---
+
+## 3. Side-by-side summary
+
+| Aspect | Network | Console |
+|--------|---------|--------|
+| **DevTools panel** | Network | Console (+ Issues) |
+| **Data source** | Page `request` (Puppeteer) | Page `console` + CDP `Runtime.exceptionThrown` + CDP `Audits.issueAdded` |
+| **Collector** | NetworkCollector (PageCollector<HTTPRequest>) | ConsoleCollector (PageCollector<ConsoleMessage \| Error \| AggregatedIssue>) |
+| **List tool** | list_network_requests | list_console_messages |
+| **Get-one tool** | get_network_request(reqid) | get_console_message(msgid) |
+| **Filter param** | resourceTypes (e.g. websocket, xhr, fetch) | types (e.g. error, log, issue) |
+| **Stable ID** | reqid (number) | msgid (number) |
+| **Preserved data** | includePreservedRequests (last 3 navs) | includePreservedMessages (last 3 navs) |
+| **Formatter** | NetworkFormatter (summary vs detailed; body truncation / file) | ConsoleFormatter, IssueFormatter (summary vs detailed; args + stack) |
+
+---
+
+## 4. Practical usage
+
+- **Network**: Navigate → trigger traffic → `list_network_requests` (optionally `resourceTypes: ['websocket']` or `['xhr','fetch']`) → use a `reqid` in `get_network_request` for headers/bodies; use `requestFilePath`/`responseFilePath` for large payloads.
+- **Console**: After load or action → `list_console_messages` (optionally `types: ['error','issue']`) → use a `msgid` in `get_console_message` for full stack and resolved arguments.
+
+Both use the same **selected page** and support **pagination** (`pageSize`, `pageIdx`) in the list tools.
diff --git a/skills/chrome-devtools/network-for-scraping-discovery.md b/skills/chrome-devtools/network-for-scraping-discovery.md
new file mode 100644
index 000000000..b994c9a63
--- /dev/null
+++ b/skills/chrome-devtools/network-for-scraping-discovery.md
@@ -0,0 +1,98 @@
+# Using the Network Panel to Discover Web Scraping Opportunities
+
+The **Network** tools in Chrome DevTools MCP are often the best way to find *how* a site gets its data—and whether you should scrape the DOM or use the same API the page uses. This guide focuses on that discovery workflow.
+
+---
+
+## Why start with Network?
+
+Many modern sites don’t put the data you want in the initial HTML. They:
+
+- Load the shell (HTML/JS/CSS), then
+- Call **XHR** or **fetch** APIs to get JSON (or other structured data) and render it in the DOM.
+
+If you only look at the DOM, you’re scraping what the front end already fetched and rendered. If you **discover the underlying request** (URL, method, headers, query/payload), you can often:
+
+- Get **structured data** (e.g. JSON) instead of parsing HTML.
+- Reuse the **exact same API** the page uses (sometimes with pagination, filters, or search params).
+- Reduce brittleness when the site changes its layout but not its API.
+
+So the best first step for “how do I scrape this?” is often: **inspect Network, focus on XHR/fetch, then decide API vs DOM.**
+
+---
+
+## Discovery workflow (Network-first)
+
+### 1. Load the page and trigger the data you care about
+
+- **Navigate**: `navigate_page` (url = the page that shows the data).
+- **Trigger**: If the data appears only after a click, search, or scroll, do that (e.g. `click` on “Load more”, type in search, open a tab). The goal is to make the page issue the requests that deliver the content you want.
+
+### 2. List requests, focus on XHR and fetch
+
+- Call **list_network_requests** with:
+ - **resourceTypes: `['xhr', 'fetch']`**
+ This filters out documents, scripts, images, etc., and shows the API-style requests.
+- Scan the list: each line is `reqid=<id> <method> <url> [status]`. Look for:
+ - URLs that look like APIs (e.g. contain `api`, `graphql`, `search`, `list`, query params).
+ - POST/GET to domains you care about.
+ - Status `[success - 200]` (or 201, etc.) so the response is likely useful.
+
+### 3. Inspect promising requests
+
+- Pick a **reqid** from the list and call **get_network_request** with that `reqid`.
+- Check:
+ - **Request**: URL (full path + query string), method, headers (e.g. `Authorization`, `Content-Type`), request body (for POST/PUT). You’ll need these if you later call the API yourself (e.g. from a script).
+ - **Response**: Response body. If it’s **JSON** (or another structured format) with the data you need, that’s a strong signal that **using the API** may be better than scraping the DOM.
+- For large responses, use **responseFilePath** (and **requestFilePath** if needed) so the full body is written to a file instead of truncated in the tool output.
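
Once a response body is saved (or pasted into a script), a quick shape summary helps you decide whether it has what you need. A minimal sketch in plain Node; the file path in the usage comment is hypothetical:

```javascript
// Summarize the top-level shape of a parsed JSON body so you can
// quickly judge whether it carries the data you need.
function summarizeShape(value) {
  if (Array.isArray(value)) return `array(${value.length})`;
  if (value !== null && typeof value === "object") {
    return Object.keys(value).map((k) => `${k}: ${typeof value[k]}`);
  }
  return typeof value;
}

// Typical use with a body saved via responseFilePath (path hypothetical):
// const body = JSON.parse(require("node:fs").readFileSync("out/response.json", "utf8"));
// console.log(summarizeShape(body));
```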
+
+### 4. Decide: API vs DOM scraping
+
+| If the response… | Prefer |
+|------------------|--------|
+| Is JSON (or structured) and contains the data you need | **API**: replay the request (same URL, method, headers, body). You can do that from your own code or, for exploration, via `evaluate_script` + `fetch()` in the page context. |
+| Is HTML or the “data” is only in the rendered page | **DOM**: use `take_snapshot` + `evaluate_script` to extract from the document. |
+| Is mixed (e.g. some data in API, some only in DOM) | Combine: use API where possible, DOM for the rest. |
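
For the API route, one way to keep the replay faithful is to build the `fetch()` options directly from the captured request details. A sketch, assuming you recorded `method`, `headers`, and `body` from the `get_network_request` output (those field names are this sketch's convention, not a fixed schema):

```javascript
// Build fetch() options from captured request details so the same
// call can be replayed, e.g. from evaluate_script in the page context.
function buildFetchInit({ method, headers, body }) {
  const init = { method, headers };
  // GET/HEAD requests must not carry a body.
  if (body !== undefined && method !== "GET" && method !== "HEAD") {
    init.body = typeof body === "string" ? body : JSON.stringify(body);
  }
  return init;
}
```

Run in the page context, `fetch(url, buildFetchInit(details))` sends the page's cookies automatically, which is often what makes the replay work.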
+
+### 5. Document what you found
+
+- Note: **URL**, **method**, **important headers** (e.g. auth, content-type), **query params** or **body** that affect the result (e.g. page number, search term).
+- If you see pagination (e.g. `?page=2` or `offset=20`), you’ve found a way to get more data without clicking “Next” in the UI.
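
Once a pagination parameter is documented, a loop like the following can collect every page. `fetchPage` is injected so the same loop works with plain `fetch()` or through `evaluate_script`; the page-number scheme is an example, not a claim about any particular API:

```javascript
// Walk a paginated endpoint until a page comes back empty.
// maxPages is a safety cap against endless pagination.
async function collectAllPages(fetchPage, maxPages = 50) {
  const items = [];
  for (let page = 1; page <= maxPages; page++) {
    const batch = await fetchPage(page);
    if (!batch || batch.length === 0) break; // empty page: we're done
    items.push(...batch);
  }
  return items;
}
```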
+
+---
+
+## Network tools quick reference (for discovery)
+
+| Goal | Tool / params |
+|------|----------------|
+| See only API-like requests | **list_network_requests** with `resourceTypes: ['xhr', 'fetch']` |
+| See WebSocket connections | **list_network_requests** with `resourceTypes: ['websocket']` |
+| See everything (no filter) | **list_network_requests** without `resourceTypes` |
+| Inspect one request (headers + body) | **get_network_request** with `reqid` from the list |
+| Save large response to file | **get_network_request** with `reqid`, `responseFilePath: 'path/to/file.json'` (and optionally `requestFilePath` for the request body) |
+| Requests from last few navigations | **list_network_requests** with `includePreservedRequests: true` |
+
+---
+
+## When to prefer API over DOM
+
+- **Structured data**: The response is JSON/XML with clear fields (e.g. list of items, each with id, name, price). Easier to parse than HTML.
+- **Pagination / filters**: The API accepts query params or body fields for page, limit, sort, search. One request per page instead of simulating clicks.
+- **Less layout dependency**: Site redesigns often keep the same API; DOM selectors break when the markup changes.
+- **Rate and volume**: You can throttle and retry at the HTTP level; no need to render the page for every batch.
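
As a sketch of that HTTP-level control, a retry wrapper with exponential backoff (the attempt count and timings are arbitrary defaults; `sleep` is injectable so the wrapper can be tested without waiting):

```javascript
// Retry an async request function with exponential backoff; this is
// the kind of control you gain by calling the API directly instead of
// re-rendering the page for every attempt.
async function withRetry(fn, { attempts = 3, baseMs = 250, sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((r) => setTimeout(r, ms)));
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) await wait(baseMs * 2 ** i); // 250ms, 500ms, ...
    }
  }
  throw lastErr;
}
```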
+
+## When to prefer DOM over API
+
+- **No usable API**: The data is only in the HTML (server-rendered), or the API is heavily protected/undocumented.
+- **Auth / cookies**: The data loads only when the user is logged in and the page sets cookies; replaying the request from outside the browser may require copying cookies or using the browser context (e.g. `evaluate_script` + `fetch()` in the page).
+- **Anti-bot**: The site checks browser behavior; using the real page (navigate, click, snapshot) may be necessary.
+
+---
+
+## Summary
+
+- **Best for discovering scraping opportunities**: Use **Network** first—**list_network_requests** with `resourceTypes: ['xhr', 'fetch']` to find API calls, then **get_network_request(reqid)** to inspect request/response.
+- If the response is structured and has the data you need, **prefer using that API** (same URL, method, headers, body) instead of scraping the DOM.
+- If not, or if you need to stay in the browser for auth/behavior, use the usual **DOM scraping** flow: navigate → wait → snapshot → `evaluate_script` (and optionally Network to verify what the page requested).
+
+For full Network tool and formatter details, see [network-and-console-breakdown.md](./network-and-console-breakdown.md). For the general scraping workflow (DOM + script), see [SKILL.md](./SKILL.md).
diff --git a/skills/chrome-devtools/reference.md b/skills/chrome-devtools/reference.md
new file mode 100644
index 000000000..0f0a0fbae
--- /dev/null
+++ b/skills/chrome-devtools/reference.md
@@ -0,0 +1,66 @@
+# Chrome DevTools MCP – Quick reference
+
+Use this alongside [SKILL.md](./SKILL.md). Full parameter details: [docs/tool-reference.md](../../docs/tool-reference.md).
+**Network & Console deep dive**: [network-and-console-breakdown.md](./network-and-console-breakdown.md).
+**Using Network to discover scraping opportunities**: [network-for-scraping-discovery.md](./network-for-scraping-discovery.md).
+
+## Tools by category
+
+### Input automation (8)
+- **click** – Click element by `uid` (optional `dblClick`)
+- **drag** – `from_uid`, `to_uid`
+- **fill** – Type into input/textarea or select option by `uid`, `value`
+- **fill_form** – Fill multiple elements at once (`elements`: `[{uid, value}, …]`)
+- **handle_dialog** – accept / dismiss; optional `promptText`
+- **hover** – Hover by `uid`
+- **press_key** – e.g. `"Enter"`, `"Control+A"`
+- **upload_file** – `uid` (file input), `filePath` (local path)
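
As a sketch, the arguments for a `fill_form` call bundle every field into one `elements` array (the `uid`s and values here are hypothetical):

```javascript
// Hypothetical fill_form arguments: fill two fields in one call.
// uids come from a prior take_snapshot; the values are made up.
const fillFormArgs = {
  elements: [
    { uid: "1_12", value: "jane@example.com" },
    { uid: "1_13", value: "Jane Doe" },
  ],
};
```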
+
+### Navigation (6)
+- **list_pages** – List open pages
+- **select_page** – `pageId`, optional `bringToFront`
+- **navigate_page** – `type`: url | back | forward | reload; `url` when type=url; optional `timeout`, `ignoreCache`, `initScript`, `handleBeforeUnload`
+- **new_page** – `url`; optional `background`, `timeout`
+- **close_page** – `pageId` (cannot close last page)
+- **wait_for** – `text` to appear; optional `timeout`
+
+### Emulation (2)
+- **emulate** – Optional: `viewport`, `userAgent`, `colorScheme`, `geolocation`, `networkConditions`, `cpuThrottlingRate` (set to null to clear)
+- **resize_page** – `width`, `height`
+
+### Performance (3)
+- **performance_start_trace** – `reload`, `autoStop`; optional `filePath` for trace file
+- **performance_stop_trace** – Optional `filePath`. Now includes **CrUX field data** (LCP with breakdown, INP, CLS) from real users alongside lab metrics. Disable with `--no-performance-crux`.
+- **performance_analyze_insight** – `insightSetId`, `insightName` (from trace results)
+
+### Network (2)
+- **list_network_requests** – Optional: `pageSize`, `pageIdx`, `resourceTypes`, `includePreservedRequests`
+ - **resourceTypes** (array): e.g. `['websocket']`, `['xhr','fetch']`, `['document']` — or omit for all. Values: document, stylesheet, image, media, font, script, texttrack, xhr, fetch, prefetch, eventsource, **websocket**, manifest, signedexchange, ping, cspviolationreport, preflight, fedcm, other
+- **get_network_request** – Optional `reqid` (else uses DevTools selection); optional `requestFilePath`, `responseFilePath` to save bodies
+
+### Debugging (5)
+- **take_snapshot** – Optional `verbose`, `filePath`. Returns a11y tree with element `uid`s.
+- **take_screenshot** – Optional `uid`, `fullPage`, `format`, `quality`, `filePath`
+- **evaluate_script** – `function` (JS function as string), optional `args` (array of `{uid}`). Return value must be JSON-serializable.
+- **list_console_messages** – Optional `pageSize`, `pageIdx`, `types`, `includePreservedMessages`
+- **get_console_message** – `msgid`. Error objects show source-mapped stacks (1-based line/column) and Error.cause chains.
+
+## Formatters (internal)
+
+| Formatter | Used for | Notes |
+|-----------|----------|--------|
+| SnapshotFormatter | `take_snapshot` output | Text snapshot with `uid`s; `verbose` adds more a11y data |
+| NetworkFormatter | `list_network_requests`, `get_network_request` | URL, status, headers, body (truncated or saved to file) |
+| ConsoleFormatter | `list_console_messages`, `get_console_message` | Level, text, source-mapped stack, resolved args, Error.cause chains |
+| IssueFormatter | DevTools issues in responses | Deprecations, violations, etc. |
+
+## Telemetry
+
+- **Default**: Usage statistics enabled (tool success, latency, environment).
+- **Disable**: pass `--no-usage-statistics`, or set the `CHROME_DEVTOOLS_MCP_NO_USAGE_STATISTICS` or `CI` environment variable.
+
+## Common patterns
+
+- **Scraping**: navigate → wait_for → take_snapshot → evaluate_script (or parse snapshot) → repeat for next page.
+- **WebSocket inspection**: navigate → trigger WS → list_network_requests(resourceTypes: ['websocket']) → get_network_request(reqid).
+- **Form + submit**: take_snapshot → fill or fill_form (by uid) → click submit button (by uid) → wait_for or take_snapshot.
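
For the "parse snapshot" step, lines like `uid=7_84 StaticText "USC"` can be split into structured fields. A minimal sketch; the pattern is inferred from sample snapshots and ignores trailing attributes such as `url=` or `value=`:

```javascript
// Parse one take_snapshot line into { uid, role, text }. The quoted
// name is optional (e.g. `uid=7_2 navigation` has none).
function parseSnapshotLine(line) {
  const m = line.trim().match(/^uid=(\S+)\s+(\S+)(?:\s+"([^"]*)")?/);
  if (m === null) return null;
  return { uid: m[1], role: m[2], text: m[3] ?? null };
}
```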
diff --git a/snapshot-odds.txt b/snapshot-odds.txt
new file mode 100644
index 000000000..a1471525f
--- /dev/null
+++ b/snapshot-odds.txt
@@ -0,0 +1,1127 @@
+uid=7_0 RootWebArea "2026 College Basketball Odds, Spreads & Betting Lines" url="https://www.actionnetwork.com/ncaab/odds?date=20260207"
+ uid=7_1 banner
+ uid=7_2 navigation
+ uid=7_3 link "The Action Network" url="https://www.actionnetwork.com/"
+ uid=7_4 StaticText "Sports"
+ uid=7_5 StaticText "Odds"
+ uid=7_6 StaticText "Picks"
+ uid=7_7 StaticText "Tools"
+ uid=7_8 StaticText "PRO"
+ uid=7_9 StaticText "Sports Betting"
+ uid=7_10 StaticText "Prediction Markets"
+ uid=7_11 StaticText "Casinos"
+ uid=7_12 StaticText "Resources"
+ uid=7_13 button
+ uid=7_14 link url="https://action.onelink.me/qhpb/a141f9c4"
+ uid=7_15 button "Get App"
+ uid=7_16 link "Get $60 Off PRO" url="https://www.actionnetwork.com/pricing?intcmp=NavBarLoggedIn&etf=default"
+ uid=7_17 StaticText "Get $60 Off PRO"
+ uid=7_18 button "S" haspopup="menu"
+ uid=7_19 link "Game Odds" url="https://www.actionnetwork.com/ncaab/odds"
+ uid=7_20 StaticText "Game Odds"
+ uid=7_21 link "Player Props" url="https://www.actionnetwork.com/ncaab/props"
+ uid=7_22 StaticText "Player Props"
+ uid=7_23 link "Futures" url="https://www.actionnetwork.com/ncaab/futures"
+ uid=7_24 StaticText "Futures"
+ uid=7_25 region "Notifications Alt+T"
+ uid=7_26 main
+ uid=7_27 link "promotion logo Get Free $20+100% Deposit Match up to $100! Must be 18+ (19+ or 21+ depending on state of residence) and within applicable state. Not available in NJ. Full T&Cs apply. ACTION Promo Code" url="https://switchboard.actionnetwork.com/offers?affiliateId=502&campaignId=6405&stateCode=CA&context=web-homepage-header&propertyId=1"
+ uid=7_28 image "promotion logo" url="https://assets.bet-links.com/900x900/666661_Group530655.webp"
+ uid=7_29 StaticText "Get Free $20+100% Deposit Match up to $100!"
+ uid=7_30 StaticText "Must be 18+ (19+ or 21+ depending on state of residence) and within applicable state. Not available in NJ. Full T&Cs apply."
+ uid=7_31 StaticText "ACTION"
+ uid=7_32 StaticText "Promo Code"
+ uid=7_33 heading "NCAAB Odds & Betting Lines" level="1"
+ uid=7_34 combobox expandable haspopup="menu" value="NCAAB"
+ uid=7_35 option "NFL" selectable value="NFL"
+ uid=7_36 option "NCAAF" selectable value="NCAAF"
+ uid=7_37 option "NBA" selectable value="NBA"
+ uid=7_38 option "NCAAB" selectable selected value="NCAAB"
+ uid=7_39 option "NCAAW" selectable value="NCAAW"
+ uid=7_40 option "NHL" selectable value="NHL"
+ uid=7_41 option "MLB" selectable value="MLB"
+ uid=7_42 option "SOCCER" selectable value="SOCCER"
+ uid=7_43 option "WNBA" selectable value="WNBA"
+ uid=7_44 option "UFC" selectable value="UFC"
+ uid=7_45 option "NASCAR" selectable value="NASCAR"
+ uid=7_46 option "ATP" selectable value="ATP"
+ uid=7_47 option "WTA" selectable value="WTA"
+ uid=7_48 image "Right Arrow"
+ uid=7_49 combobox expandable haspopup="menu" value="Spread"
+ uid=7_50 option "Spread" selectable selected value="Spread"
+ uid=7_51 option "Total" selectable value="Total"
+ uid=7_52 option "Moneyline" selectable value="Moneyline"
+ uid=7_53 option "All Markets" selectable value="All Markets"
+ uid=7_54 image "Right Arrow"
+ uid=7_55 button "Previous Date" focusable focused
+ uid=7_56 StaticText "Sat Feb 07"
+ uid=7_57 button "Next Date"
+ uid=7_58 StaticText "Odds Settings"
+ uid=7_59 StaticText "SCHEDULED"
+ uid=7_60 StaticText "OPEN"
+ uid=7_61 StaticText "BEST ODDS"
+ uid=7_62 link "CONSENSUS" description="Consensus Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=0&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_63 StaticText "CONSENSUS"
+ uid=7_64 link "BetRivers NJ logo" description="BetRivers NJ Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=106&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_65 image "BetRivers NJ logo" url="https://assets.actionnetwork.com/112x28/115895_BetRivers384x96.webp"
+ uid=7_66 link "CRIS" description="CRIS Sportsbook Odds and Betting Lines" url="https://www.actionnetwork.com/ncaab/odds?date=20260207"
+ uid=7_67 StaticText "CRIS"
+ uid=7_68 link "CIRCA" description="Circa Sportsbook Odds and Betting Lines" url="https://www.actionnetwork.com/ncaab/odds?date=20260207"
+ uid=7_69 StaticText "CIRCA"
+ uid=7_70 link "Pinnacle logo" description="Pinnacle Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=1657&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_71 image "Pinnacle logo" url="https://assets.actionnetwork.com/112x28/666462_Primary.webp"
+ uid=7_72 link "SPORTBET" description="Sportbet Sportsbook Odds and Betting Lines" url="https://www.actionnetwork.com/ncaab/odds?date=20260207"
+ uid=7_73 StaticText "SPORTBET"
+ uid=7_74 link "bet365 NJ logo" description="bet365 NJ Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=174&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_75 image "bet365 NJ logo" url="https://assets.actionnetwork.com/112x28/676387_Bet365@1x.webp"
+ uid=7_76 link "Fanatics NJ logo" description="Fanatics NJ Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=1228&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_77 image "Fanatics NJ logo" url="https://assets.actionnetwork.com/112x28/573837_FanaticsSportsbook20ALT.webp"
+ uid=7_78 link "BetMGM NJ logo" description="BetMGM NJ Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=1&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_79 image "BetMGM NJ logo" url="https://assets.actionnetwork.com/112x28/779359_BetMGM800x200@1x.webp"
+ uid=7_80 link "Caesars NJ logo" description="Caesars NJ Sportsbook Odds and Betting Lines" url="https://switchboard.actionnetwork.com/offers?affiliateId=8&stateCode=CA&context=web-compareodds-banner&dynamic=true&deviceId=733ee47e-9153-47c5-9dcd-4ed9674559ee&userId=4081950&consentIds=C0001,C0003,C0004,C0002&denyIds=C0005"
+ uid=7_81 image "Caesars NJ logo" url="https://assets.actionnetwork.com/112x28/256064_caesars_800x200.webp"
+ uid=7_82 link "USC Team Icon USC 843 Penn State Team Icon Penn State 844" url="https://www.actionnetwork.com/ncaab-game/usc-penn-state-score-odds-february-8-2026/275066"
+ uid=7_83 image "USC Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/usc.png"
+ uid=7_84 StaticText "USC"
+ uid=7_85 StaticText "843"
+ uid=7_86 image "Penn State Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/psu.png"
+ uid=7_87 StaticText "Penn State"
+ uid=7_88 StaticText "844"
+ uid=7_89 StaticText "-3.5"
+ uid=7_90 StaticText "-110"
+ uid=7_91 StaticText "+3.5"
+ uid=7_92 StaticText "-110"
+ uid=7_93 StaticText "-3.5"
+ uid=7_94 StaticText "-110"
+ uid=7_95 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_96 StaticText "+3.5"
+ uid=7_97 StaticText "-105"
+ uid=7_98 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_99 StaticText "-3.5"
+ uid=7_100 StaticText "-110"
+ uid=7_101 StaticText "+3.5"
+ uid=7_102 StaticText "-110"
+ uid=7_103 StaticText "-3.5"
+ uid=7_104 StaticText "-114"
+ uid=7_105 StaticText "+3.5"
+ uid=7_106 StaticText "-112"
+ uid=7_107 StaticText "N/A"
+ uid=7_108 StaticText "N/A"
+ uid=7_109 StaticText "N/A"
+ uid=7_110 StaticText "N/A"
+ uid=7_111 StaticText "N/A"
+ uid=7_112 StaticText "N/A"
+ uid=7_113 StaticText "N/A"
+ uid=7_114 StaticText "N/A"
+ uid=7_115 StaticText "-3.5"
+ uid=7_116 StaticText "-115"
+ uid=7_117 StaticText "+3.5"
+ uid=7_118 StaticText "-105"
+ uid=7_119 StaticText "N/A"
+ uid=7_120 StaticText "N/A"
+ uid=7_121 StaticText "-3.5"
+ uid=7_122 StaticText "-110"
+ uid=7_123 StaticText "+3.5"
+ uid=7_124 StaticText "-110"
+ uid=7_125 StaticText "-3.5"
+ uid=7_126 StaticText "-110"
+ uid=7_127 StaticText "+3.5"
+ uid=7_128 StaticText "-110"
+ uid=7_129 StaticText "9:00 AM"
+ uid=7_130 link "Tulsa Team Icon Tulsa 841 S. Florida Team Icon S. Florida 842" url="https://www.actionnetwork.com/ncaab-game/tulsa-south-florida-score-odds-february-8-2026/276020"
+ uid=7_131 image "Tulsa Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/tsa.png"
+ uid=7_132 StaticText "Tulsa"
+ uid=7_133 StaticText "841"
+ uid=7_134 image "S. Florida Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/usfn.png"
+ uid=7_135 StaticText "S. Florida"
+ uid=7_136 StaticText "842"
+ uid=7_137 StaticText "+3.5"
+ uid=7_138 StaticText "-110"
+ uid=7_139 StaticText "-3.5"
+ uid=7_140 StaticText "-110"
+ uid=7_141 StaticText "+4"
+ uid=7_142 StaticText "-110"
+ uid=7_143 image "Caesars NJ Logo" url="https://assets.actionnetwork.com/32x32/463646_Caesars.webp"
+ uid=7_144 StaticText "-3.5"
+ uid=7_145 StaticText "-110"
+ uid=7_146 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_147 StaticText "+3.5"
+ uid=7_148 StaticText "-110"
+ uid=7_149 StaticText "-3.5"
+ uid=7_150 StaticText "-110"
+ uid=7_151 StaticText "+3.5"
+ uid=7_152 StaticText "-108"
+ uid=7_153 StaticText "-3.5"
+ uid=7_154 StaticText "-118"
+ uid=7_155 StaticText "N/A"
+ uid=7_156 StaticText "N/A"
+ uid=7_157 StaticText "N/A"
+ uid=7_158 StaticText "N/A"
+ uid=7_159 StaticText "N/A"
+ uid=7_160 StaticText "N/A"
+ uid=7_161 StaticText "N/A"
+ uid=7_162 StaticText "N/A"
+ uid=7_163 StaticText "+3.5"
+ uid=7_164 StaticText "-105"
+ uid=7_165 StaticText "-3.5"
+ uid=7_166 StaticText "-115"
+ uid=7_167 StaticText "N/A"
+ uid=7_168 StaticText "N/A"
+ uid=7_169 StaticText "+3.5"
+ uid=7_170 StaticText "-110"
+ uid=7_171 StaticText "-3.5"
+ uid=7_172 StaticText "-110"
+ uid=7_173 StaticText "+4"
+ uid=7_174 StaticText "-110"
+ uid=7_175 StaticText "-4"
+ uid=7_176 StaticText "-110"
+ uid=7_177 StaticText "9:00 AM"
+ uid=7_178 StaticText "Arizona State vs. Colorado Game Highlights | 2025-26 Big 12 Men's Basketball"
+ uid=7_179 slider "Progress Bar" orientation="horizontal" value="0" valuemax="100" valuemin="0" valuetext=""
+ uid=7_180 button "Play" description="Play" disableable disabled
+ uid=7_181 button "Unmute" description="Unmute"
+ uid=7_182 slider "Volume Bar" orientation="horizontal" value="25" valuemax="100" valuemin="0" valuetext=""
+ uid=7_183 button "-04:58"
+ uid=7_184 button "Quality Selector" description="Quality Selector"
+ uid=7_185 button "Subtitles" description="Subtitles"
+ uid=7_186 button "Enter Fullscreen" description="Enter Fullscreen"
+ uid=7_187 link "UNC Greensboro Team Icon UNC Greensboro 849 Furman Team Icon Furman 850" url="https://www.actionnetwork.com/ncaab-game/unc-greensboro-furman-score-odds-february-8-2026/262751"
+ uid=7_188 image "UNC Greensboro Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/uncg.png"
+ uid=7_189 StaticText "UNC Greensboro"
+ uid=7_190 StaticText "849"
+ uid=7_191 image "Furman Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/fur.png"
+ uid=7_192 StaticText "Furman"
+ uid=7_193 StaticText "850"
+ uid=7_194 StaticText "+12.5"
+ uid=7_195 StaticText "-105"
+ uid=7_196 StaticText "-12.5"
+ uid=7_197 StaticText "-115"
+ uid=7_198 StaticText "+12.5"
+ uid=7_199 StaticText "-105"
+ uid=7_200 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_201 StaticText "-12.5"
+ uid=7_202 StaticText "-110"
+ uid=7_203 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_204 StaticText "+12.5"
+ uid=7_205 StaticText "-105"
+ uid=7_206 StaticText "-12.5"
+ uid=7_207 StaticText "-114"
+ uid=7_208 StaticText "+12.5"
+ uid=7_209 StaticText "-113"
+ uid=7_210 StaticText "-12.5"
+ uid=7_211 StaticText "-113"
+ uid=7_212 StaticText "N/A"
+ uid=7_213 StaticText "N/A"
+ uid=7_214 StaticText "N/A"
+ uid=7_215 StaticText "N/A"
+ uid=7_216 StaticText "N/A"
+ uid=7_217 StaticText "N/A"
+ uid=7_218 StaticText "N/A"
+ uid=7_219 StaticText "N/A"
+ uid=7_220 StaticText "+12.5"
+ uid=7_221 StaticText "-105"
+ uid=7_222 StaticText "-12.5"
+ uid=7_223 StaticText "-115"
+ uid=7_224 StaticText "N/A"
+ uid=7_225 StaticText "N/A"
+ uid=7_226 StaticText "+12.5"
+ uid=7_227 StaticText "-110"
+ uid=7_228 StaticText "-12.5"
+ uid=7_229 StaticText "-110"
+ uid=7_230 StaticText "+12.5"
+ uid=7_231 StaticText "-110"
+ uid=7_232 StaticText "-12.5"
+ uid=7_233 StaticText "-110"
+ uid=7_234 StaticText "10:00 AM"
+ uid=7_235 link "Michigan Team Icon Michigan 847 Ohio State Team Icon Ohio State 848" url="https://www.actionnetwork.com/ncaab-game/michigan-ohio-state-score-odds-february-8-2026/275064"
+ uid=7_236 image "Michigan Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/mich.png"
+ uid=7_237 StaticText "Michigan"
+ uid=7_238 StaticText "847"
+ uid=7_239 image "Ohio State Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/osu.png"
+ uid=7_240 StaticText "Ohio State"
+ uid=7_241 StaticText "848"
+ uid=7_242 StaticText "-9.5"
+ uid=7_243 StaticText "-105"
+ uid=7_244 StaticText "+9.5"
+ uid=7_245 StaticText "-115"
+ uid=7_246 StaticText "-9.5"
+ uid=7_247 StaticText "-105"
+ uid=7_248 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_249 StaticText "+9.5"
+ uid=7_250 StaticText "-110"
+ uid=7_251 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_252 StaticText "-9.5"
+ uid=7_253 StaticText "-108"
+ uid=7_254 StaticText "+9.5"
+ uid=7_255 StaticText "-112"
+ uid=7_256 StaticText "-9.5"
+ uid=7_257 StaticText "-113"
+ uid=7_258 StaticText "+9.5"
+ uid=7_259 StaticText "-113"
+ uid=7_260 StaticText "N/A"
+ uid=7_261 StaticText "N/A"
+ uid=7_262 StaticText "N/A"
+ uid=7_263 StaticText "N/A"
+ uid=7_264 StaticText "N/A"
+ uid=7_265 StaticText "N/A"
+ uid=7_266 StaticText "N/A"
+ uid=7_267 StaticText "N/A"
+ uid=7_268 StaticText "-9.5"
+ uid=7_269 StaticText "-105"
+ uid=7_270 StaticText "+9.5"
+ uid=7_271 StaticText "-115"
+ uid=7_272 StaticText "N/A"
+ uid=7_273 StaticText "N/A"
+ uid=7_274 StaticText "-9.5"
+ uid=7_275 StaticText "-110"
+ uid=7_276 StaticText "+9.5"
+ uid=7_277 StaticText "-110"
+ uid=7_278 StaticText "-9.5"
+ uid=7_279 StaticText "-110"
+ uid=7_280 StaticText "+9.5"
+ uid=7_281 StaticText "-110"
+ uid=7_282 StaticText "10:00 AM"
+ uid=7_283 link "Texas Tech Team Icon Texas Tech 845 West Virginia Team Icon West Virginia 846" url="https://www.actionnetwork.com/ncaab-game/texas-tech-west-virginia-score-odds-february-8-2026/275276"
+ uid=7_284 image "Texas Tech Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/tt.png"
+ uid=7_285 StaticText "Texas Tech"
+ uid=7_286 StaticText "845"
+ uid=7_287 image "West Virginia Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/wvu.png"
+ uid=7_288 StaticText "West Virginia"
+ uid=7_289 StaticText "846"
+ uid=7_290 StaticText "-4.5"
+ uid=7_291 StaticText "-110"
+ uid=7_292 StaticText "+4.5"
+ uid=7_293 StaticText "-110"
+ uid=7_294 StaticText "-4.5"
+ uid=7_295 StaticText "-110"
+ uid=7_296 image "Caesars NJ Logo" url="https://assets.actionnetwork.com/32x32/463646_Caesars.webp"
+ uid=7_297 StaticText "+4.5"
+ uid=7_298 StaticText "-105"
+ uid=7_299 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_300 StaticText "-4.5"
+ uid=7_301 StaticText "-115"
+ uid=7_302 StaticText "+4.5"
+ uid=7_303 StaticText "-105"
+ uid=7_304 StaticText "-4.5"
+ uid=7_305 StaticText "-118"
+ uid=7_306 StaticText "+4.5"
+ uid=7_307 StaticText "-108"
+ uid=7_308 StaticText "N/A"
+ uid=7_309 StaticText "N/A"
+ uid=7_310 StaticText "N/A"
+ uid=7_311 StaticText "N/A"
+ uid=7_312 StaticText "N/A"
+ uid=7_313 StaticText "N/A"
+ uid=7_314 StaticText "N/A"
+ uid=7_315 StaticText "N/A"
+ uid=7_316 StaticText "-4.5"
+ uid=7_317 StaticText "-115"
+ uid=7_318 StaticText "+4.5"
+ uid=7_319 StaticText "-105"
+ uid=7_320 StaticText "N/A"
+ uid=7_321 StaticText "N/A"
+ uid=7_322 StaticText "-4.5"
+ uid=7_323 StaticText "-115"
+ uid=7_324 StaticText "+4.5"
+ uid=7_325 StaticText "-105"
+ uid=7_326 StaticText "-4.5"
+ uid=7_327 StaticText "-110"
+ uid=7_328 StaticText "+4.5"
+ uid=7_329 StaticText "-110"
+ uid=7_330 StaticText "10:00 AM"
+ uid=7_331 link "Maryland Team Icon Maryland 853 Minnesota Team Icon Minnesota 854" url="https://www.actionnetwork.com/ncaab-game/maryland-minnesota-score-odds-february-8-2026/275063"
+ uid=7_332 image "Maryland Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/mar.png"
+ uid=7_333 StaticText "Maryland"
+ uid=7_334 StaticText "853"
+ uid=7_335 image "Minnesota Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/min.png"
+ uid=7_336 StaticText "Minnesota"
+ uid=7_337 StaticText "854"
+ uid=7_338 StaticText "+8.5"
+ uid=7_339 StaticText "-102"
+ uid=7_340 StaticText "-8.5"
+ uid=7_341 StaticText "-120"
+ uid=7_342 StaticText "+8.5"
+ uid=7_343 StaticText "-105"
+ uid=7_344 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_345 StaticText "-8.5"
+ uid=7_346 StaticText "-110"
+ uid=7_347 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_348 StaticText "+8.5"
+ uid=7_349 StaticText "-108"
+ uid=7_350 StaticText "-8.5"
+ uid=7_351 StaticText "-112"
+ uid=7_352 StaticText "+8.5"
+ uid=7_353 StaticText "-109"
+ uid=7_354 StaticText "-8.5"
+ uid=7_355 StaticText "-115"
+ uid=7_356 StaticText "N/A"
+ uid=7_357 StaticText "N/A"
+ uid=7_358 StaticText "N/A"
+ uid=7_359 StaticText "N/A"
+ uid=7_360 StaticText "N/A"
+ uid=7_361 StaticText "N/A"
+ uid=7_362 StaticText "N/A"
+ uid=7_363 StaticText "N/A"
+ uid=7_364 StaticText "+8.5"
+ uid=7_365 StaticText "-105"
+ uid=7_366 StaticText "-8.5"
+ uid=7_367 StaticText "-115"
+ uid=7_368 StaticText "N/A"
+ uid=7_369 StaticText "N/A"
+ uid=7_370 StaticText "+8.5"
+ uid=7_371 StaticText "-110"
+ uid=7_372 StaticText "-8.5"
+ uid=7_373 StaticText "-110"
+ uid=7_374 StaticText "+8.5"
+ uid=7_375 StaticText "-110"
+ uid=7_376 StaticText "-8.5"
+ uid=7_377 StaticText "-110"
+ uid=7_378 StaticText "11:00 AM"
+ uid=7_379 link "UCF Team Icon UCF 857 Cincinnati Team Icon Cincinnati 858" url="https://www.actionnetwork.com/ncaab-game/ucf-cincinnati-score-odds-february-8-2026/275275"
+ uid=7_380 image "UCF Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/ucf.png"
+ uid=7_381 StaticText "UCF"
+ uid=7_382 StaticText "857"
+ uid=7_383 image "Cincinnati Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/cind.png"
+ uid=7_384 StaticText "Cincinnati"
+ uid=7_385 StaticText "858"
+ uid=7_386 StaticText "+3.5"
+ uid=7_387 StaticText "-110"
+ uid=7_388 StaticText "-3.5"
+ uid=7_389 StaticText "-110"
+ uid=7_390 StaticText "+3.5"
+ uid=7_391 StaticText "-107"
+ uid=7_392 image "BetRivers NJ Logo" url="https://assets.actionnetwork.com/32x32/341144_BetRiver48x48light@2x.webp"
+ uid=7_393 StaticText "-3.5"
+ uid=7_394 StaticText "-110"
+ uid=7_395 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_396 StaticText "+3.5"
+ uid=7_397 StaticText "-110"
+ uid=7_398 StaticText "-3.5"
+ uid=7_399 StaticText "-110"
+ uid=7_400 StaticText "+3.5"
+ uid=7_401 StaticText "-107"
+ uid=7_402 StaticText "-3.5"
+ uid=7_403 StaticText "-120"
+ uid=7_404 StaticText "N/A"
+ uid=7_405 StaticText "N/A"
+ uid=7_406 StaticText "N/A"
+ uid=7_407 StaticText "N/A"
+ uid=7_408 StaticText "N/A"
+ uid=7_409 StaticText "N/A"
+ uid=7_410 StaticText "N/A"
+ uid=7_411 StaticText "N/A"
+ uid=7_412 StaticText "+3.5"
+ uid=7_413 StaticText "-110"
+ uid=7_414 StaticText "-3.5"
+ uid=7_415 StaticText "-110"
+ uid=7_416 StaticText "N/A"
+ uid=7_417 StaticText "N/A"
+ uid=7_418 StaticText "+3.5"
+ uid=7_419 StaticText "-110"
+ uid=7_420 StaticText "-3.5"
+ uid=7_421 StaticText "-110"
+ uid=7_422 StaticText "+3.5"
+ uid=7_423 StaticText "-110"
+ uid=7_424 StaticText "-3.5"
+ uid=7_425 StaticText "-110"
+ uid=7_426 StaticText "11:00 AM"
+ uid=7_427 link "Wichita State Team Icon Wichita State 851 Tulane Team Icon Tulane 852" url="https://www.actionnetwork.com/ncaab-game/wichita-state-tulane-score-odds-february-8-2026/276018"
+ uid=7_428 image "Wichita State Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/wich.png"
+ uid=7_429 StaticText "Wichita State"
+ uid=7_430 StaticText "851"
+ uid=7_431 image "Tulane Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/tul.png"
+ uid=7_432 StaticText "Tulane"
+ uid=7_433 StaticText "852"
+ uid=7_434 StaticText "-3.5"
+ uid=7_435 StaticText "-110"
+ uid=7_436 StaticText "+3.5"
+ uid=7_437 StaticText "-110"
+ uid=7_438 StaticText "-3.5"
+ uid=7_439 StaticText "-110"
+ uid=7_440 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_441 StaticText "+3.5"
+ uid=7_442 StaticText "-105"
+ uid=7_443 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_444 StaticText "-3.5"
+ uid=7_445 StaticText "-112"
+ uid=7_446 StaticText "+3.5"
+ uid=7_447 StaticText "-107"
+ uid=7_448 StaticText "-3.5"
+ uid=7_449 StaticText "-120"
+ uid=7_450 StaticText "+3.5"
+ uid=7_451 StaticText "-107"
+ uid=7_452 StaticText "N/A"
+ uid=7_453 StaticText "N/A"
+ uid=7_454 StaticText "N/A"
+ uid=7_455 StaticText "N/A"
+ uid=7_456 StaticText "N/A"
+ uid=7_457 StaticText "N/A"
+ uid=7_458 StaticText "N/A"
+ uid=7_459 StaticText "N/A"
+ uid=7_460 StaticText "-3.5"
+ uid=7_461 StaticText "-115"
+ uid=7_462 StaticText "+3.5"
+ uid=7_463 StaticText "-105"
+ uid=7_464 StaticText "N/A"
+ uid=7_465 StaticText "N/A"
+ uid=7_466 StaticText "-3.5"
+ uid=7_467 StaticText "-110"
+ uid=7_468 StaticText "+3.5"
+ uid=7_469 StaticText "-110"
+ uid=7_470 StaticText "-3.5"
+ uid=7_471 StaticText "-110"
+ uid=7_472 StaticText "+3.5"
+ uid=7_473 StaticText "-110"
+ uid=7_474 StaticText "11:00 AM"
+ uid=7_475 link "Charlotte Team Icon Charlotte 855 Memphis Team Icon Memphis 856" url="https://www.actionnetwork.com/ncaab-game/charlotte-memphis-score-odds-february-8-2026/276019"
+ uid=7_476 image "Charlotte Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/cha.png"
+ uid=7_477 StaticText "Charlotte"
+ uid=7_478 StaticText "855"
+ uid=7_479 image "Memphis Team Icon" url="https://assets.actionnetwork.com/26320_memphis.png"
+ uid=7_480 StaticText "Memphis"
+ uid=7_481 StaticText "856"
+ uid=7_482 StaticText "+8.5"
+ uid=7_483 StaticText "-111"
+ uid=7_484 StaticText "-8.5"
+ uid=7_485 StaticText "-108"
+ uid=7_486 StaticText "+9.5"
+ uid=7_487 StaticText "-115"
+ uid=7_488 image "BetMGM NJ Logo" url="https://assets.actionnetwork.com/32x32/40268_MGM48x48light@3x.webp"
+ uid=7_489 StaticText "-8.5"
+ uid=7_490 StaticText "-112"
+ uid=7_491 image "BetRivers NJ Logo" url="https://assets.actionnetwork.com/32x32/341144_BetRiver48x48light@2x.webp"
+ uid=7_492 StaticText "+9.5"
+ uid=7_493 StaticText "-115"
+ uid=7_494 StaticText "-9.5"
+ uid=7_495 StaticText "-105"
+ uid=7_496 StaticText "+8.5"
+ uid=7_497 StaticText "-114"
+ uid=7_498 StaticText "-8.5"
+ uid=7_499 StaticText "-112"
+ uid=7_500 StaticText "N/A"
+ uid=7_501 StaticText "N/A"
+ uid=7_502 StaticText "N/A"
+ uid=7_503 StaticText "N/A"
+ uid=7_504 StaticText "N/A"
+ uid=7_505 StaticText "N/A"
+ uid=7_506 StaticText "N/A"
+ uid=7_507 StaticText "N/A"
+ uid=7_508 StaticText "+9"
+ uid=7_509 StaticText "-110"
+ uid=7_510 StaticText "-9"
+ uid=7_511 StaticText "-110"
+ uid=7_512 StaticText "N/A"
+ uid=7_513 StaticText "N/A"
+ uid=7_514 StaticText "+9.5"
+ uid=7_515 StaticText "-115"
+ uid=7_516 StaticText "-9.5"
+ uid=7_517 StaticText "-105"
+ uid=7_518 StaticText "+9"
+ uid=7_519 StaticText "-110"
+ uid=7_520 StaticText "-9"
+ uid=7_521 StaticText "-110"
+ uid=7_522 StaticText "11:00 AM"
+ uid=7_523 link "Northwestern Team Icon Northwestern 861 Iowa Team Icon Iowa 862" url="https://www.actionnetwork.com/ncaab-game/northwestern-iowa-score-odds-february-8-2026/275065"
+ uid=7_524 image "Northwestern Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/nw.png"
+ uid=7_525 StaticText "Northwestern"
+ uid=7_526 StaticText "861"
+ uid=7_527 image "Iowa Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/iowd.png"
+ uid=7_528 StaticText "Iowa"
+ uid=7_529 StaticText "862"
+ uid=7_530 StaticText "+12.5"
+ uid=7_531 StaticText "-102"
+ uid=7_532 StaticText "-12.5"
+ uid=7_533 StaticText "-120"
+ uid=7_534 StaticText "+12.5"
+ uid=7_535 StaticText "-105"
+ uid=7_536 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_537 StaticText "-12.5"
+ uid=7_538 StaticText "-110"
+ uid=7_539 image "Caesars NJ Logo" url="https://assets.actionnetwork.com/32x32/463646_Caesars.webp"
+ uid=7_540 StaticText "+12.5"
+ uid=7_541 StaticText "-105"
+ uid=7_542 StaticText "-12.5"
+ uid=7_543 StaticText "-115"
+ uid=7_544 StaticText "+12.5"
+ uid=7_545 StaticText "-113"
+ uid=7_546 StaticText "-12.5"
+ uid=7_547 StaticText "-113"
+ uid=7_548 StaticText "N/A"
+ uid=7_549 StaticText "N/A"
+ uid=7_550 StaticText "N/A"
+ uid=7_551 StaticText "N/A"
+ uid=7_552 StaticText "N/A"
+ uid=7_553 StaticText "N/A"
+ uid=7_554 StaticText "N/A"
+ uid=7_555 StaticText "N/A"
+ uid=7_556 StaticText "+12.5"
+ uid=7_557 StaticText "-105"
+ uid=7_558 StaticText "-12.5"
+ uid=7_559 StaticText "-115"
+ uid=7_560 StaticText "N/A"
+ uid=7_561 StaticText "N/A"
+ uid=7_562 StaticText "+12.5"
+ uid=7_563 StaticText "-105"
+ uid=7_564 StaticText "-12.5"
+ uid=7_565 StaticText "-115"
+ uid=7_566 StaticText "+12.5"
+ uid=7_567 StaticText "-110"
+ uid=7_568 StaticText "-12.5"
+ uid=7_569 StaticText "-110"
+ uid=7_570 StaticText "12:00 PM"
+ uid=7_571 link "Rice Team Icon Rice 859 UAB Team Icon UAB 860" url="https://www.actionnetwork.com/ncaab-game/rice-uab-score-odds-february-8-2026/276017"
+ uid=7_572 image "Rice Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/riced.png"
+ uid=7_573 StaticText "Rice"
+ uid=7_574 StaticText "859"
+ uid=7_575 image "UAB Team Icon" url="https://static.sprtactn.co/teamlogos/ncaab/100/uab.png"
+ uid=7_576 StaticText "UAB"
+ uid=7_577 StaticText "860"
+ uid=7_578 StaticText "+8.5"
+ uid=7_579 StaticText "-111"
+ uid=7_580 StaticText "-8.5"
+ uid=7_581 StaticText "-108"
+ uid=7_582 StaticText "+8.5"
+ uid=7_583 StaticText "-110"
+ uid=7_584 image "Caesars NJ Logo" url="https://assets.actionnetwork.com/32x32/463646_Caesars.webp"
+ uid=7_585 StaticText "-8.5"
+ uid=7_586 StaticText "-105"
+ uid=7_587 image "bet365 NJ Logo" url="https://assets.actionnetwork.com/32x32/240081_Bet365ALT.webp"
+ uid=7_588 StaticText "+8.5"
+ uid=7_589 StaticText "-115"
+ uid=7_590 StaticText "-8.5"
+ uid=7_591 StaticText "-105"
+ uid=7_592 StaticText "+8.5"
+ uid=7_593 StaticText "-112"
+ uid=7_594 StaticText "-8.5"
+ uid=7_595 StaticText "-114"
+ uid=7_596 StaticText "N/A"
+ uid=7_597 StaticText "N/A"
+ uid=7_598 StaticText "N/A"
+ uid=7_599 StaticText "N/A"
+ uid=7_600 StaticText "N/A"
+ uid=7_601 StaticText "N/A"
+ uid=7_602 StaticText "N/A"
+ uid=7_603 StaticText "N/A"
+ uid=7_604 StaticText "+8.5"
+ uid=7_605 StaticText "-115"
+ uid=7_606 StaticText "-8.5"
+ uid=7_607 StaticText "-105"
+ uid=7_608 StaticText "N/A"
+ uid=7_609 StaticText "N/A"
+ uid=7_610 StaticText "+8.5"
+ uid=7_611 StaticText "-115"
+ uid=7_612 StaticText "-8.5"
+ uid=7_613 StaticText "-105"
+ uid=7_614 StaticText "+8.5"
+ uid=7_615 StaticText "-110"
+ uid=7_616 StaticText "-8.5"
+ uid=7_617 StaticText "-110"
+ uid=7_618 StaticText "12:00 PM"
+ uid=7_725 heading "Where Can I Bet on College Basketball?" level="2"
+ uid=7_726 StaticText "Legal online sports betting is currently available in many U.S. states, including Washington, D.C.! You can see the status of sports betting in your state with "
+ uid=7_727 link "Action Network's legalization tracker" url="https://www.actionnetwork.com/online-sports-betting"
+ uid=7_728 StaticText "Action Network's legalization tracker"
+ uid=7_729 StaticText ". There are several states that are in the process of creating or voting on legislation that would legalize mobile sports betting. Be sure to follow the news if you're in a pending state. "
+ uid=7_730 heading "College Basketball Odds" level="2"
+ uid=7_731 StaticText "College basketball is one of the most exciting sports to bet, especially come tournament time. March Madness brackets aside, there are a variety of ways that residents of states where sports betting is legalized can get in on the action. This page will help those new to betting understand some of the basic terminology as well as provide details on how to read and understand college basketball odds. We will use a game between the "
+ uid=7_732 link "Iowa Hawkeyes" url="https://www.actionnetwork.com/ncaab/odds/iowa-hawkeyes"
+ uid=7_733 StaticText "Iowa Hawkeyes"
+ uid=7_734 StaticText " and the "
+ uid=7_735 link "Maryland Terrapins" url="https://www.actionnetwork.com/ncaab/odds/maryland-terrapins"
+ uid=7_736 StaticText "Maryland Terrapins"
+ uid=7_737 StaticText " as an example throughout this college basketball betting guide."
+ uid=7_738 heading "College Basketball Odds Table Example" level="2"
+ uid=7_739 heading "Types of College Basketball Bets" level="2"
+ uid=7_740 StaticText "Understanding the various types of college basketball bets, and how they pay out, is a must before placing a wager with any confidence. The most common betting types that you will encounter with college basketball are:"
+ uid=7_741 StaticText "Moneyline"
+ uid=7_742 link "Against the Spread" url="https://www.actionnetwork.com/education/point-spread"
+ uid=7_743 StaticText "Against the Spread"
+ uid=7_744 StaticText "Over/Under Totals"
+ uid=7_745 heading "College Basketball Moneylines" level="3"
+ uid=7_746 StaticText "College basketball "
+ uid=7_747 link "moneyline bets" url="https://www.actionnetwork.com/education/moneyline"
+ uid=7_748 StaticText "moneyline bets"
+ uid=7_749 StaticText " simply require a bettor to select the winner of a particular contest. Almost every single game will have a favorite team and an underdog team. The “"
+ uid=7_750 link "favorite" url="https://www.actionnetwork.com/education/favorite"
+ uid=7_751 StaticText "favorite"
+ uid=7_752 StaticText "” is the team that is expected to win the contest, and conversely, the “"
+ uid=7_753 link "underdog" url="https://www.actionnetwork.com/education/underdog"
+ uid=7_754 StaticText "underdog"
+ uid=7_755 StaticText "” is the team that is expected to lose. You can tell which team is "
+ uid=7_756 StaticText "the favorite in an odds table as they will be designated with a minus sign (-) while the underdog will be given a plus sign (+). "
+ uid=7_757 StaticText "Here's how the moneyline bet looks for our selected college basketball matchup:"
+ uid=7_758 image "College Basketball Moneylines Example"
+ uid=7_759 StaticText "You can tell that Iowa is the favored team in this game because they have a -177 moneyline designation. Maryland is the underdog with the +145 moneyline. "
+ uid=7_760 StaticText "The specific numbers next to the plus or minus signs indicate the associated payout amount for betting on each team. "
+ uid=7_761 StaticText "In the Iowa vs. Maryland matchup, a $100 bet on Maryland at +145 odds would result in a $145 profit if they won the game. This would give you a total payout of $245 with your original $100 wager included. By contrast, Iowa's moneyline of -177 means that a bettor would win $100 for every $177 invested. Bettors must risk more money to profit when placing a wager on a favorite as compared to an underdog."
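The moneyline arithmetic above can be sketched as a couple of small Python helpers (the function names are illustrative, not part of any sportsbook API); the implied-probability conversion is the standard one for American odds:

```python
def profit_on_win(american_odds: float, stake: float) -> float:
    """Profit (stake excluded) for a winning bet at American odds."""
    if american_odds > 0:                    # underdog, e.g. +145
        return stake * american_odds / 100
    return stake * 100 / -american_odds      # favorite, e.g. -177

def implied_probability(american_odds: float) -> float:
    """Break-even win probability implied by American odds (vig included)."""
    if american_odds > 0:
        return 100 / (american_odds + 100)
    return -american_odds / (-american_odds + 100)

# Maryland +145: a $100 bet profits $145, for a $245 total payout.
print(profit_on_win(145, 100))               # 145.0
# Iowa -177: risk $177 to profit $100.
print(profit_on_win(-177, 177))              # 100.0
# The -177 price implies roughly a 64% break-even win rate.
print(round(implied_probability(-177), 3))   # 0.639
```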
+ uid=7_762 heading "College Basketball Point Spreads" level="3"
+ uid=7_763 StaticText "The point spread may also be referred to as a margin-of-victory bet, betting against the spread, or simply the spread. In this type of wager, the bettor has to correctly pick whether a team will win or lose by a specific number of points. "
+ uid=7_764 StaticText "To cover a spread, the selected team must beat the point spread that was assigned by oddsmakers for a particular contest."
+ uid=7_765 StaticText " Similar to a moneyline bet, a minus sign (-) is assigned to the favorite team. The number next to this minus sign is how many points the favored team has to win by in order to cover the bet. A bet on the underdog will win if the team wins the game outright or loses by less than the allotted point spread. Here's the spread for Iowa vs. Maryland:"
+ uid=7_766 image "College Basketball Point Spreads Example"
+ uid=7_767 StaticText "Aside from the spread number, bettors also need to be aware of the"
+ uid=7_768 StaticText " "
+ uid=7_769 link "juice" url="https://www.actionnetwork.com/education/juice"
+ uid=7_770 StaticText "juice"
+ uid=7_771 StaticText ", or the vig, which is the “tax” that bettors pay to a sportsbook to place their wager"
+ uid=7_772 StaticText ". Typically, you'll see this number directly below the spread in a smaller font. In this game, each team has their own juice number, meaning that the tax you pay to the sportsbook will be different depending on the team you wager on. The juice is -112 for Iowa and -108 for Maryland. "
+ uid=7_773 StaticText "The vig for the spread works the same as a moneyline when calculating a bet's potential payout. Betting on Iowa would net someone $100 for every $112 invested if they covered the spread whereas a bettor would earn $100 for every $108 invested on Maryland if they covered the spread."
+ uid=7_774 StaticText "What happens, you may wonder, if Iowa wins the game by exactly 4 points? This scenario is what is known as a "
+ uid=7_775 StaticText "“"
+ uid=7_776 link "push" url="https://www.actionnetwork.com/education/push"
+ uid=7_777 StaticText "push"
+ uid=7_778 StaticText "”. A push means that neither team covered the spread, and you will get back the money that you placed on the wager"
+ uid=7_779 StaticText ". In some instances, oddsmakers will set an even spread, which means that they see each team as likely to win the contest as the other. This is commonly referred to as a “"
+ uid=7_780 link "pick ‘em" url="https://www.actionnetwork.com/education/pickem"
+ uid=7_781 StaticText "pick ‘em"
+ uid=7_782 StaticText "” matchup."
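Under those rules, grading a spread bet on the favorite reduces to a three-way comparison. A minimal sketch (the function name is hypothetical, and the spread is assumed to be stored as a positive magnitude, e.g. 4 for a -4 favorite):

```python
def settle_favorite_spread(winning_margin: float, spread: float) -> str:
    """Grade a bet on the favorite laying `spread` points.

    winning_margin is the favorite's final margin; negative means it lost outright.
    """
    if winning_margin > spread:
        return "win"    # favorite covered
    if winning_margin == spread:
        return "push"   # neither side covers; stakes are refunded
    return "loss"       # underdog covered

# Iowa -4: a 5-point win covers, an exact 4-point win pushes.
print(settle_favorite_spread(5, 4))   # win
print(settle_favorite_spread(4, 4))   # push
print(settle_favorite_spread(3, 4))   # loss
```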
+ uid=7_783 heading "College Basketball Over/Unders" level="3"
+ uid=7_784 StaticText "Over/Unders, or totals, are another typical college basketball bet. Betting "
+ uid=7_785 link "over/under" url="https://www.actionnetwork.com/education/over-under-total"
+ uid=7_786 StaticText "over/under"
+ uid=7_787 StaticText " means deciding if both teams will combine to score more or less than a specific point total assigned by oddsmakers for the contest. The winner or loser of the game is irrelevant in this wager. The bettor is only concerned with the combined point total regardless of the outcome."
+ uid=7_788 image "College Basketball Over/Unders Example"
+ uid=7_789 StaticText "Oddsmakers have set a total of 146.5 points for this Iowa vs. Maryland game. To win on an “Over” bet, the two teams must combine to score 147 points or more. To win on an “Under” bet, the two teams must combine to score 146 points or less."
+ uid=7_790 StaticText "Over/Under bets also have juice, which is indicated underneath the over/under totals in the table. For Iowa vs. Maryland, bettors will have to invest $108 for every $100 worth of profit when betting the over and $112 for every $100 worth of profit when betting the under."
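Grading a total works the same way against the combined score. A sketch with an illustrative function name (note a half-point line like 146.5 can never push):

```python
def settle_total(combined_score: float, line: float, side: str) -> str:
    """Grade an over/under bet against a combined-score line."""
    if combined_score == line:
        return "push"                  # only possible on whole-number lines
    result = "over" if combined_score > line else "under"
    return "win" if result == side else "loss"

# Iowa vs. Maryland, total 146.5:
print(settle_total(147, 146.5, "over"))    # win
print(settle_total(146, 146.5, "over"))    # loss
print(settle_total(146, 146.5, "under"))   # win
```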
+ uid=7_791 heading "How to Use the Action Network's College Basketball Odds Page" level="2"
+ uid=7_792 StaticText "You will likely notice that moneylines, point spreads, and over/under totals vary from one sportsbook to the next for the same college basketball game. Seasoned bettors know that it pays to “shop for lines” before making a wager. All this means is that it is best practice to look at each sportsbook's line for a game that you are interested in betting on to ensure that you are putting your money in the place that offers the highest potential payout. To make this work easier for you, The Action Network has compiled odds from each sportsbook for all of the college basketball games in a given day. "
+ uid=7_793 StaticText "Here are a few notes on how to get the most out of the College Basketball Odds page."
+ uid=7_794 StaticText "You’ll never have to guess which sportsbook has the best odds for a given game. The Action Network pulls in up-to-the-minute odds from every U.S. sportsbook and denotes the most favorable odds in the “Best Odds” column for every bet type. This not only saves you time, but also helps ensure that you're always putting yourself in the best position to earn more per wager."
+ uid=7_795 StaticText "Filter games by specific bet types (moneyline, spread, over/unders) or show all three at once. Whether you want to see a holistic view of the day's action or drill into a specific betting market, our odds page is customizable to meet your preference."
+ uid=7_796 StaticText "Action Network subscribers can further customize the page by only displaying the sportsbooks that you have an account with. This helps simplify the page and streamline your betting process."
+ uid=7_797 StaticText "Sportsbooks offer odds on specific timeframes within a game in addition to just the final outcome. For college basketball, this often includes specific odds for each quarter or half. Our College Basketball Odds page allows you to filter between the various game timeframes to bet a contest in a variety of different ways as it plays out in real-time."
+ uid=7_798 StaticText "Sports betting is not currently legal in every state. As such, The Action Network College Basketball Odds page will allow you to see the sportsbook odds and prices available to you depending on where you are in the United States."
+ uid=7_799 heading "Other Ways to Bet on College Basketball" level="2"
+ uid=7_800 StaticText "There are other options available for those looking for more ways to get into college basketball betting, including:"
+ uid=7_801 StaticText "Parlay"
+ uid=7_802 StaticText "Futures"
+ uid=7_803 StaticText "Player Props"
+ uid=7_804 StaticText "Daily Fantasy"
+ uid=7_805 heading "College Basketball Parlays" level="3"
+ uid=7_806 StaticText "A "
+ uid=7_807 link "parlay" url="https://www.actionnetwork.com/education/parlay"
+ uid=7_808 StaticText "parlay"
+ uid=7_809 StaticText " is combining two or more wagers into a single betting ticket with the goal of increasing the potential payout. A bettor must win every wager on the ticket in order to win the parlay. "
+ uid=7_810 StaticText "While this makes winning more difficult, parlay bets increase your potential earnings. The more bets added to a parlay, the higher the payout potential."
+ uid=7_811 StaticText "Bettors can create a parlay by grouping any type of bets together, like a moneyline and an over/under total. Let's create an Iowa moneyline and over 146.5 parlay using our sample matchup."
+ uid=7_812 heading "College Basketball Parlays Example" level="2"
+ uid=7_813 StaticText "Here’s how the two bets pay out separately with a $100 budget:"
+ uid=7_814 StaticText "Iowa moneyline (-177) at $50"
+ uid=7_815 StaticText "The potential winnings are $28.25 for a total payout of $78.25 including the original $50 risked."
+ uid=7_816 StaticText "Over 146.5 (-108) at $50"
+ uid=7_817 StaticText "The potential winnings are $46.30 for a total payout of $96.30 including the original $50 risked."
+ uid=7_818 StaticText "A bettor has the potential to win $74.55 in payouts for a total of $174.55 including the original $100 risked across the two separate bets."
+ uid=7_819 StaticText "Here’s how the same bets would pay out as a parlay:"
+ uid=7_820 StaticText "Iowa moneyline (-177) AND Over 146.5 (-108) at $100"
+ uid=7_821 StaticText "The potential winnings are $201.41 for a total payout of $301.41 including the original $100 risked."
+ uid=7_822 StaticText "The payout potential is significantly higher for the parlay than for the two separate bets, even though the same total amount is risked across identical bets. Because every bet on the ticket has to win, oddsmakers will give you better payout odds as you add more wagers to your parlay."
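The parlay math above is just a product of decimal-odds multipliers. A sketch with illustrative helper names (the result agrees with the $301.41 quoted above to within a cent of rounding):

```python
def american_to_decimal(odds: float) -> float:
    """American odds -> decimal odds (total payout per $1 staked)."""
    return 1 + (odds / 100 if odds > 0 else 100 / -odds)

def parlay_payout(stake: float, legs: list[float]) -> float:
    """Total payout (stake included) if every American-odds leg wins."""
    multiplier = 1.0
    for leg in legs:
        multiplier *= american_to_decimal(leg)
    return stake * multiplier

# Iowa moneyline -177 AND Over 146.5 at -108, $100 staked:
payout = parlay_payout(100, [-177, -108])
print(round(payout, 2))         # total payout, stake included
print(round(payout - 100, 2))   # winnings
```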
+ uid=7_823 StaticText "Need help calculating your parlay payouts? Don’t forget to take advantage of The Action Network’s "
+ uid=7_824 link "Betting Odds Calculator" url="https://www.actionnetwork.com/betting-calculators/betting-odds-calculator"
+ uid=7_825 StaticText "Betting Odds Calculator"
+ uid=7_826 StaticText "."
+ uid=7_827 heading "College Basketball Futures" level="3"
+ uid=7_828 StaticText "Futures bets are exactly what they sound like: betting on an event whose outcome will be settled in the future (not within the same day or week). Futures bets exclusively deal with moneylines. Examples of futures betting in college basketball include placing a wager on a team to win its conference tournament, make the "
+ uid=7_829 link "Final Four" url="https://www.actionnetwork.com/ncaab/futures/final-four-odds"
+ uid=7_830 StaticText "Final Four"
+ uid=7_831 StaticText ", or win the "
+ uid=7_832 link "NCAA Tournament" url="https://www.actionnetwork.com/ncaab/futures/ncaa-tournament-title-odds"
+ uid=7_833 StaticText "NCAA Tournament"
+ uid=7_834 StaticText "."
+ uid=7_835 StaticText "Futures bets pay out very well because they are incredibly difficult to get right. Typically, a futures bet is placed at the beginning of the college basketball season and requires a bettor to accurately predict an outcome that is months into the future. The best odds are available earlier in the college basketball season, but bettors are able to make futures bets throughout the course of the season as well. The odds are updated regularly to reflect a given team's performance throughout the course of the year, so the odds may not be as favorable later on."
+ uid=7_836 heading "College Basketball Prop Bets" level="3"
+ uid=7_837 StaticText "Want to bet on particular college basketball players rather than just the teams? This is where prop bets come into play. Oddsmakers will also set over/under totals for player stats in a given contest. A prop bet enables bettors to wager on whether or not a player will score a certain amount of points, get a certain amount of rebounds, etc. "
+ uid=7_838 StaticText "Need help evaluating prop bet wagers? The "
+ uid=7_839 link "Action Labs" url="https://labs.actionnetwork.com/"
+ uid=7_840 StaticText "Action Labs"
+ uid=7_841 StaticText " Prop Tool will help you organize, sort, and grade hundreds of prop bet odds throughout the College Basketball season. Be sure to check in daily to keep up with the college basketball action throughout the season."
+ uid=7_842 heading "College Basketball Daily Fantasy" level="3"
+ uid=7_843 StaticText "Sports betting is not currently legal in every state. Daily Fantasy Sports (DFS) has filled the void as a sports betting alternative in those places where it is not legal. "
+ uid=7_844 StaticText "The traditional fantasy options require users to draft a team and set a lineup each day to compete against other players. This model of daily fantasy was made popular by the likes of "
+ uid=7_845 link "DraftKings" url="https://www.actionnetwork.com/online-sports-betting/reviews/draftkings"
+ uid=7_846 StaticText "DraftKings"
+ uid=7_847 StaticText " and "
+ uid=7_848 link "FanDuel" url="https://www.actionnetwork.com/online-sports-betting/reviews/fanduel"
+ uid=7_849 StaticText "FanDuel"
+ uid=7_850 StaticText ". Newer DFS games like "
+ uid=7_851 link "PrizePicks" url="https://www.actionnetwork.com/online-sports-betting/reviews/prizepicks"
+ uid=7_852 StaticText "PrizePicks"
+ uid=7_853 StaticText " are gaining traction amongst fans; in these games, bettors place wagers on player props rather than competing against other users."
+ uid=7_854 StaticText "Get a competitive edge using Action’s "
+ uid=7_855 link "FantasyLabs" url="https://www.fantasylabs.com/"
+ uid=7_856 StaticText "FantasyLabs"
+ uid=7_857 StaticText " to break down player stats and matchups to set a winning lineup each week."
+ uid=7_858 heading "College Basketball Betting Help" level="2"
+ uid=7_859 StaticText "After reading this College Basketball betting guide, you should be able to read and understand an odds table. Being able to read an odds table, however, does not necessarily guarantee betting success. "
+ uid=7_860 StaticText "Here are a few tools from The Action Network that can help improve your betting win rate in college basketball."
+ uid=7_861 heading "College Basketball Public Betting Percentages" level="3"
+ uid=7_862 StaticText "The Action Network collects a vast amount of betting data on college basketball games, including the number of bets and how much money is placed on them. We condense this information into an easy-to-read interactive chart that helps you to assess what bets are most popular amongst the public and where the majority of the money is being wagered. Whether you choose to go along with how the majority has placed its bets, or you choose to break from the pack, is up to you."
+ uid=7_863 image "College Basketball Public Betting Percentages Example"
+ uid=7_865 StaticText "The Action Network "
+ uid=7_866 link "NFL Public Betting" url="https://www.actionnetwork.com/nfl/public-betting"
+ uid=7_867 StaticText "NFL Public Betting"
+ uid=7_868 StaticText " page enables you to filter between bet types as well so you can see the public sentiment across moneyline, spread, and totals."
+ uid=7_869 heading "PRO Projections" level="3"
+ uid=7_870 StaticText "Our in-house College Basketball experts evaluate daily matchups based on a multitude of factors including recent team performance, player value, injuries, and more. Condensing all these factors, The Action Network grades each matchup and provides an edge percentage to let you know which bets are most likely to succeed. Subscribe to"
+ uid=7_871 link " Action PRO" url="https://www.actionnetwork.com/upgrade"
+ uid=7_872 StaticText " Action PRO"
+ uid=7_873 StaticText " to get unlimited access to our College Basketball projections."
+ uid=7_874 image "College Basketball PRO Projections Example"
+ uid=7_875 heading "PRO Report" level="3"
+ uid=7_876 StaticText "If you want to make betting decisions for yourself, but don’t have the time to collect all the data, check out our "
+ uid=7_877 link "College Basketball PRO Report" url="https://www.actionnetwork.com/ncaab/sharp-report"
+ uid=7_878 StaticText "College Basketball PRO Report"
+ uid=7_879 StaticText ". This analysis highlights five key betting signals: big money, sharp action, expert projections, expert picks, and historical betting systems."
+ uid=7_880 image "College Basketball PRO Report Example"
+ uid=7_881 heading "College Basketball Expert Picks" level="3"
+ uid=7_882 StaticText "If you’re in a time crunch or just want to leave it to the experts, our College Basketball Picks page is the place to go. Here, you can see how experts are picking a particular contest across various bet types and odds. Check this page regularly during the season to see how you stack up against the experts."
+ uid=7_883 image "College Basketball Expert Picks Example"
+ uid=7_884 heading "College Basketball Odds FAQ" level="2"
+ uid=7_885 StaticText "Frequently Asked Questions"
+ uid=7_886 StaticText "How do I read CBB point spreads?"
+ uid=7_887 image "Right Arrow"
+ uid=7_888 StaticText "What is an over/under or total in college basketball?"
+ uid=7_889 image "Right Arrow"
+ uid=7_890 StaticText "How do Moneylines work for basketball?"
+ uid=7_891 image "Right Arrow"
+ uid=7_892 StaticText "Where Can I Bet on College Basketball?"
+ uid=7_893 image "Right Arrow"
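The moneyline question in the FAQ above maps directly to this pipeline's canonical rule: American odds are never aggregated raw, but converted to implied probability first. A minimal sketch of that conversion (function name is my own, not from the page):

```python
def american_to_implied_prob(odds: int) -> float:
    """Convert American moneyline odds to implied win probability.

    Negative odds (favorite): |odds| / (|odds| + 100)
    Positive odds (underdog): 100 / (odds + 100)
    Note: book probabilities include vig, so both sides sum to > 1.
    """
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)

# Example: a -150 favorite vs. a +130 underdog
fav = american_to_implied_prob(-150)  # 0.6
dog = american_to_implied_prob(130)   # ~0.4348
overround = fav + dog                 # > 1.0: the bookmaker's margin (vig)
```

Because of the vig, implied probabilities should be normalized (e.g. divided by the overround) before being used as model features or averaged across books.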