Agentify the status page insights #12

Merged: mbrailtown merged 19 commits into main from matthew/csharp-example on Jun 19, 2026
@mbrailtown

Updated C# app so it can use a remote agent instead of the inline API calls.

Added a Python example agent that uses railtracks and can work as the remote agent for the C# app.

mbrailtown and others added 17 commits June 12, 2026 11:04
Ports the CSharp DailyInsightService functionality to a standalone Python
FastAPI service. Drives an Anthropic Claude agent through Railtracks with a
single `@rt.function_node` tool that calls the Railengine Python SDK directly
(replaces the C# MCP attachment). POST /insight returns the same one-line-per-
metric plain-text summary so the existing /api/insight card can render it
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t:AgentUrl

When DailyInsight:AgentUrl is set, the 24h loop POSTs to {AgentUrl}/insight
instead of calling Anthropic + MCP inline; Anthropic:ApiKey becomes optional
since the agent owns the LLM call. Leave AgentUrl empty (or unset) to keep the
existing inline path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ken is set

Lets the C# app talk to a daily-insight agent endpoint sitting behind a
bearer-auth reverse proxy. The token is read from configuration and, when
non-empty, attached as Authorization: Bearer <token> on every /insight POST.
When it is left blank, the request goes unauthenticated, matching the
existing local-dev behavior.
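
The request shape described above can be sketched in a few lines. The real
client is the C# app; this Python version is only a language-neutral
illustration. The helper name `build_insight_request` and the empty JSON body
are assumptions, not taken from the PR.

```python
import urllib.request


def build_insight_request(agent_url: str, token: str = "") -> urllib.request.Request:
    """Build (but do not send) the POST to {agent_url}/insight."""
    url = agent_url.rstrip("/") + "/insight"
    headers = {"Content-Type": "application/json"}
    if token:
        # Only attach the header when a token is configured; a blank token
        # keeps the unauthenticated local-dev behavior.
        headers["Authorization"] = f"Bearer {token}"
    # The body is an assumption: the PR does not say what /insight expects.
    return urllib.request.Request(url, data=b"{}", headers=headers, method="POST")
```

Sending it would be `urllib.request.urlopen(build_insight_request(url, token))`.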

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… env vars

Dockerfile installs the daily_insight package via pyproject.toml and runs
uvicorn on :8000 — standard container shape for any deploy target.

configure_runtime_env bridges LLM_API_KEY <-> ANTHROPIC_API_KEY at startup so
the same setting works under either name; the Anthropic SDK still reads
ANTHROPIC_API_KEY internally. INSIGHT_MODEL renamed to the conventional
LLM_MODEL. README leads with the provider-neutral names and documents the
mapping in one row instead of two.
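
A minimal sketch of what such a bridge might look like. Only the function name
and the two variable names come from this commit; the `env` parameter and the
rule that nothing changes when both names are already set are assumptions.

```python
import os
from typing import MutableMapping, Optional


def configure_runtime_env(env: Optional[MutableMapping[str, str]] = None) -> None:
    """Mirror LLM_API_KEY and ANTHROPIC_API_KEY so either name works."""
    env = os.environ if env is None else env
    llm_key = env.get("LLM_API_KEY")
    anthropic_key = env.get("ANTHROPIC_API_KEY")
    if llm_key and not anthropic_key:
        # The Anthropic SDK reads ANTHROPIC_API_KEY, so fill it in.
        env["ANTHROPIC_API_KEY"] = llm_key
    elif anthropic_key and not llm_key:
        env["LLM_API_KEY"] = anthropic_key
    # Assumption: when both are set, leave them untouched.
```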

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that together turn opaque 500s into actionable diagnostics:

1. railtownai.init() in the FastAPI lifespan when RAILTOWN_API_KEY is set.
   The SDK attaches a RailtownHandler to Python's root logger, so any
   logger.exception/error call downstream ships to Railtown automatically.
   No-op when the key is unset — agent runs normally.

2. @app.exception_handler(Exception) that logs the traceback and returns a
   JSON 500 body with the exception type and message instead of FastAPI's
   default plain-text "Internal Server Error". Replaces the per-endpoint
   try/except in /insight so unhandled errors from any route get the same
   treatment.
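
The core of the second change can be sketched without FastAPI. The helper
below is hypothetical: it logs the traceback and builds the status code and
JSON body that a function registered with @app.exception_handler(Exception)
could wrap in a JSON response. The key names `error` and `message` are
assumptions.

```python
import logging
from typing import Dict, Tuple

logger = logging.getLogger("daily_insight")


def error_response(exc: Exception) -> Tuple[int, Dict[str, str]]:
    """Log the traceback and describe the failure as a JSON-ready body."""
    # exc_info=exc attaches the traceback of the caught exception, so any
    # handler on the root logger receives the full stack.
    logger.error("Unhandled error: %s", exc, exc_info=exc)
    return 500, {"error": type(exc).__name__, "message": str(exc)}
```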

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches InsightService from rt.Flow(...).ainvoke() to the rt.Session +
rt.call() pattern so we can grab session.payload() and hand it to
railtownai.upload_agent_run(). The payload contains the nodes / edges /
steps that drive the Railtracks viz UI in Conductr — useful for inspecting
which tools the agent called and what prompts it used.

Skipped silently when railtownai.init() hasn't run (no RAILTOWN_API_KEY) so
local invocations without observability keep working unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ailtracks evaluators

Self-contained smoke test for the deployed agent. Body:
  { "sample_size": 1, "agent_run_id": "<uuid-or-null>" }

sample_size (default 1, capped at 5) drives how many fresh /insight runs the
endpoint generates before scoring. agent_run_id is accepted for forward
compatibility — currently logged and ignored, will identify a historical
session to evaluate in a follow-up.

Three evaluators per call:
- ToolUseEvaluator (free) — checks the "at most 1 get_recent_metrics call" contract
- LLMInferenceEvaluator (free) — checks LLM call latency/tokens/errors
- JudgeEvaluator with two custom Categorical metrics:
    FormatCompliance  (Compliant / MinorDeviation / MajorDeviation)
    FactualGrounding  (FullyGrounded / PartiallyGrounded / Hallucinated)

The judge uses AnthropicLLM with the agent's same key (LLM_API_KEY bridge);
override the judge model via EVAL_JUDGE_MODEL. When RAILTOWN_API_KEY is set,
each EvaluationResult uploads to Conductr via railtownai.upload_agent_evaluation
through evals.evaluate's payload_callback hook.

Sessions for extract_agent_data_points are staged in a request-scoped
tempfile.TemporaryDirectory so concurrent /evaluate calls don't see each
other's session payloads. agent_selection=False + agents=[...] keeps
evals.evaluate headless (it would otherwise hang on rich.prompt.Prompt.ask).
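
The request-scoped staging can be illustrated with the standard library
alone. In this sketch the file naming and the `evaluate` callback are
hypothetical stand-ins; the point is that each request gets its own directory,
which is deleted when the request finishes.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def run_with_staged_sessions(
    payloads: List[Dict[str, Any]],
    evaluate: Callable[[Path], Any],
) -> Any:
    """Write each session payload to a private temp dir, then evaluate it."""
    with tempfile.TemporaryDirectory() as tmp:
        staging_dir = Path(tmp)
        for index, payload in enumerate(payloads):
            path = staging_dir / f"session_{index}.json"
            path.write_text(json.dumps(payload), encoding="utf-8")
        # Concurrent requests each get a different directory, so they never
        # read one another's files.
        return evaluate(staging_dir)
    # The directory and its files are removed when the with-block exits.
```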

Bumps railtracks[visual] to >=1.4.0 for the tool-eval-requires-multiple-
sessions bugfix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 1.4.0 release of railtracks raises ValueError in ToolUseEvaluator when
fewer than 2 aggregate nodes exist per tool, which kills /evaluate calls
with sample_size=1 (our default). The fix landed on main 2026-06-11 but
hasn't shipped to PyPI yet, so we pin to the specific commit via PEP 508's
git URL syntax. Swap back to a version range once 1.4.1+ releases.

Dockerfile gains a minimal `apt-get install git` step because python:3.10-slim
lacks git and pip needs it to clone the pinned commit at build time.

Fix being pinned:
RailtownAI/railtracks@4e4ed57

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pinned commit installation failed during the ACR build:

    error: Multiple top-level packages discovered in a flat-layout:
    ['pdoc', 'packages'].

The railtracks repo is a monorepo (packages/, pdoc/, docs/, etc. at the
root), so setuptools' flat-layout auto-discovery refuses to guess. The
actual package lives at packages/railtracks/ with its own pyproject.toml
+ src/ layout. PEP 508's #subdirectory= URL fragment tells pip to enter
that subdirectory before running the build.

The PyPI wheel sidesteps this entirely (it's pre-built), so this hint
will go away when we revert to a >=1.4.1 version range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /evaluate gains a second mode. When the body sets agent_run_id, the
service fetches that single session from Conductr's platform API via
railtownai.get_agent_runs([str(agent_run_id)]) (new in railtownai 2.0.14),
stages the returned payload in the same request-scoped tempdir the fresh
mode uses, and runs evaluators against just that session. sample_size is
ignored in this mode and insights=[] in the response since no fresh
generation happens.

Fresh-mode behaviour is unchanged. AgentRunsNotInitializedError /
AgentRunFetchError bubble up to the global exception handler and surface
in the 500 JSON body — clean signal for the operator when CONDUCTR_PROJECT_PAT
or CONDUCTR_PROJECT_ID is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /evals/ namespace leaves room for future eval-related endpoints (list,
retrieve a past evaluation by id, etc.) without crowding the top-level
route table. /evals/run is the verb-y entry point that triggers an
evaluation; siblings under /evals/ would be CRUD-y reads against persisted
results.

Wire-level breaking change for the endpoint URL only — request and response
shapes are unchanged. The renamed handler is now run_evaluation (the
previous `evaluate` name shadowed the imported evals function in some
contexts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
railtownai.upload_agent_evaluation catches all non-config exceptions and
returns False without raising or logging — HTTP-layer failures (token
rejected by Conductr, ingestion endpoint unreachable, rail-engine-ingest
errors) look indistinguishable from success at the call site. The SDK
also explicitly suppresses rail-engine-ingest INFO logs (so the underlying
HTTP response never reaches our root logger).

Three changes:

1. Gate on EVALUATIONS_API_TOKEN (the actual prerequisite) instead of
   railtownai.get_railtown_handler() (which only signals RAILTOWN_API_KEY
   init — a different feature).

2. Check the return value of upload_agent_evaluation. The previous code
   logged "uploaded to Conductr" on every call regardless of outcome.

3. Log loudly when the SDK returns False so silent failures become visible
   in container logs, with diagnostic guidance pointing at the suppressed
   rail-engine-ingest logger.
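
Taken together, the three changes amount to something like the sketch below.
The `upload` argument is a stand-in for railtownai.upload_agent_evaluation,
and the function name, return strings, and log wording are all assumptions
made for illustration.

```python
import logging
import os
from typing import Callable, Mapping, Optional, Sequence

logger = logging.getLogger("daily_insight.evaluation")


def upload_evaluations(
    results: Sequence[object],
    upload: Callable[[Sequence[object]], bool],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Upload results only when configured, and report the real outcome."""
    env = os.environ if env is None else env
    # 1. Gate on the actual prerequisite for uploads.
    if not env.get("EVALUATIONS_API_TOKEN"):
        logger.info("EVALUATIONS_API_TOKEN not set; skipping upload")
        return "skipped"
    # 2. Check the return value instead of assuming success.
    if upload(results):
        logger.info("Uploaded %d evaluation result(s)", len(results))
        return "uploaded"
    # 3. Make the silent failure visible in container logs.
    logger.error(
        "Upload returned False; the SDK hides the cause. "
        "Try enabling the rail-engine-ingest logger to see the HTTP response."
    )
    return "failed"
```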

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ck hook

railtownai.upload_agent_evaluation is a sync function that drives its async
implementation via asyncio.run(). When invoked from inside FastAPI's running
event loop (which is where evals.evaluate's payload_callback fires from),
asyncio.run() raises RuntimeError — the SDK catches that as a generic
Exception and returns False without surfacing the cause. The unscheduled
coroutine leaks a "coroutine was never awaited" warning to stderr.

Confirmed by container logs after the previous diagnostic-logging commit:

    RuntimeWarning: coroutine '_upload_agent_evaluation_async' was never awaited
    return False

Fix: drop the payload_callback hook, iterate evaluation_results after
evals.evaluate() returns, and run each upload via asyncio.to_thread so the
SDK gets the clean thread-local state (no running loop) it expects. Batches
the whole list into a single SDK call for efficiency — one ingest session
instead of one per result.
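
The failure and the fix can be reproduced with the standard library. In this
sketch, `upload_sync` is a hypothetical stand-in that mimics the SDK behavior
described above (asyncio.run() inside a blanket except that returns False),
and `handler` plays the role of code running on FastAPI's event loop.

```python
import asyncio
from typing import List, Tuple


async def _upload_async(results: List[str]) -> bool:
    await asyncio.sleep(0)  # pretend to do network I/O
    return True


def upload_sync(results: List[str]) -> bool:
    """Sync wrapper that drives the async implementation, like the SDK."""
    try:
        return asyncio.run(_upload_async(results))
    except Exception:
        # Swallows the RuntimeError raised when a loop is already running.
        return False


async def handler(results: List[str]) -> Tuple[bool, bool]:
    # Called directly on the running loop: asyncio.run() refuses to start,
    # so the wrapper returns False (and the coroutine is never awaited).
    direct = upload_sync(results)
    # Called in a worker thread: no loop is running there, so it succeeds.
    threaded = await asyncio.to_thread(upload_sync, results)
    return direct, threaded
```

Running `asyncio.run(handler(["r1", "r2"]))` yields `(False, True)`, along
with the same "never awaited" RuntimeWarning on stderr for the direct call.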

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Historical mode now produces names like:
  daily-insight-<agent-run-uuid>-20260612T233526Z

Fresh mode keeps the original timestamp-only shape since there's no single
run id to attach (each of the N generated sessions has its own).

Makes it trivial to grep Conductr's evaluation list for "did I evaluate
this specific session" — previously the only way to correlate was via
timestamp.
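
A naming helper along these lines would produce the shape shown above. The
function name is hypothetical, and the fresh-mode form `daily-insight-<ts>` is
an assumption, since the commit only calls it "timestamp-only".

```python
from datetime import datetime, timezone
from typing import Optional


def evaluation_name(
    agent_run_id: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Build an evaluation name, embedding the run id in historical mode."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    if agent_run_id is None:
        # Fresh mode: no single run id to attach (assumed shape).
        return f"daily-insight-{stamp}"
    return f"daily-insight-{agent_run_id}-{stamp}"
```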

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uccesses

- hadolint DL3008 on Dockerfile: add ignore directive for the apt-version
  pin rule. Pinning git to a specific apt version forces a Dockerfile edit
  on every base-image refresh, which is brittle for a transient dependency
  that only exists until we move back to a PyPI railtracks release.
- black on src/config/env.py: wrap the over-88-char REQUIRED_ENV_VARS list
  comprehension across multiple lines as black prefers.
- black on src/services/evaluation_service.py: the inverse — collapse the
  asyncio.to_thread call and the logger.info call back to single lines now
  that they fit under 88 chars.
- flake8 E501 on src/agents/insight_agent.py: split the 163-char prompt
  line at sentence-ish boundaries. Newlines inside a paragraph are
  semantically equivalent to spaces for the LLM.
- flake8 E501 on src/controllers/api.py: wrap the long pydantic Field
  description= strings using Python implicit string concatenation inside
  the description=(...) parens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt placed the directive earlier in the comment block,
with two more explanatory comment lines between it and the RUN
instruction. Hadolint binds the ignore comment to the *next instruction*
— intervening comments break that binding, so the warning kept firing
in CI.

Reorder so the ignore comment is the final comment before RUN; the
explanation stays above it in the same block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
railtracks 1.4.1 (PyPI, 2026-06-15) includes the ToolUseEvaluator
single-session fix from commit 4e4ed57 (2026-06-11) that we'd been
pinning. Reverting to a version range:

- pyproject.toml: drop the git+url with #subdirectory hint, restore
  the simple `railtracks[visual]>=1.4.1` line.
- Dockerfile: drop the apt-get install git layer and its hadolint
  DL3008 ignore directive — git was only there to clone railtracks
  from source during the pin, no longer needed for a PyPI wheel.

Smaller image, faster builds, simpler dep graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mbrailtown


mbrailtown commented Jun 17, 2026

Python/daily-insight/src/controllers/api.py could provide the basis for a kickstart example, so I would appreciate feedback on that file in particular, @Amir-R25.

The Conductr Hosted endpoint now POSTs { agent_run_ids: [...] } so the
agent has to accept a list. Historical mode now fetches all named runs
in one get_agent_runs call and scores them under a single
evaluation_name; the batch name is daily-insight-batch<N>-<ts> when
there's more than one id (single-id keeps the existing id-in-name form).
Adds ConfigDict(extra="forbid") so misspelled keys (e.g. the old singular
agent_run_id) return 422 instead of silently falling back to Fresh mode.
Caps agent_run_ids at 1..10 ids and adds a model_validator that rejects
bodies setting both sample_size and agent_run_ids — uses
model_fields_set so the default sample_size=1 doesn't trip the XOR when
only agent_run_ids is provided.
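
The PR implements these rules with pydantic. The dependency-free sketch below
applies the same checks to a raw dict, to show why the XOR has to look at
which keys were actually sent rather than at default values. The function name
and return shape are hypothetical, and the sample_size range check is left
out.

```python
from typing import Any, Dict

ALLOWED_KEYS = {"sample_size", "agent_run_ids"}


def parse_eval_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Validate an evaluation request body and pick the mode."""
    # Reject unknown keys (the role of extra="forbid"), so the old singular
    # agent_run_id fails instead of silently selecting fresh mode.
    unknown = set(body) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    # XOR on keys the caller sent (the role of model_fields_set); the default
    # sample_size=1 is applied later, so it cannot trip this check.
    if "sample_size" in body and "agent_run_ids" in body:
        raise ValueError("set either sample_size or agent_run_ids, not both")
    if "agent_run_ids" in body:
        ids = body["agent_run_ids"]
        if not isinstance(ids, list) or not 1 <= len(ids) <= 10:
            raise ValueError("agent_run_ids must be a list of 1 to 10 ids")
        return {"mode": "historical", "agent_run_ids": ids}
    return {"mode": "fresh", "sample_size": body.get("sample_size", 1)}
```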

Plus a black reformat in evaluation_service.py: one stray blank line
removed from the import block, one f-string assignment unwrapped from
parens.
mbrailtown merged commit cc8608f into main on Jun 19, 2026
13 checks passed
mbrailtown deleted the matthew/csharp-example branch on Jun 19, 2026 at 20:50
3 participants