Skip to content

Agent Skills Updates From Live Trials#1493

Merged
chadvoegele merged 38 commits into
mainfrom
cvoegele/agent_evals
May 21, 2026
Merged

Agent Skills Updates From Live Trials#1493
chadvoegele merged 38 commits into
mainfrom
cvoegele/agent_evals

Conversation

@chadvoegele
Copy link
Copy Markdown
Contributor

@chadvoegele chadvoegele commented May 14, 2026

What does this PR do?

Type of change: bug fix

Usage

Ask Claude Code:

Quantize `mistralai/Mistral-Medium-3.5-128B` to NVFP4 using the ModelOpt NVFP4 experts-only recipe.

Run on $cluster

Evaluate the resulting quantized checkpoint on:
- GPQA Diamond AA v3
- SciCode AA v2

Complete the quantization and evaluation workflow end to end. Prompt when you require user input, otherwise keep going.

Testing

I'm running the full loop with the above prompt, and iterating on skills to resolve undesired agent behavior.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: ✅
  • Did you get Claude approval on this PR?: TODO

Additional Information

See trials log for details.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added SLURM Quality of Service (QoS) configuration support for job submission
    • Introduced 8 new evaluation task recipes (AIME 2025, GPQA, IFBench, LiveCodeBench, SciCode, AA-LCR, HLE-AA, MMMU-Pro, tau2_bench)
    • Enhanced job monitoring with continuous polling-based tracking
  • Documentation

    • Restructured evaluation workflow with explicit dry-run, canary, and full-run validation stages
    • Expanded PTQ validation with mandatory pre-deployment verification gates
    • Updated remote cluster selection and quantization detection guidance
  • Tests

    • Updated evaluation test expectations to reflect refined workflow stages

Review Change Stack

@chadvoegele chadvoegele requested a review from a team as a code owner May 14, 2026 16:43
@chadvoegele chadvoegele requested a review from jingyu-ml May 14, 2026 16:43
@chadvoegele chadvoegele changed the title Agent Skills Updates From Live Trials WIP: Agent Skills Updates From Live Trials May 14, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR enhances evaluation workflows with improved quantization detection and baseline comparison requirements, standardizes task configurations into Markdown-based recipes with Python extraction helpers, introduces PTQ post-quantization validation gates, adds job monitoring via durable polling, and enables SLURM Quality of Service configuration support.

Changes

Evaluation Workflow Enhancement

Layer / File(s) Summary
Quantization Detection, Deployment Preference, and Baseline Comparison
.claude/skills/evaluation/SKILL.md
Rewrites ModelOpt quantization auto-detection to check config.json first, fall back to hf_quant_config.json, conditionally apply vLLM --quantization flags, and prefer vLLM for NEL self-deployment. Adds baseline-comparison preflight requiring baseline identification and matching infrastructure setup before accepting quantized score deltas.
Task Configuration Refinement and Container Registry Authentication
.claude/skills/evaluation/SKILL.md
Updates task confirmation to prefer recipes/tasks/ reference documents and re-display the updated task list before final confirmation. Expands SLURM registry-auth into a decision flow for public vs. access-restricted images with conditional credential verification and single-retry limits.
Execution Gating, Step Restructuring, and Result Verification
.claude/skills/evaluation/SKILL.md, .claude/skills/launching-evals/references/analyze-results.md
Restructures workflow to add Step 7.5 (registry auth), split Step 8 into 8.1/8.2/8.3 (dry-run, canary, full evaluation), and introduce Steps 9–10 for run verification and baseline-vs-quantized comparability. Strengthens validation with expanded NEL timeout/resume behavior guidance and broadened diagnostics grep patterns.
Evaluation Test Expectations Updates
.claude/skills/evaluation/tests/evals.json
Updates three evaluation scenarios to reflect new quantization detection logic, conditional vLLM --quantization flag behavior, and enhanced evaluation flow expectations including canary log validation and explicit baseline-vs-quantized comparability verification.

Task Recipe Standardization and Configuration Updates

Layer / File(s) Summary
Core Benchmark Task Recipes
.claude/skills/evaluation/recipes/tasks/aime2025.md, gpqa.md, mmlu_pro.md, scicode.md, scicode.yaml (removed)
Creates comprehensive Markdown task recipes with metadata, YAML fragments for evaluation.tasks, and Python-based score extraction logic. AIME 2025 defines 16-repeat symbolic reasoning; GPQA implements 16-sample pass@1[avg-of-N] with stderr; SciCode includes deployment constraints and dual-group score extraction. Removes corresponding YAML recipe files.
Additional Benchmark Task Recipes
.claude/skills/evaluation/recipes/tasks/ifbench.md, livecodebench.md, mmmu_pro.md, aa_lcr.md, hle_aa_v2.md, tau2_bench_telecom.md, and removed *.yaml files
Adds recipes for IFBench (8-repeat prompt_loose_accuracy), LiveCodeBench v6 (dataset split/retry config), MMMU-Pro (multimodal symbolic_correct), AA-LCR (judge-backed long-context), HLE-AA-v2 (judge evaluation), and Tau2-Bench Telecom (user-simulator endpoint). Removes corresponding YAML configurations.
Example Evaluation Configuration and Environment Updates
.claude/skills/evaluation/recipes/examples/example_eval.yaml, env.example
Updates example_eval.yaml to describe task references as providing benchmark requirements and YAML fragments for composition, changes "Smoke test" to "Canary", and corrects MMLU-Pro identifier to nemo_skills.ns_mmlu_pro. Revises env.example to generalize JUDGE_API_KEY for multiple judge-backed tasks and introduce USER_API_KEY for tau2_bench_telecom.

PTQ Validation Framework

Layer / File(s) Summary
PTQ Preflight Checks and Post-Quantization Validation Gate
.claude/skills/ptq/SKILL.md
Adds preflight step to inspect recipe coverage patterns before calibration and flag partial de-quantization risks. Transforms post-quantization validation into a mandatory pre-deployment gate directing to checkpoint-validation.md with required reporting of size ratios, precision counts, and metadata diffs.
Checkpoint Validation Framework and Scripts
.claude/skills/ptq/references/checkpoint-validation.md
Expands checkpoint validation into explicit pre-deployment gate with size/bits reduction verification (blocking compression recipes with ratio >= 1.0), linear-layer quantization coverage validation (detecting config mismatches via precision counts), and non-weight metadata consistency checking. Provides bash-invoked Python size-check snippet and comprehensive layer coverage script reading safetensors index and hf_quant_config.json.
PTQ Test Scenario with Validation Gating
.claude/skills/ptq/tests.json
Adds evaluation scenario for FP8→partial NVFP4 quantization with mandatory checkpoint-validation gating including size/ratio reporting, layer precision coverage checks, and conditional stopping before eval submission if validation fails.

Monitoring, Baseline Verification, and Infrastructure

Layer / File(s) Summary
Durable Job Monitor Loop with Expanded Terminal States
.claude/skills/monitor/SKILL.md
Replaces 15-minute recurring cron with durable monitor loop that continuously polls .claude/active_jobs.json, checks each registered job until terminal states, emits state-change events, removes terminal jobs, and exits when registry is empty. Expands terminal state set and updates raw SLURM termination detection from squeue to sacct-based check with filtering rules.
Baseline Comparison Verification and Multi-Cluster Selection
.claude/skills/launching-evals/references/analyze-results.md, .claude/skills/common/environment-setup.md
Adds checklist item requiring baseline comparison for quantized runs against matching baseline using consistent benchmark/task/infrastructure. Updates multi-cluster selection to require explicit user prompt instead of silently defaulting to default_cluster.
SLURM Quality of Service Configuration Support
tools/launcher/slurm_config.py, tools/launcher/core.py
Adds qos field to SlurmConfig dataclass, extends slurm_factory to accept qos parameter from SLURM_QOS environment variable, and passes qos into run.SlurmExecutor construction, enabling Quality of Service configuration for submitted jobs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • ChenhanYu
  • kaix-nv
  • meenchen
  • mxinO
  • shengliangxu
  • realAsma

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error PR adds tools/launcher/core.py with # nosec comments (B404, B603, B607) without justification or codeowner approval as required by SECURITY.md. Replace # nosec with inline comments explaining subprocess safety (hardcoded git commands, no shell=True). Request @NVIDIA/modelopt-setup-codeowners review with justification in PR description.
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly identifies the pull request as an update to agent skills based on live trial iterations, directly matching the primary objective of incorporating learned improvements from testing.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cvoegele/agent_evals

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/monitor/SKILL.md:
- Around line 54-59: The documentation/logic currently enforces "report only
state changes" universally; update it so that user-initiated checks (e.g., when
the user explicitly asks "check status") return the full current status for each
job rather than only deltas—leave monitor-driven checks to still compare against
`last_status` in `.claude/active_jobs.json` and report only changes. Adjust the
wording and any associated pseudocode/implementation notes to branch on the
trigger type ("monitor output" vs "user-initiated") and on user-initiated flows
ensure you read the registry, check each job, return current state for each job,
and then update `last_status` accordingly. Ensure references to `last_status`
and `.claude/active_jobs.json` remain consistent so maintainers can find and
implement the conditional behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 27a4dd3f-7246-45c7-8699-ec80a318c50a

📥 Commits

Reviewing files that changed from the base of the PR and between 229ba61 and ddf58b5.

📒 Files selected for processing (11)
  • .claude/skills/common/environment-setup.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/tests/evals.json
  • .claude/skills/launching-evals/references/analyze-results.md
  • .claude/skills/monitor/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • .claude/skills/ptq/references/checkpoint-validation.md
  • .gitignore
  • modelopt/torch/quantization/model_quant.py
  • tools/launcher/core.py
  • tools/launcher/slurm_config.py

Comment thread .claude/skills/monitor/SKILL.md Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-21 22:17 UTC

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.20%. Comparing base (c9098b6) to head (b06caec).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
+ Coverage   76.75%   77.20%   +0.45%     
==========================================
  Files         476      476              
  Lines       51811    51811              
==========================================
+ Hits        39767    40001     +234     
+ Misses      12044    11810     -234     
Flag Coverage Δ
examples 40.09% <ø> (-0.64%) ⬇️
gpu 60.10% <ø> (+<0.01%) ⬆️
regression 15.21% <ø> (+0.07%) ⬆️
unit 52.63% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.claude/skills/evaluation/SKILL.md (1)

203-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix task snippet schema mismatch (tasks vs evaluation.tasks).

The Step 5 example contradicts earlier instructions to edit evaluation.tasks. Keeping tasks: here can make generated configs invalid or ignored.

Suggested fix
-  tasks:
-    - name: <task>
-      nemo_evaluator_config:
-        config:
-          params:
-            temperature: <value>
-            max_new_tokens: <value>
-            ...
+  evaluation:
+    tasks:
+      - name: <task>
+        nemo_evaluator_config:
+          config:
+            params:
+              temperature: <value>
+              max_new_tokens: <value>
+              ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md around lines 203 - 212, The YAML example
uses a top-level "tasks:" key which conflicts with the expected
"evaluation.tasks" namespace; update the snippet so the tasks list is nested
under "evaluation.tasks" (e.g., replace "tasks:" with "evaluation.tasks:" and
keep the existing task entries like "name" and "nemo_evaluator_config" intact),
and verify any references to "tasks" in the surrounding text or examples are
corrected to "evaluation.tasks" to keep schema consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/recipes/tasks/gpqa.md:
- Around line 73-75: The extractor extract_gpqa_score currently can raise
IndexError when called without args and uses raw open(...) with yaml.safe_load;
add basic argument validation (ensure path is provided and repeats, if given, is
an int) and use a safe file context (with open(path, "r") as f) and
yaml.safe_load(f) while catching FileNotFoundError and yaml.YAMLError and
re-raising a clear ValueError; also validate that the expected keys exist in the
loaded dict (results -> groups -> gpqa -> metrics) and raise ValueError if
missing. Apply the same validation and safe-loading pattern to the similar
extractor function around lines 94-97 to ensure consistent error handling.

In @.claude/skills/evaluation/recipes/tasks/scicode.md:
- Around line 105-108: The extract_score function currently assumes a valid path
and opens the YAML without a context manager; fix it by validating the path
argument (raise ValueError or return a clear error if path is falsy), check the
file exists (catch FileNotFoundError), and read the YAML using a context manager
(with open(path) as f: data = yaml.safe_load(f)); then safely access
TASKS[group] and data["results"]["groups"][group"]["metrics"] (use .get or catch
KeyError to provide a clearer error). Apply the same changes to the other
identical snippet that reads the YAML and accesses metrics.

---

Outside diff comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 203-212: The YAML example uses a top-level "tasks:" key which
conflicts with the expected "evaluation.tasks" namespace; update the snippet so
the tasks list is nested under "evaluation.tasks" (e.g., replace "tasks:" with
"evaluation.tasks:" and keep the existing task entries like "name" and
"nemo_evaluator_config" intact), and verify any references to "tasks" in the
surrounding text or examples are corrected to "evaluation.tasks" to keep schema
consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1c2365f0-8719-4c16-9dd8-7640d57849f7

📥 Commits

Reviewing files that changed from the base of the PR and between ddf58b5 and 0633922.

📒 Files selected for processing (18)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.md
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.md
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.md
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
  • .claude/skills/evaluation/tests/evals.json
  • .claude/skills/ptq/SKILL.md
  • .claude/skills/ptq/references/checkpoint-validation.md
  • .claude/skills/ptq/tests.json
💤 Files with no reviewable changes (6)
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
✅ Files skipped from review due to trivial changes (3)
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .claude/skills/evaluation/tests/evals.json

Comment thread .claude/skills/evaluation/recipes/tasks/gpqa.md Outdated
Comment on lines +105 to +108
def extract_score(path, group="scicode"):
spec = TASKS[group]
data = yaml.safe_load(open(path))
metrics = data["results"]["groups"][group]["metrics"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden CLI/file handling in the score extractor snippet.

The snippet crashes with IndexError when no path is passed, and it opens the YAML file without a context manager.

Suggested fix
 def extract_score(path, group="scicode"):
     spec = TASKS[group]
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_score.py <results.yaml> [scicode|gpqa]")
+    path = sys.argv[1]
     group = sys.argv[2] if len(sys.argv) > 2 else "scicode"
     print(extract_score(path, group))

Also applies to: 135-138

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/recipes/tasks/scicode.md around lines 105 - 108,
The extract_score function currently assumes a valid path and opens the YAML
without a context manager; fix it by validating the path argument (raise
ValueError or return a clear error if path is falsy), check the file exists
(catch FileNotFoundError), and read the YAML using a context manager (with
open(path) as f: data = yaml.safe_load(f)); then safely access TASKS[group] and
data["results"]["groups"][group"]["metrics"] (use .get or catch KeyError to
provide a clearer error). Apply the same changes to the other identical snippet
that reads the YAML and accesses metrics.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 260-267: The preflight check currently tests for any credential
entry using the generic grep command; change it to verify credentials per
registry host used by the selected images by searching for the specific registry
hostnames (not just any "machine" entry) in ~/.config/enroot/.credentials.
Update the documented check (the grep invocation shown) to demonstrate matching
the actual registry host(s) (e.g., loop or run grep for each selected image's
registry host) so the preflight returns true only when credentials exist for
those specific hosts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 466a2d74-5bf4-40f5-a94e-78d17d191a59

📥 Commits

Reviewing files that changed from the base of the PR and between 0633922 and 29bbb20.

📒 Files selected for processing (7)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/tasks/aa_lcr.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.md
  • .claude/skills/evaluation/recipes/tasks/hle_aa.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
✅ Files skipped from review due to trivial changes (6)
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
  • .claude/skills/evaluation/recipes/tasks/aa_lcr.md
  • .claude/skills/evaluation/recipes/tasks/hle_aa.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.md

Comment thread .claude/skills/evaluation/SKILL.md
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md (1)

38-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete the Score Extraction section.

The Score Extraction section contains only a header with no content. Users need guidance on how to extract and interpret the pass_1 metric for this task. Please add content similar to other task recipes (e.g., AIME 2025, GPQA) that explains which metric to use and how to extract it from the evaluation results.

Do you want me to help draft the score extraction guidance based on the primary metric pass_1 and the tau2_bench harness documentation?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md around lines
38 - 39, The Score Extraction section is empty—add a short paragraph stating
that the primary metric is pass_1 and instruct users to extract the pass_1 value
from the tau2_bench evaluation results (e.g., from the metrics or results JSON
under the "pass_1" key), report it as a percentage (multiply by 100 if the
harness returns a fraction), and include any aggregation used (mean across seeds
or runs). Reference the task name tau2_bench_telecom and the harness docs for
exact JSON field names and show that the reported score should be the aggregated
pass_1 value used for comparisons.
♻️ Duplicate comments (1)
.claude/skills/evaluation/SKILL.md (1)

260-262: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make credential verification registry-specific.

The grep command checks for any credential entry, not credentials for the specific registry host(s) used by the selected images. This can pass the preflight but still fail image pulls if credentials for the required registries are missing.

📝 Suggested fix
-ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "awk '/^\s*machine\s+/ {print $2}' ~/.config/enroot/.credentials 2>/dev/null"
+# Verify the required registry host(s) from selected images are present (e.g., docker.io, nvcr.io, registry.internal).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md around lines 260 - 262, The current
preflight uses the command string ssh <host> "grep -E '^\s*machine\s+'
~/.config/enroot/.credentials 2>/dev/null" which only checks for any credential
entry; change it to verify registry-specific credentials by extracting registry
hostnames from the selected images and running grep for each host (e.g., grep -E
"^\s*machine\s+<registryHost>\b" ~/.config/enroot/.credentials) or equivalent
per-host checks over SSH; update the code that emits the ssh grep command in
SKILL.md to iterate the image registry list and fail the preflight if any
registryHost lookup returns no match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 59-68: Update the ambiguous phrase "Do Step 3, Step 4, then Step
7.5/8" in the SKILL.md shortcut section to explicitly state which Step 8
substeps apply; replace it with something like "Do Step 3, Step 4, Step 7.5,
then Step 8 (complete the applicable substeps 8.1/8.2/8.3)" or "Do Step 3, Step
4, Step 7.5, then Step 8.1–8.3 as applicable" so readers aren’t confused by the
restructured Step 8; locate the exact string "Do Step 3, Step 4, then Step
7.5/8" and update it accordingly.

---

Outside diff comments:
In @.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md:
- Around line 38-39: The Score Extraction section is empty—add a short paragraph
stating that the primary metric is pass_1 and instruct users to extract the
pass_1 value from the tau2_bench evaluation results (e.g., from the metrics or
results JSON under the "pass_1" key), report it as a percentage (multiply by 100
if the harness returns a fraction), and include any aggregation used (mean
across seeds or runs). Reference the task name tau2_bench_telecom and the
harness docs for exact JSON field names and show that the reported score should
be the aggregated pass_1 value used for comparisons.

---

Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 260-262: The current preflight uses the command string ssh <host>
"grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null" which only
checks for any credential entry; change it to verify registry-specific
credentials by extracting registry hostnames from the selected images and
running grep for each host (e.g., grep -E "^\s*machine\s+<registryHost>\b"
~/.config/enroot/.credentials) or equivalent per-host checks over SSH; update
the code that emits the ssh grep command in SKILL.md to iterate the image
registry list and fail the preflight if any registryHost lookup returns no
match.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38a710dc-b0aa-4860-99a8-27d5da2f52cc

📥 Commits

Reviewing files that changed from the base of the PR and between 29bbb20 and 2580a58.

📒 Files selected for processing (13)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/aa_lcr.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.md
  • .claude/skills/evaluation/recipes/tasks/gpqa.md
  • .claude/skills/evaluation/recipes/tasks/hle_aa_v2.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/livecodebench.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
  • .claude/skills/evaluation/recipes/tasks/mmmu_pro.md
  • .claude/skills/evaluation/recipes/tasks/scicode.md
  • .claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md
✅ Files skipped from review due to trivial changes (6)
  • .claude/skills/evaluation/recipes/tasks/mmmu_pro.md
  • .claude/skills/evaluation/recipes/tasks/hle_aa_v2.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
  • .claude/skills/evaluation/recipes/tasks/livecodebench.md
  • .claude/skills/evaluation/recipes/tasks/aa_lcr.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/tasks/gpqa.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.md

Comment thread .claude/skills/evaluation/SKILL.md
@chadvoegele chadvoegele force-pushed the cvoegele/agent_evals branch 3 times, most recently from f2b92ba to 32c2072 Compare May 20, 2026 14:45
Copy link
Copy Markdown
Collaborator

@shengliangxu shengliangxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Prefer the `pass@1[avg-of-N]` metric matching the configured repeat count.
If the repeat count is unknown, use the highest available `avg-of-N`.

```python
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq why we still need these python code here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to help guide the agent to pull the right score from the results.yml. There is a range of possible instructions:

  1. "extract the score"
  2. "extract the pass@1[avg-of-N]
  3. "run yq '.results.groups.aime25.metrics."pass@1[avg-of-16]".scores.symbolic_correct.value' results.yml"
  4. "run the python snippet below..."

The first instruction is far too vague. The agent will write something but probably not what you want.

This is from a recent run. Even with the Python snippet in the evaluation skill recipe, it ignored it and made it's own parser:

python3 -c "
import yaml
r = yaml.safe_load(open(\"$BASE/results.yml\"))
m = r.get(\"results\",{}).get(\"groups\",{}).get(\"mmlu-pro\",{}).get(\"metrics\",{}).get(\"pass@1\",{}).get(\"scores\",{})
print(\"symbolic_correct:\", m.get(\"symbolic_correct\",{}).get(\"value\"))
print(\"n:\", m.get(\"symbolic_correct\",{}).get(\"stats\",{}).get(\"count\"))
"'

Then I instructed the agent with Please reference the score extraction snippet in skills/evaluation/recipes/tasks/ when pullings scores from the results.yml for each dataset.

and it did

def extract_aime2025_score(path, repeats=None):
    data = yaml.safe_load(open(path))
    metrics = data["results"]["groups"]["aime25"]["metrics"]
    name = select_metric(metrics, repeats)
    scores = metrics[name]["scores"]
    accuracy = scores["symbolic_correct"]["value"]
    se = scores.get("symbolic_correct_statistics_std_err_across_runs", {}).get("value")
    return {"group": "aime25", "metric": name, "score_key": "symbolic_correct",
            "accuracy": accuracy,
            "stderr": se * 100 if se is not None else None}

I'm err-ing on the side of over-specifying because I want the agent to pull the exact score I want. If I can get it reliably read the skill that's there, I believe it will.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this make adding new evals harder because then we need to add a code section.

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Operator asked to prioritize landing this. Production code changes are small and look fine (qos is an optional, backward-compatible field on SlurmConfig plumbed through build_slurm_executor; default None so existing callers are unaffected). The bulk of the PR is iteration on existing .claude/skills/... agent docs, which doesn't introduce a new subsystem (design review protocol doesn't really fire — these skills already exist), and a human reviewer (shengliangxu) has already LGTM'd.

A few things still warrant a human eye before merge rather than an auto-approve:

  • PR is still marked WIP. Title is "WIP: Agent Skills Updates From Live Trials" and the PR body has unchecked TODOs ("Did you update Changelog: TODO", "Did you get Claude approval on this PR: TODO"). The author's own description says they're still iterating ("I'm running the full loop with the above prompt, and iterating on skills to resolve undesired agent behavior"). Worth confirming with @chadvoegele that this is actually ready to land vs. still being iterated on.
  • 💬 Unanswered maintainer question from @cjluo-nv on .claude/skills/evaluation/recipes/tasks/aime2025.md:51: "qq why we still need these python code here?" — this is about the score-extraction Python snippets that now appear inline in every task .md (gpqa, aime2025, scicode, mmlu_pro, mmmu_pro, ns_hle_aa, aa_lcr). Same question implicitly applies to all of them. No reply yet; should be resolved before merge.
  • tools/launcher/slurm_config.py style drift. The file now uses Optional[str] + from typing import Optional + a # ruff: noqa: UP045 suppression, which is inconsistent with the rest of the codebase using PEP-604 str | None. If there's a concrete reason (e.g. nemo_run's @run.cli.factory decorator can't introspect X | None on this Python version), a one-line comment explaining it would help future readers; otherwise the cleaner fix is to use str | None and drop both the import and the noqa.
  • Minor, unaddressed CodeRabbit nits in the skill snippets (low impact since these are agent reference docs, not executed library code, but easy fixes): extract_gpqa_score / extract_score in recipes/tasks/gpqa.md and recipes/tasks/scicode.md use bare open(path) instead of a with context manager and crash with IndexError if invoked without args.

No blocking correctness issues in the production code itself.


"""Slurm configuration and factory for the ModelOpt Launcher."""

# ruff: noqa: UP045
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

Style: the rest of the codebase uses PEP-604 str | None rather than Optional[str]. Adding from typing import Optional plus a # ruff: noqa: UP045 suppression here is inconsistent. If nemo_run's @run.cli.factory / @run.autoconvert decorators can't introspect X | None annotations and that's why you reverted, please leave a one-line comment to that effect — otherwise switching to str | None and dropping both the import and the noqa would match the rest of the project.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a incompatibility between ruff and NeMo run launcher. Codex picked this as a happy middle ground. I can add a comment to reflect that.

Prefer the `pass@1[avg-of-N]` metric matching the configured repeat count.
If the repeat count is unknown, use the highest available `avg-of-N`.

```python
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

Following up on @cjluo-nv's question ("qq why we still need these python code here?"): the same inlined extract_*_score Python snippets now appear in gpqa.md, scicode.md, mmlu_pro.md, mmmu_pro.md, ns_hle_aa.md, and aa_lcr.md as well. Could you reply on the PR with the rationale (e.g. agents copy-paste these into a session vs. an importable helper module under recipes/)? Worth resolving before merge so it's not re-asked on every future task recipe added.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered above.

@chadvoegele chadvoegele changed the title WIP: Agent Skills Updates From Live Trials Agent Skills Updates From Live Trials May 21, 2026
"score_key": "subtask_accuracy",
"stderr_scale": "subtasks",
},
"gpqa": {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why gpqa is here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's gone now.

@Edwardf0t1
Copy link
Copy Markdown
Contributor

Did a design-doc alignment pass on this PR (vs the ModelOpt Agent Skills Design). Strong overall alignment — the PTQ pre-flight sanity check, post-quant validation as a required gate, NEL timeout/resume framing, per-session monitor registry, and the recipe-yaml → reference-md restructure are all faithful applications of the doc's principles (lazy-loading via references, domain-skill verify-once gate, tool-affordance shape).

Two design-level questions that feel worth a decision rather than a stealth merge:

1. debugging-playbooks as a new top-level skill — intentional?

The doc lists shipped skills as ptq / deployment / evaluation / monitor / launching-evals / accessing-mlflow / debug + PLANNED compare-results. debugging-playbooks isn't there. The skill itself is well-designed (literal-symptom-keyed playbooks, explicit "don't reach for this on the first guess" negative trigger), and the cross-domain nature is real (vLLM AOT cache poisoning hits both deployment and evaluation runs). But the design doc explicitly flags "Reference file sprawl" as a Risk Callout, and adding a new top-level skill with one playbook today increases surface area against that exact risk.

Two reasonable resolutions:

  • (a) Keep it as a top-level skill, update the design doc to document it (why it's intent-shaped, not a reference under debug or per-domain).
  • (b) Fold it into debug/references/playbooks/ and let debug route to symptom-keyed playbooks when it's invoked.

Either works — want to align on which before this hardens.

2. Step 10 (Verify baseline-vs-quantized comparability) is doing compare-results' job

The design doc has compare-results as PLANNED — intent-shaped, cross-run. This PR puts comparability logic into evaluation/SKILL.md (which is also growing — +143/−48 — and starting to drift from the ~80–120 line lazy-loading target).

Fork in the road:

  • (a) Revise the design — evaluation owns single-run verification + comparability, compare-results gets narrower scope or gets dropped.
  • (b) Extract Step 10 into compare-results and ship it now (would also help shrink evaluation/SKILL.md).
  • (c) Keep Step 10 in evaluation, mark as stopgap, ship compare-results later.

My weak lean: (b) if you have appetite, (c) as a pragmatic interim. (a) ossifies evaluation into a "do everything" skill which the doc's hybrid-decomposition section explicitly argued against.


Mechanical items (not blockers, can be follow-ups):

  • Negative triggers — design doc has explicit suggested frontmatter text for ptq / monitor / debug / accessing-mlflow; this PR touches several of those skills' frontmatters but doesn't pick them up.
  • evaluation/SKILL.md lazy-loading drift — extract the NEL timeout/resume section + Step 9 / 10 detail into references; keep SKILL.md as a dispatch tree.
  • Test format — design recommends tests/evals.json directory layout uniformly; PR keeps the mix (ptq/tests.json flat + evaluation/tests/evals.json dir). Either migrate or punt explicitly with a tracking issue.

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
chadvoegele and others added 13 commits May 21, 2026 13:37
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
@chadvoegele chadvoegele force-pushed the cvoegele/agent_evals branch from 377e6c9 to 6d39913 Compare May 21, 2026 18:37
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review focused on the per-task .md files converted from YAML.

What's improved since last pass:

  • 💬 The full inline extract_*_score Python functions that prompted @cjluo-nv's "qq why we still need these python code here?" have been replaced with short dotted YAML score paths (e.g. results.groups.gpqa.metrics."pass@1[avg-of-N]".scores.symbolic_correct.value). Adding new evals no longer requires hand-writing a parser, which addresses the follow-up "will this make adding new evals harder" concern — good.
  • 💬 tools/launcher/slurm_config.py now has a top-of-file comment explaining the nemo_run / PEP-604 incompatibility, as @chadvoegele said he'd add. The Optional[...] + # ruff: noqa: UP045 is now self-documenting.
  • 💬 scicode "why is gpqa here?" — confirmed gone.

Still warrants a human eye before merge:

  • Likely bug in gpqa.md (n_samples: 32): every other NeMo-Skills ns_* task in this PR (aime2025, ifbench, livecodebench, mmlu_pro, scicode) and the in-repo examples/pruning/minitron/.../nemo_evaluator.yaml use num_repeats: for ns_gpqa. The deleted gpqa.yaml also used num_repeats: 5, and the companion recipes/examples/example_eval.yaml in this PR uses num_repeats: 32. Switching only the .md fragment to n_samples: 32 looks like a copy-paste from the AA-LCR fragment and will silently fall back to the harness default repeat count instead of running 32 repeats. Inline comment below.
  • tau2_bench_telecom.md Score Extraction is still empty — header line followed by EOF, no content. CodeRabbit flagged this last round and it's unresolved. Inline comment below.
  • SKILL.md Step 9 references a Python snippet that no longer exists. Line says "use the Score Extraction Python snippet from the matching task reference in recipes/tasks/<task>.md" — but the conversion just removed those Python snippets in favor of YAML paths. Worth updating the SKILL.md wording so the agent looks for the dotted path, not a Python function. Inline comment below.
  • 💬 PR is still WIP-titled with unchecked PR-body TODOs ("Did you update Changelog: TODO", "Did you get Claude approval on this PR: TODO"); @chadvoegele's own description still says "I'm running the full loop ... iterating on skills". Worth confirming readiness with the author before landing.

No issues with the production code (tools/launcher/{core,slurm_config}.py) — qos plumbing is small, optional, backward-compatible, and matches the existing SlurmConfig shape.

- name: ns_gpqa
nemo_evaluator_config:
config:
params:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

Likely bug: this should be num_repeats: 32, not n_samples: 32. Every other NeMo-Skills ns_* fragment in this PR (aime2025/ifbench/livecodebench/mmlu_pro/scicode) uses num_repeats: under extra:, and the existing in-repo examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml configures ns_gpqa with num_repeats: 8. The companion recipes/examples/example_eval.yaml in this PR also uses num_repeats: 32 for ns_gpqa, and the deleted gpqa.yaml used num_repeats: 5. As written, the harness will not see a recognized 32-repeats knob here and will silently fall back to its default — and the score-extraction section just below assumes pass@1[avg-of-N] exists, which only shows up when repeats actually ran. (Note: AA-LCR uses n_samples: 3 legitimately because it's a different harness/container; that's not the precedent for ns_gpqa.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

api_key: USER_API_KEY
```

## Score Extraction
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

## Score Extraction is still an empty section (header followed by EOF). CodeRabbit flagged this last round. SKILL.md Step 9 instructs the agent to harvest scores using the matching task reference's Score Extraction guidance, so leaving this empty defeats the new gating step for tau2_bench_telecom runs. Either fill in the dotted path (e.g. results.groups.tau2_bench_telecom.metrics.pass_1.scores.<...> — please verify against an actual results.yml) or drop the empty header so it doesn't read as a TODO.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marked as TODO

Comment thread .claude/skills/evaluation/SKILL.md Outdated
4. Confirm the number of evaluated/scored samples matches for each task and split.
5. Confirm the same accuracy metric/score field is used for the baseline and quantized comparison.

Report the comparability summary alongside the score: prompt/template status, generation-setting status, sample-count status, reasoning-handling status, and the exact score field used. If any item differs, either rerun with matched settings or label the result as not an apples-to-apples quantization comparison.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

This sentence says "use the Score Extraction Python snippet from the matching task reference", but the converted recipes/tasks/*.md files no longer contain Python snippets — they document a dotted YAML path into results.yml. Recommend updating to something like "use the Score Extraction field path documented in the matching task reference in recipes/tasks/<task>.md" so the agent doesn't go hunting for a function that isn't there. (Same wording change probably worth doing wherever else the SKILL refers back to the Python snippet form.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
@@ -0,0 +1,22 @@
---
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a skill or just experience cache? If this is an experience cache, can we avoid promoting it to be a skill?

@@ -0,0 +1,22 @@
---
name: debugging-playbooks
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this name might be fused with debugger skill exits before this PR

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review. Production code (tools/launcher/{core,slurm_config}.py) is small, optional, backward-compatible, and looks fine. Most prior bot/reviewer items are addressed:

  • 💬 gpqa.md n_samplesnum_repeats: 32 — fixed (commit ack'd by author).
  • 💬 scicode.md stray "gpqa" — fixed ("It's gone now").
  • 💬 SKILL.md Step 9 wording — updated to reference the Score Extraction section (dotted YAML path) instead of a Python snippet that no longer exists.
  • 💬 slurm_config.py Optional[str] + # ruff: noqa: UP045 — author added a top-of-file comment ("nemo_run's CLI parser cannot introspect PEP 604 optional annotations here").
  • 💬 cjluo-nv's "qq why we still need these python code here?" — resolved by replacing inline extract_*_score Python with short dotted YAML score paths in each recipes/tasks/*.md.

Still warrants a human eye before merge:

  • Unresolved design-review questions from @Edwardf0t1 (alignment vs. the ModelOpt Agent Skills Design doc). debugging-playbooks as a new top-level skill isn't in the shipped-skills list and the doc explicitly flags "reference file sprawl" as a risk; @cjluo-nv separately asked "Is this really a skill or just experience cache?" and noted the name may collide with the existing debug skill. Neither has been replied to. Picking (a) keep top-level + update design doc, vs. (b) fold into debug/references/playbooks/, is a design call worth a human sign-off.
  • Step 10 / compare-results scope vs. evaluation skill scopecompare-results/ is added in this PR but evaluation/SKILL.md also still grows comparability/Step-9 logic (+143/−48). @Edwardf0t1 flagged that evaluation is drifting past the lazy-loading target. Worth a human decision on the boundary.
  • recipes/tasks/tau2_bench_telecom.md Score Extraction is still a literal TODO. SKILL.md Step 9 explicitly routes the agent to the matching task reference for canonical score fields, so for tau2_bench_telecom that gating step currently has no content. Author marked it as TODO rather than fixing — acknowledged hole, but a human should decide whether to land the task fragment as TODO or hold it.
  • PR readiness is ambiguous. Title was previously "WIP:"-prefixed; PR body still has unchecked TODOs ("Did you update Changelog: TODO", "Did you get Claude approval on this PR: TODO"), and @chadvoegele's own description still says "I'm running the full loop... iterating on skills". Worth confirming with the author that this is the intended landing snapshot.

No blocking correctness issues in the production code itself; the open items are scope/design and one explicit TODO in a documentation fragment.


Additional comments (outside the PR diff):

  • .claude/skills/evaluation/SKILL.md:379 — > Bot comment.

@Edwardf0t1 flagged that evaluation/SKILL.md is drifting from the lazy-loading target (~80–120 lines), with NEL timeout/resume + Step 9 verification + comparability detail all inlined. Now that compare-results/ exists in this PR, consider extracting Step 9 / 10 detail into evaluation/references/run-verification.md and routing comparability fully to compare-results, so the dispatch tree in SKILL.md stays small. Not blocking, but worth a human pass before this hardens.


## Score Extraction

TODO
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

Score Extraction is a literal TODO. SKILL.md Step 9 explicitly routes the agent to the matching task reference for canonical score harvesting, so for tau2_bench_telecom the gating step currently has nothing actionable to point at. Either fill in the dotted path against an actual results.yml (likely under results.groups.tau2_bench_telecom.metrics.pass_1.scores...) before merge, or drop the empty section so it doesn't look like a hole the agent will silently skip.

@@ -0,0 +1,22 @@
---
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot comment.

Two reviewer threads about this skill are unanswered:

  • @cjluo-nv: "Is this really a skill or just experience cache? If this is an experience cache, can we avoid promoting it to be a skill?" and "this name might be fused with debugger skill exits before this PR".
  • @Edwardf0t1: this skill isn't in the ModelOpt Agent Skills Design doc's shipped-skills list (ptq / deployment / evaluation / monitor / launching-evals / accessing-mlflow / debug + PLANNED compare-results), and the doc explicitly flags "reference file sprawl" as a risk. Suggested either (a) keep top-level + update the design doc, or (b) fold into debug/references/playbooks/.

Please respond with the design rationale before landing — at minimum a reply explaining why this isn't a debug/references/ entry and how it differs from the existing debug skill.

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Copy link
Copy Markdown
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general!

One thing we can address in a following PR is about evaluation/SKILL.md lazy-loading drift — extract the NEL timeout/resume section + Step 9 / 10 detail into references; keep SKILL.md as a dispatch tree, which is a Claude skills design best practice.

@chadvoegele chadvoegele merged commit e2d4d73 into main May 21, 2026
83 of 84 checks passed
@chadvoegele chadvoegele deleted the cvoegele/agent_evals branch May 21, 2026 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants