implement agent reliability features from ADK#2001
Open
peterj wants to merge 2 commits into
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Pull request overview
This PR adds end-to-end “reliability” and “retry” configuration across CRDs, translators, runtimes (Python + Go), and UI so operators can control tool-call self-healing, model-call caps, debug logging, and provider SDK HTTP retry behavior.
Changes:
- Adds `spec.declarative.reliability` to Agent/SandboxAgent (tool retries, max model calls, debug logging) and wires it through translators into runtime config.
- Adds `spec.retry.attempts` to ModelConfig and maps it to `max_retries` in generated runtime config, including provider-specific wiring and warnings for unsupported providers.
- Updates UI forms to expose these settings under an Advanced section; adds unit/golden test coverage for the new plumbing.
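To make the `spec.retry.attempts` to `max_retries` mapping concrete, here is a minimal Python sketch of the translation step. It is not the translator from this PR (that lives in Go); the function name `translate_retry`, the dict layout, and the provider identifier strings are assumptions made for illustration. Only the field names, the 0–20 range, and the "warn on unsupported provider" behavior come from the PR text.

```python
import logging
from typing import Any

logger = logging.getLogger("modelconfig-translate")

# Providers the PR description lists as supporting SDK-level retries.
# The identifier strings are hypothetical, chosen only for this sketch.
RETRY_CAPABLE_PROVIDERS = {"OpenAI", "AzureOpenAI", "Anthropic", "Gemini"}


def translate_retry(spec: dict[str, Any]) -> dict[str, Any]:
    """Map spec.retry.attempts onto max_retries in a runtime-config dict."""
    runtime: dict[str, Any] = {"type": spec["provider"], "model": spec["model"]}

    retry = spec.get("retry")
    if retry is None or "attempts" not in retry:
        # Assumption: with no retry block, max_retries is simply omitted.
        return runtime

    attempts = retry["attempts"]
    if not 0 <= attempts <= 20:
        raise ValueError(f"retry.attempts must be in [0, 20], got {attempts}")

    if spec["provider"] not in RETRY_CAPABLE_PROVIDERS:
        # Unsupported providers keep working; the setting is ignored with a warning.
        logger.warning("retry.attempts ignored for provider %s", spec["provider"])
        return runtime

    runtime["max_retries"] = attempts
    return runtime
```

Calling `translate_retry({"provider": "OpenAI", "model": "gpt-4o", "retry": {"attempts": 3}})` yields a dict with `max_retries` set to 3, while the same spec with an unsupported provider omits the key and logs a warning.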
Reviewed changes
Copilot reviewed 39 out of 40 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ui/src/types/index.ts | Adds UI type definitions for retry and agent reliability. |
| ui/src/components/AgentsProvider.tsx | Extends agent form data shape with reliability fields. |
| ui/src/app/models/new/page.tsx | Adds Advanced UI for model retry attempts and payload wiring. |
| ui/src/app/agents/new/page.tsx | Adds Advanced UI for agent reliability settings and payload wiring. |
| ui/src/app/actions/agents.ts | Maps agent form reliability fields into Agent/SandboxAgent specs. |
| python/packages/kagent-adk/tests/unittests/test_reliability_config.py | Adds unit tests for reliability parsing and reflect-and-retry behavior. |
| python/packages/kagent-adk/tests/unittests/test_model_retry.py | Adds unit tests for model retry/max_retries wiring into provider SDK clients. |
| python/packages/kagent-adk/tests/unittests/models/test_sap_ai_core.py | Removes a stray blank line in test file. |
| python/packages/kagent-adk/src/kagent/adk/types.py | Adds ReliabilityConfig, max_retries, and wires retry into provider LLM constructors. |
| python/packages/kagent-adk/src/kagent/adk/models/_openai.py | Plumbs max_retries into OpenAI/Azure OpenAI SDK client construction. |
| python/packages/kagent-adk/src/kagent/adk/models/_anthropic.py | Plumbs max_retries into Anthropic SDK client construction. |
| python/packages/kagent-adk/src/kagent/adk/converters/request_converter.py | Adds max_llm_calls to ADK RunConfig construction. |
| python/packages/kagent-adk/src/kagent/adk/cli.py | Enables debug logging + reflect-and-retry plugins based on reliability config. |
| python/packages/kagent-adk/src/kagent/adk/_reflect_retry_plugin.py | Adds MCP isError: true handling to reflect-and-retry tool plugin. |
| python/packages/kagent-adk/src/kagent/adk/_agent_executor.py | Surfaces clearer user error when max LLM calls limit is exceeded. |
| python/packages/kagent-adk/src/kagent/adk/_a2a.py | Passes reliability max LLM call cap into executor config. |
| helm/kagent-crds/templates/kagent.dev_sandboxagents.yaml | Adds CRD schema for spec.declarative.reliability (SandboxAgent). |
| helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml | Adds CRD schema for spec.retry.attempts (ModelConfig). |
| helm/kagent-crds/templates/kagent.dev_agents.yaml | Adds CRD schema for spec.declarative.reliability (Agent). |
| go/core/internal/controller/translator/agent/testdata/outputs/modelconfig_with_retry.json | Adds golden output validating ModelConfig retry translation. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_reliability.json | Adds golden output validating Agent reliability translation. |
| go/core/internal/controller/translator/agent/testdata/inputs/modelconfig_with_retry.yaml | Adds translator test input covering ModelConfig retry. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_reliability.yaml | Adds translator test input covering Agent reliability. |
| go/core/internal/controller/translator/agent/compiler.go | Translates agent reliability config into ADK config output. |
| go/core/internal/controller/translator/agent/adk_api_translator.go | Renames TLS helper to populate shared BaseModel fields and adds retry translation. |
| go/api/v1alpha2/zz_generated.deepcopy.go | Adds deepcopy support for new API fields. |
| go/api/v1alpha2/modelconfig_types.go | Adds Retry *ModelRetryConfig to ModelConfigSpec. |
| go/api/v1alpha2/agent_types.go | Adds Reliability *ReliabilityConfig to DeclarativeAgentSpec. |
| go/api/config/crd/bases/kagent.dev_sandboxagents.yaml | Adds generated base CRD schema for sandbox agent reliability. |
| go/api/config/crd/bases/kagent.dev_modelconfigs.yaml | Adds generated base CRD schema for model retry. |
| go/api/config/crd/bases/kagent.dev_agents.yaml | Adds generated base CRD schema for agent reliability. |
| go/api/adk/types.go | Adds max_retries on BaseModel and reliability in AgentConfig. |
| go/adk/pkg/runner/maxllmcalls.go | Implements Go-runtime max LLM calls limiter plugin. |
| go/adk/pkg/runner/adapter.go | Builds Go ADK plugins for reliability config (logging, retry-and-reflect, max calls). |
| go/adk/pkg/runner/adapter_test.go | Adds tests for reliability plugin building and max call limiter behavior. |
| go/adk/pkg/models/openai.go | Wires max retries into OpenAI/Azure OpenAI Go SDK options. |
| go/adk/pkg/models/base.go | Adds MaxRetries to transport config shape. |
| go/adk/pkg/models/anthropic.go | Wires max retries into Anthropic Go SDK options. |
| go/adk/pkg/agent/agent.go | Logs when retry config is ignored for unsupported providers; passes max retries in transport config. |
| docs/architecture/crds-and-types.md | Updates architecture doc to include new CRD/type layers and Go runtime parity note. |
Files not reviewed (1)
- go/api/v1alpha2/zz_generated.deepcopy.go: Generated file
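The max-LLM-calls limiter mentioned in the file table can be pictured as a per-request counter that fails loudly once the budget is spent. The sketch below is a hypothetical Python illustration, not the code in `maxllmcalls.go` or `_agent_executor.py`; the class and method names are invented, and the default of 500 mirrors the runtime default stated in the PR description.

```python
class MaxLLMCallsExceeded(RuntimeError):
    """Raised when a request tries to exceed its model-call budget."""


class LLMCallLimiter:
    """Counts model calls for a single request and stops the run at the cap."""

    def __init__(self, max_llm_calls: int = 500) -> None:
        if max_llm_calls < 1:
            raise ValueError("max_llm_calls must be >= 1")
        self.max_llm_calls = max_llm_calls
        self.calls = 0

    def before_model_call(self) -> None:
        """Call once before every model request; raises when the cap is reached."""
        if self.calls >= self.max_llm_calls:
            raise MaxLLMCallsExceeded(
                f"max LLM calls limit ({self.max_llm_calls}) exceeded for this request"
            )
        self.calls += 1
```

One limiter instance per request keeps the counter from leaking across conversations; the executor can then catch the exception and surface it as a clear user-facing error rather than letting the agent loop.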
Adds reliability features to agents and ModelConfig.
The Agent CRD has the `reliability` field with the following config:
- `toolRetries` (1–10): reflect-and-retry on failed tool calls. The runtime injects structured reflection guidance into the model context so the agent can self-correct instead of repeating the same failing call.
- `maxLLMCalls` (≥1): cost safety rail capping model calls per request; the run stops with a clear error instead of looping (runtime default 500).
- `debugLogging`: verbose logging of every LLM request/response and tool call to the agent pod logs (off by default).
ModelConfig `retry` field:
- `attempts` (0–20): automatic retries of failed LLM HTTP requests (429/408/5xx) with exponential backoff via the provider SDK. Supported for OpenAI, Azure OpenAI, Anthropic, and Gemini; other providers log a warning.
The UI is updated to show the settings in the Advanced section.
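For readers unfamiliar with the retry policy that `attempts` controls, the following Python sketch shows the general shape: retry only on 429, 408, and 5xx, with exponentially growing delays. In this PR the provider SDKs do the retrying internally, so this is an illustration of the policy and not the PR's code. The function names, delay constants, and the use of jitter are assumptions, as is reading `attempts` as the number of retries after the initial request.

```python
import random
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """Treat 408, 429 and any 5xx status as transient."""
    return status in (408, 429) or 500 <= status <= 599


def call_with_retries(
    send: Callable[[], int],
    attempts: int,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() once, then retry up to `attempts` more times on retryable statuses."""
    status = send()
    for retry in range(attempts):
        if not is_retryable(status):
            break
        # Exponential backoff with jitter; the constants are illustrative only.
        delay = min(max_delay, base_delay * (2 ** retry))
        sleep(random.uniform(0, delay))
        status = send()
    return status
```

Here `send` stands in for a single HTTP request that returns its status code. Setting `attempts=0` disables retries entirely, matching the lower bound of the documented range.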