implement agent reliability features from ADK#2001

Open
peterj wants to merge 2 commits into main from peterj/addreliabilityfeatures

peterj (Collaborator) commented Jun 11, 2026

Adds reliability features to agents and ModelConfig.

The Agent CRD gains a reliability field with the following settings:

  • toolRetries (1–10): reflect-and-retry on failed tool calls. The runtime injects structured reflection guidance into the model context so the agent can self-correct instead of repeating the same failing call
  • maxLLMCalls (≥1): cost safety rail capping model calls per request; the run stops with a clear error instead of looping (runtime default 500)
  • debugLogging: verbose logging of every LLM request/response and tool call to the agent pod logs (off by default)
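
As a rough illustration of the bounds and the call cap listed above, here is a minimal Python sketch. The names `ReliabilitySettings`, `LLMCallCounter`, and `LLMCallLimitExceeded` are hypothetical and are not taken from this PR; only the ranges (1 to 10, at least 1), the off-by-default logging flag, and the default of 500 come from the description.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_LLM_CALLS = 500  # runtime default quoted in the PR description


@dataclass(frozen=True)
class ReliabilitySettings:
    """Hypothetical stand-in for the reliability settings; not the PR's type."""

    tool_retries: Optional[int] = None   # 1-10 when set
    max_llm_calls: Optional[int] = None  # >= 1 when set
    debug_logging: bool = False          # off by default

    def __post_init__(self) -> None:
        if self.tool_retries is not None and not 1 <= self.tool_retries <= 10:
            raise ValueError("tool_retries must be between 1 and 10")
        if self.max_llm_calls is not None and self.max_llm_calls < 1:
            raise ValueError("max_llm_calls must be at least 1")


class LLMCallLimitExceeded(RuntimeError):
    """Raised so the run stops with a clear error instead of looping."""


class LLMCallCounter:
    """Counts model calls for one request and enforces the cap."""

    def __init__(self, settings: ReliabilitySettings) -> None:
        self.limit = settings.max_llm_calls or DEFAULT_MAX_LLM_CALLS
        self.calls = 0

    def before_model_call(self) -> None:
        # Refuse the call once the cap has been reached.
        if self.calls >= self.limit:
            raise LLMCallLimitExceeded(
                f"request exceeded the limit of {self.limit} LLM calls"
            )
        self.calls += 1
```

A caller would invoke `before_model_call()` ahead of every model request and surface `LLMCallLimitExceeded` to the user as the "clear error" mentioned above.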

ModelConfig gains a retry field:

  • attempts (0–20): automatic retries of failed LLM HTTP requests (429/408/5xx) with exponential backoff via the provider SDK. Supported for OpenAI, Azure OpenAI, Anthropic, and Gemini; other providers log a warning
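
The PR delegates retries to each provider SDK, so the following is only a generic sketch of the behaviour described above: retry on 429, 408, and 5xx with exponential backoff. The function names, the delay values, and the jitter are assumptions made for illustration and say nothing about how any particular SDK schedules its retries.

```python
import random
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """Status codes the PR description lists as retried: 429, 408, 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def call_with_retries(
    send: Callable[[], int],
    attempts: int,
    base_delay: float = 0.5,   # assumed value, for illustration only
    max_delay: float = 8.0,    # assumed value, for illustration only
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() once, then retry up to `attempts` more times.

    `send` is a hypothetical callable returning an HTTP status code.
    """
    if not 0 <= attempts <= 20:
        raise ValueError("attempts must be between 0 and 20")
    for retry in range(attempts + 1):
        status = send()
        # Stop on success, on a non-retryable error, or when retries run out.
        if not is_retryable(status) or retry == attempts:
            return status
        delay = min(max_delay, base_delay * 2 ** retry)
        sleep(delay + random.uniform(0, delay / 4))  # small jitter (assumption)
    raise AssertionError("unreachable")
```

With `attempts=0` the request is sent exactly once, which matches reading the setting as a count of retries rather than of total tries.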

The UI is updated to show these settings in the Advanced section:

[Image: Screenshot 2026-06-11 at 4 11 52 PM]

Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Copilot AI review requested due to automatic review settings June 11, 2026 23:12
chromatic-com (Bot) commented Jun 11, 2026

Warning

Testing paused

Monthly snapshot limit reached. Update your plan for additional snapshots and to resume testing.

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds end-to-end “reliability” and “retry” configuration across CRDs, translators, runtimes (Python + Go), and UI so operators can control tool-call self-healing, model-call caps, debug logging, and provider SDK HTTP retry behavior.
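
The tool-call self-healing mentioned here might be pictured roughly as below. This is a minimal sketch under stated assumptions, not the plugin from this PR: `call_tool` and `propose_args` are hypothetical callables (the latter stands in for the model reading the reflection text and choosing new arguments), and treating a result with `isError: true` as a failure mirrors the MCP handling named in the file summary.

```python
from typing import Any, Callable, Dict


def tool_failed(result: Any) -> bool:
    """Treat an MCP-style result carrying isError: true as a failed call."""
    return isinstance(result, dict) and result.get("isError") is True


def build_reflection(
    tool_name: str, args: Dict[str, Any], error: str, attempt: int, tool_retries: int
) -> str:
    """Structured guidance that would be injected into the model context."""
    return (
        f"Tool '{tool_name}' failed on attempt {attempt} of {tool_retries + 1} "
        f"with arguments {args!r}: {error}. "
        "Reflect on the cause and change the call instead of repeating it."
    )


def run_tool_with_reflection(
    tool_name: str,
    call_tool: Callable[[Dict[str, Any]], Any],
    propose_args: Callable[[str], Dict[str, Any]],
    initial_args: Dict[str, Any],
    tool_retries: int,
) -> Any:
    args = initial_args
    error = "not attempted"
    for attempt in range(1, tool_retries + 2):  # first try plus tool_retries retries
        try:
            result = call_tool(args)
            if not tool_failed(result):
                return result
            error = str(result.get("content", result))
        except Exception as exc:  # raised errors count as failures too
            error = repr(exc)
        if attempt <= tool_retries:
            # Let the "model" pick new arguments from the reflection text.
            args = propose_args(
                build_reflection(tool_name, args, error, attempt, tool_retries)
            )
    raise RuntimeError(
        f"tool '{tool_name}' still failing after {tool_retries} retries: {error}"
    )
```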

Changes:

  • Adds spec.declarative.reliability to Agent/SandboxAgent (tool retries, max model calls, debug logging) and wires it through translators into runtime config.
  • Adds spec.retry.attempts to ModelConfig and maps it to max_retries in generated runtime config, including provider-specific wiring and warnings for unsupported providers.
  • Updates UI forms to expose these settings under an Advanced section; adds unit/golden test coverage for the new plumbing.
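
To make the second bullet concrete, here is a small sketch of how `spec.retry.attempts` could be carried into a `max_retries` field and then handed to a provider client, with a warning when the provider is not supported. The helper names, the dictionary layout, and the provider strings are assumptions for illustration; the PR's actual translator is written in Go and is not reproduced here.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("retry_translation_sketch")

# Provider labels are hypothetical spellings of the four supported providers
# named in the PR description (OpenAI, Azure OpenAI, Anthropic, Gemini).
SUPPORTED_PROVIDERS = {"OpenAI", "AzureOpenAI", "Anthropic", "Gemini"}


def to_runtime_model(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Map a ModelConfig-like spec onto a runtime model config."""
    model: Dict[str, Any] = {"type": spec["provider"]}
    attempts = (spec.get("retry") or {}).get("attempts")
    if attempts is not None:
        if not 0 <= attempts <= 20:
            raise ValueError("retry.attempts must be between 0 and 20")
        model["max_retries"] = attempts  # spec.retry.attempts -> max_retries
    return model


def client_kwargs(model: Dict[str, Any]) -> Dict[str, Any]:
    """Keyword arguments a provider SDK client would be constructed with."""
    if "max_retries" not in model:
        return {}
    if model["type"] not in SUPPORTED_PROVIDERS:
        logger.warning(
            "retry config ignored: provider %s does not support max_retries",
            model["type"],
        )
        return {}
    return {"max_retries": model["max_retries"]}
```

Keeping the warning at client-construction time, rather than in the translation step, matches the summary's note that the runtime logs when retry config is ignored.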

Reviewed changes

Copilot reviewed 39 out of 40 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| ui/src/types/index.ts | Adds UI type definitions for retry and agent reliability. |
| ui/src/components/AgentsProvider.tsx | Extends agent form data shape with reliability fields. |
| ui/src/app/models/new/page.tsx | Adds Advanced UI for model retry attempts and payload wiring. |
| ui/src/app/agents/new/page.tsx | Adds Advanced UI for agent reliability settings and payload wiring. |
| ui/src/app/actions/agents.ts | Maps agent form reliability fields into Agent/SandboxAgent specs. |
| python/packages/kagent-adk/tests/unittests/test_reliability_config.py | Adds unit tests for reliability parsing and reflect-and-retry behavior. |
| python/packages/kagent-adk/tests/unittests/test_model_retry.py | Adds unit tests for model retry/max_retries wiring into provider SDK clients. |
| python/packages/kagent-adk/tests/unittests/models/test_sap_ai_core.py | Removes a stray blank line in test file. |
| python/packages/kagent-adk/src/kagent/adk/types.py | Adds ReliabilityConfig, max_retries, and wires retry into provider LLM constructors. |
| python/packages/kagent-adk/src/kagent/adk/models/_openai.py | Plumbs max_retries into OpenAI/Azure OpenAI SDK client construction. |
| python/packages/kagent-adk/src/kagent/adk/models/_anthropic.py | Plumbs max_retries into Anthropic SDK client construction. |
| python/packages/kagent-adk/src/kagent/adk/converters/request_converter.py | Adds max_llm_calls to ADK RunConfig construction. |
| python/packages/kagent-adk/src/kagent/adk/cli.py | Enables debug logging + reflect-and-retry plugins based on reliability config. |
| python/packages/kagent-adk/src/kagent/adk/_reflect_retry_plugin.py | Adds MCP isError: true handling to reflect-and-retry tool plugin. |
| python/packages/kagent-adk/src/kagent/adk/_agent_executor.py | Surfaces clearer user error when max LLM calls limit is exceeded. |
| python/packages/kagent-adk/src/kagent/adk/_a2a.py | Passes reliability max LLM call cap into executor config. |
| helm/kagent-crds/templates/kagent.dev_sandboxagents.yaml | Adds CRD schema for spec.declarative.reliability (SandboxAgent). |
| helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml | Adds CRD schema for spec.retry.attempts (ModelConfig). |
| helm/kagent-crds/templates/kagent.dev_agents.yaml | Adds CRD schema for spec.declarative.reliability (Agent). |
| go/core/internal/controller/translator/agent/testdata/outputs/modelconfig_with_retry.json | Adds golden output validating ModelConfig retry translation. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_reliability.json | Adds golden output validating Agent reliability translation. |
| go/core/internal/controller/translator/agent/testdata/inputs/modelconfig_with_retry.yaml | Adds translator test input covering ModelConfig retry. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_reliability.yaml | Adds translator test input covering Agent reliability. |
| go/core/internal/controller/translator/agent/compiler.go | Translates agent reliability config into ADK config output. |
| go/core/internal/controller/translator/agent/adk_api_translator.go | Renames TLS helper to populate shared BaseModel fields and adds retry translation. |
| go/api/v1alpha2/zz_generated.deepcopy.go | Adds deepcopy support for new API fields. |
| go/api/v1alpha2/modelconfig_types.go | Adds Retry *ModelRetryConfig to ModelConfigSpec. |
| go/api/v1alpha2/agent_types.go | Adds Reliability *ReliabilityConfig to DeclarativeAgentSpec. |
| go/api/config/crd/bases/kagent.dev_sandboxagents.yaml | Adds generated base CRD schema for sandbox agent reliability. |
| go/api/config/crd/bases/kagent.dev_modelconfigs.yaml | Adds generated base CRD schema for model retry. |
| go/api/config/crd/bases/kagent.dev_agents.yaml | Adds generated base CRD schema for agent reliability. |
| go/api/adk/types.go | Adds max_retries on BaseModel and reliability in AgentConfig. |
| go/adk/pkg/runner/maxllmcalls.go | Implements Go-runtime max LLM calls limiter plugin. |
| go/adk/pkg/runner/adapter.go | Builds Go ADK plugins for reliability config (logging, retry-and-reflect, max calls). |
| go/adk/pkg/runner/adapter_test.go | Adds tests for reliability plugin building and max call limiter behavior. |
| go/adk/pkg/models/openai.go | Wires max retries into OpenAI/Azure OpenAI Go SDK options. |
| go/adk/pkg/models/base.go | Adds MaxRetries to transport config shape. |
| go/adk/pkg/models/anthropic.go | Wires max retries into Anthropic Go SDK options. |
| go/adk/pkg/agent/agent.go | Logs when retry config is ignored for unsupported providers; passes max retries in transport config. |
| docs/architecture/crds-and-types.md | Updates architecture doc to include new CRD/type layers and Go runtime parity note. |

Files not reviewed (1)
  • go/api/v1alpha2/zz_generated.deepcopy.go: Generated file

Comment thread go/adk/pkg/runner/adapter_test.go
Comment thread go/adk/pkg/runner/adapter_test.go
Comment thread ui/src/app/agents/new/page.tsx Outdated
Comment thread ui/src/app/actions/agents.ts Outdated
Comment thread ui/src/app/models/new/page.tsx
Comment thread python/packages/kagent-adk/src/kagent/adk/_agent_executor.py
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>