Foundry private networking for Azure AI agents #8708
Draft
m5i-work wants to merge 15 commits into
Add a declarative network: block to the Foundry service in azure.yaml and
teach the bicep-less synthesizer to provision a VNet-bound (network-secured)
account from it. Additive: an absent block yields today's public account.
- schema: network: surface (mode byo|managed, byo vnet/subnets tri-state,
managed isolationMode, dns create-or-reference) on microsoft.foundry.json
- synthesizer: decode network:, resolve ${VAR}, validate (mode coherence,
vnet ARM id, subnet tri-state/CIDR, DNS rg/sub), emit network params +
NetworkMode for telemetry
- templates: new modules/network.bicep, subnet.bicep, private-endpoint-dns.bicep;
resources.bicep/main.bicep guard the network path on enableNetworkIsolation
(publicNetworkAccess Disabled, networkAcls Deny, agent networkInjections,
account private endpoint + 3 AI DNS zones); main.arm.json regenerated
- provider: pass azd env for ${VAR} resolution, emit provision.network_mode,
warn that network: is ignored when endpoint: (brownfield) is set
- docs/tests: synthesizer network tests, eject module assertions, extension
README network section, telemetry-data.md provision.network_mode
The existing on-disk provision flow resolves ${VAR} in main.parameters.json
from the azd environment at provision time. Eject must therefore keep ${VAR}
references verbatim instead of resolving them eagerly from the process env and
freezing a literal into the ejected file.
- synthesis.Input gains PreserveVarRefs; when set, byo.vnet.id and
dns.subscription pass through verbatim and the format checks that cannot run
on an unexpanded placeholder are skipped (concrete-but-malformed still fails)
- eject (init --infra) sets PreserveVarRefs so the ejected main.parameters.json
stays environment-portable; the provision path still resolves and validates
- tests: synthesizer preserve-mode (pass-through + concrete validation) and an
eject e2e asserting ${AZURE_VNET_ID}/${AZURE_DNS_SUBSCRIPTION_ID} survive
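The preserve-versus-resolve split described above can be sketched in Go. This is a minimal illustration, not the extension's code: `resolveVarRefs`, `hasVarRef`, and the `preserve` flag are hypothetical names standing in for the `PreserveVarRefs` behaviour, and treating an unset variable as an error is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// varRef matches ${NAME} placeholders such as ${AZURE_VNET_ID}.
var varRef = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// resolveVarRefs is a hypothetical helper. With preserve=true (the eject
// path) the input is returned verbatim. Otherwise every ${NAME} is replaced
// from env; an unset name is reported as an error (an assumption here).
func resolveVarRefs(s string, env map[string]string, preserve bool) (string, error) {
	if preserve {
		return s, nil
	}
	var missing []string
	out := varRef.ReplaceAllStringFunc(s, func(m string) string {
		name := varRef.FindStringSubmatch(m)[1]
		v, ok := env[name]
		if !ok {
			missing = append(missing, name)
			return m
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unresolved variables: %v", missing)
	}
	return out, nil
}

// hasVarRef reports whether s still holds an unexpanded placeholder, in
// which case format checks (for example the ARM id shape) cannot run.
func hasVarRef(s string) bool {
	return varRef.MatchString(s)
}

func main() {
	env := map[string]string{"AZURE_VNET_ID": "example-vnet-id"}
	ejected, _ := resolveVarRefs("${AZURE_VNET_ID}", env, true)
	provisioned, _ := resolveVarRefs("${AZURE_VNET_ID}", env, false)
	fmt.Println(ejected)     // placeholder kept for the ejected file
	fmt.Println(provisioned) // value resolved at provision time
}
```

In this shape, the eject path would call the helper with `preserve=true` and skip format validation whenever `hasVarRef` is true, while a concrete but malformed value would still reach the validator.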
Bash E2E harness validating host: microsoft.foundry private networking,
designed to minimize Azure resource-operation time:
- ONE real network account is provisioned (create+own matrix cell) with a BYO
--image agent, then deploy + invoke prove the agent works under the VNet.
- Scenario 1 (bicep-less) and the other 3 matrix cells (subnet create/reference
x DNS own/reference) are verified with 'azd provision --preview' (ARM what-if),
which creates nothing.
- Scenario 2 (eject) is verified against the live account: eject -> what-if
reports no changes (idempotent), proving the on-disk template + provision-time
${VAR} resolution reproduces the in-memory topology; a manual infra/ edit then
surfaces as the only delta. Guards the ${VAR}-preservation fix end-to-end.
- A shared BYO VNet (+ reference subnets / external DNS zones) is created once
and reused across cells.
Files: run-network-e2e.sh (phases 0-6 orchestrator), assert-resources.sh (live
az topology checks: publicNetworkAccess Disabled, account PE groupIds, 3 AI DNS
zones, agent-subnet delegation), lib.sh (logging/assert/azure.yaml mutation),
README.md (cost rationale, prerequisites, cleanup). Westus account region per
requirement; AcrPull granted to the project MI on the ABAC registry.
Decouple the private-networking E2E from the BYO-image init UX (PR 8689) so it
runs against the current branch today:
- Replace 'azd ai agent init --image' with a hand-authored azure.yaml fixture
(foundry service + network: block + agent image:), created via 'azd env new'.
image: yields includeAcr=false, matching BYO image, so no ACR at provision.
Verified the fixture synthesizes: mode=byo, enableNetworkIsolation=true,
includeAcr=false, ${VAR} resolved.
- Gate phase 5 (deploy + invoke) behind RUN_DEPLOY=true: the headless BYO-image
deploy needs the AZD_AGENT_SKIP_ACR short-circuit from PR 8689, otherwise
deploy defaults to build and fails. Phases 0-4 (local gates, shared VNet,
what-if matrix, one real provision, eject idempotency) validate all the
networking code without it.
- Fix the ABAC registry role: grant 'Container Registry Repository Reader'
(ABAC-aware) instead of AcrPull; move the grant into the gated deploy phase.
- Drop the --image preflight; README updated (scenario table, prerequisites,
RUN_DEPLOY usage, role).
…sing
Two product bugs surfaced by live E2E provisioning (ARM what-if does not catch either; both require a real deployment):
1. networkInjections preflight failure. The account and the network module deploy in the same template, so subnetArmId: network!.outputs.agentSubnetId compiled to an unresolved reference() at the CognitiveServices RP preflight, which then failed to convert networkInjections to its typed contract (InvalidResourceProperties). Build the subnet ARM id as a deterministic string from the concrete vnetId param instead, and add an explicit dependsOn so the subnet still exists before injection. Recompiled main.arm.json.
2. AZURE_FOUNDRY_NETWORK_MODE missing from canonicalOutputNames. ARM mangles output-name casing (AZURE_..._MODE -> azurE_..._MODE); without the canonical remapping the env key was stored mis-cased and azd env get-value AZURE_FOUNDRY_NETWORK_MODE returned empty. Added it to the restore list and a regression case to TestArmOutputsToProto_RepairsMangledKeyCase.
Validated end-to-end: real westus network-isolated Foundry account provisions green with all topology assertions passing (publicNetworkAccess Disabled, networkAcls Deny, private endpoint, agent-subnet delegation, 3 AI DNS zones, network mode byo), across the full subnet create/reference x DNS own/reference matrix, plus eject idempotency (what-if reports no changes).
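Both fixes reduce to small string operations, sketched here in Go under stated assumptions. `repairKeyCase` and `subnetARMID` are hypothetical helper names; only the `AZURE_FOUNDRY_NETWORK_MODE` entry and the `canonicalOutputNames` identifier come from the text, and the actual fix for the subnet id lives in Bicep rather than Go.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalOutputNames lists env keys whose casing should be restored.
// The slice layout is assumed; only this entry is taken from the text.
var canonicalOutputNames = []string{"AZURE_FOUNDRY_NETWORK_MODE"}

// repairKeyCase maps a mis-cased ARM output name back to its canonical
// spelling by case-insensitive comparison; unknown keys pass through.
func repairKeyCase(key string) string {
	for _, canonical := range canonicalOutputNames {
		if strings.EqualFold(canonical, key) {
			return canonical
		}
	}
	return key
}

// subnetARMID builds a subnet resource id as a plain string from a concrete
// VNet id, so no cross-module reference() has to be evaluated at preflight.
func subnetARMID(vnetID, subnetName string) string {
	return strings.TrimRight(vnetID, "/") + "/subnets/" + subnetName
}

func main() {
	// The lower-cased key is an arbitrary example of mangled casing.
	fmt.Println(repairKeyCase("azure_foundry_network_mode"))
	fmt.Println(subnetARMID("/subscriptions/s/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/v", "agent"))
}
```

A string-built id carries no implicit dependency, which is why the commit pairs it with an explicit dependsOn.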
Fixes found while running the harness against live Azure (phases 0-4):
- Hand-authored project must include an agent.yaml (kind: hosted + image:) alongside azure.yaml; the foundry provider requires an agent definition file.
- setup_project now sets AZURE_RESOURCE_GROUP (the subscription-scoped template creates the RG but the provider needs the name) and AZD_AGENT_SKIP_ACR=true (BYO-image deploy signal).
- Phase 0 refreshes the dev extension from current source (build -> pack -> publish -> install) so the run tests local code, registering the provisioning-provider capability + microsoft.foundry provider. Gated by SKIP_EXT_REFRESH.
- What-if matrix gates on a successful ARM what-if (exit 0) rather than grepping a summary-only preview; this still validates reference-mode subnet/zone existence and delegation against the real VNet.
- Idempotent private-dns zone creation (reruns no longer fail on existing zones).
- Add MAX_PHASE to stop early while iterating.
- ACR grant uses the ABAC-aware Container Registry Repository Reader role.
- Fix set -u unbound-variable crash in the phase-4 assert message.
- .gitignore the transient per-run log directories.
Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision + topology assertions, eject idempotency) pass green. Phase 5 (deploy + invoke) stays gated behind RUN_DEPLOY=true and needs a reachable BYO agent image.
Update the Foundry private-network E2E harness so phase 5 can build the ~/agents/echo-dual image itself instead of requiring a prebuilt external image.
- Add BUILD_IMAGE=true, ECHO_DUAL_DIR, ACR_NAME/ACR_RG, IMAGE_REPO/IMAGE_TAG.
- Create the target ACR with --role-assignment-mode rbac-abac and reject reuse of non-ABAC registries.
- Grant the caller Container Registry Repository Writer before the ACR Task push. Resolve the caller object id from the ARM token oid claim to avoid Microsoft Graph / Conditional Access failures.
- Build with the required `az acr build --source-acr-auth-id [caller]` form.
- Keep the project MI grant on the ABAC-aware Container Registry Repository Reader role for image pull.
- Add TARGET_RG support so investigation runs can keep VNet, DNS, ACR, and the real Foundry env in a single RG.
Live validation: the harness created an ABAC ACR, granted caller writer, built and pushed ~/agents/echo-dual with --source-acr-auth-id [caller], provisioned a private-networked Foundry account, and granted the project MI Repository Reader. The subsequent deploy failed from this public runner with the expected private endpoint 403, which is documented.
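Reading the caller object id from the token's oid claim needs only a local JWT payload decode, with no Graph call. The harness itself is Bash; the Go sketch below shows the same idea. `oidFromToken` is a hypothetical name, the sample token is built locally and unsigned, and the signature is deliberately not verified because the token is only inspected, never trusted for authorization.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// oidFromToken extracts the "oid" claim from a JWT access token.
// It decodes the payload segment only and does not verify the signature.
func oidFromToken(token string) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return "", errors.New("token is not a three-part JWT")
	}
	// JWT segments are base64url-encoded without padding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return "", fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		OID string `json:"oid"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return "", fmt.Errorf("parse claims: %w", err)
	}
	if claims.OID == "" {
		return "", errors.New("token has no oid claim")
	}
	return claims.OID, nil
}

func main() {
	// Build an unsigned sample token locally so the sketch runs offline.
	body := base64.RawURLEncoding.EncodeToString([]byte(`{"oid":"example-object-id"}`))
	oid, err := oidFromToken("e30." + body + ".sig")
	fmt.Println(oid, err)
}
```

In a real run the token would come from the signed-in identity's ARM access token; how the harness obtains it is not shown in the commit message.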
Live phase-5 validation showed hosted-agent image pull uses the Foundry project managed identity, not the parent account identity. Update the network E2E harness to resolve AZURE_AI_PROJECT_ID via ARM and grant the project MI the ABAC-aware Container Registry Repository Reader role on the BYO ACR, falling back to the account MI only for older API shapes. Also persist AZURE_TENANT_ID in the azd env so postdeploy hooks do not fail on VM/managed-identity runners after deploy succeeds.
Add a concise README cheatsheet for initializing, provisioning, deploying, and invoking a hosted Foundry agent with a BYO container image under VNet private networking. Include ACR requirements for ABAC and private-only registries.
Keep the extension README concise by moving the detailed Foundry private networking schema, requirements, and BYO-image VNet cheatsheet into `docs/private-networking.md`, with a short README pointer.
Live managed-network provisioning showed that the resources module emitted AZURE_FOUNDRY_MANAGED_ISOLATION_MODE but the subscription-scoped main template never forwarded it, so azd env only received AZURE_FOUNDRY_NETWORK_MODE. Forward the output from main.bicep, add it to the provider canonical output-name restore list, and cover ARM casing repair with a regression test. Also document the managed VNet provisioning scenario in the private-networking guide. Live validation: provisioned network.mode=managed in westus and verified the account had publicNetworkAccess Disabled, networkAcls Deny, networkInjections with useMicrosoftManagedNetwork=true, AZURE_FOUNDRY_NETWORK_MODE=managed, and AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound.
Live managed-network deploy validation showed that managed mode configures the hosted-agent runtime to use a Microsoft-managed network but does not create a customer private endpoint for the Foundry data plane. Disabling public access in that mode makes azd deploy/invoke fail with `403 Public access is disabled`. Keep public data-plane access enabled for managed mode while preserving BYO mode behavior (public access disabled + private endpoint). Update the private networking guide with managed deploy/invoke guidance. Live validation: provisioned managed mode, converted the test ACR to ABAC, built the echo-dual image with `az acr build --source-acr-auth-id [caller]`, granted the Foundry project MI `Container Registry Repository Reader`, deployed successfully, and invoked the hosted agent successfully.
Realign the azure.yaml `network:` surface to the natural Azure resource shape and make a network-bound Foundry account private in every mode. Reverses the prior managed-mode regression that flipped the account's publicNetworkAccess back to Enabled. Service sample 18 confirms managed mode supports a private data plane (customer private endpoint + the Microsoft-managed egress network), so declaring `network:` now always disables public data-plane access.
Config: flat `network:` block with two orthogonal axes, no `mode` enum.
- peSubnet (required) -> account private endpoint; omitting it while `network:` is declared is an error, never a silent public fallback.
- agentSubnet (optional) -> present injects the agent into a customer subnet (BYO egress); absent uses the Microsoft-managed network (managed egress), where isolationMode becomes valid.
Synthesizer/templates:
- derive egress from agentSubnet presence (useManagedEgress); replace the networkMode param with a useManagedEgress bool.
- disablePublicDataPlaneAccess = enableNetworkIsolation (always private).
- add a managedNetworks/default child resource carrying isolationMode for managed egress.
- validate peSubnet-required, isolationMode-managed-only, and single-VNet.
Docs/tests/e2e:
- rewrite docs/private-networking.md (host: azure.ai.agent, the value the provision provider actually accepts on this branch).
- add synthesizer unit tests + a compiled-ARM regression guard.
- add a live E2E harness (8-cell what-if matrix, BYO + managed-iso real provisions, eject idempotency) with an automatic jumpbox SOCKS tunnel so deploy/invoke can reach the private data plane; assert real account network-injection state rather than azd's echoed output.
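The derivation and the three validation rules can be sketched in Go as follows. The `Network` and `Subnet` types, their field names, and the error messages are hypothetical stand-ins for the synthesizer's real types; only the rules themselves (peSubnet required, isolationMode valid only for managed egress, one shared VNet, and `useManagedEgress = agentSubnet == nil`) come from the text.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Subnet and Network are illustrative shapes, not the real schema types.
type Subnet struct {
	VNetID string
	Name   string
}

type Network struct {
	PESubnet      *Subnet // required when network: is declared
	AgentSubnet   *Subnet // optional; nil means managed egress
	IsolationMode string  // only meaningful for managed egress
}

// validate applies the three rules and derives useManagedEgress.
// A nil Network models an absent network: block (public account).
func validate(n *Network) (useManagedEgress bool, err error) {
	if n == nil {
		return false, nil
	}
	if n.PESubnet == nil {
		return false, errors.New("network: declared without peSubnet")
	}
	useManagedEgress = n.AgentSubnet == nil
	if useManagedEgress {
		return true, nil
	}
	if n.IsolationMode != "" {
		return false, errors.New("isolationMode is valid only for managed egress")
	}
	// ARM ids are compared case-insensitively.
	if !strings.EqualFold(n.AgentSubnet.VNetID, n.PESubnet.VNetID) {
		return false, errors.New("agentSubnet and peSubnet must share one VNet")
	}
	return false, nil
}

func main() {
	pe := &Subnet{VNetID: "/vnets/a", Name: "pe"}
	managed, err := validate(&Network{PESubnet: pe, IsolationMode: "AllowOnlyApprovedOutbound"})
	fmt.Println(managed, err)
	_, err = validate(&Network{AgentSubnet: &Subnet{VNetID: "/vnets/a", Name: "agent"}})
	fmt.Println(err)
}
```

Returning an error for a missing peSubnet, rather than defaulting to a public account, is the "never a silent public fallback" rule in executable form.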
…n private-networking cheatsheet
Managed-egress cheatsheet now scaffolds the agent via 'azd ai agent init --image' (writes agent.yaml) instead of assuming a hand-authored manifest, matching the BYO cheatsheet. Replace the env-output 'Expected outputs' block (azd echoing its own AZURE_FOUNDRY_* classification) with real resource-state validation: account publicNetworkAccess=Disabled and the managedNetworks/default isolationMode, with the invoke echo response as the end-to-end proof.
'azd ai agent init --image' scaffolds azure.yaml/agent.yaml, so it must come before the network: block the reader adds to the generated service. Matches the actual timing and the BYO cheatsheet ordering.
Summary
Adds declarative, secure-by-default private networking for `host: microsoft.foundry` services in the Azure AI agents extension. Declaring a `network:` block on a Foundry-hosted service provisions a network-bound Foundry account/project from `azure.yaml` with the data plane private in every mode (account `publicNetworkAccess: Disabled` + customer private endpoint).
The config surface is flat and mirrors the natural Azure resource shape: two orthogonal axes, no `mode` enum.
- `peSubnet`: the account private endpoint. Its presence is what makes the data plane private; omitting it while declaring `network:` is an error (never a silent public fallback).
- `agentSubnet`: present ⇒ the agent is injected into your subnet (BYO egress); absent ⇒ the Microsoft-managed network is used (managed egress), where `isolationMode` becomes valid.
Changes
- Flat `network:` schema for Foundry-hosted services (`agentSubnet`, `peSubnet`, `dns`, `isolationMode`); the VNet rides on each subnet.
- Account always `publicNetworkAccess: Disabled` (`disablePublicDataPlaneAccess = enableNetworkIsolation`).
- Account private endpoint in `peSubnet` (+ private DNS zones / VNet links, or referenced existing zones via `dns:`).
- BYO egress: agent subnet with the account `networkInjections` pointing at that subnet.
- Managed egress: `useManagedEgress` injection + a `managedNetworks/default` child resource carrying `isolationMode`.
- Derive egress from `agentSubnet` presence (`useManagedEgress = agentSubnet == nil`); replace the `networkMode` template param with a `useManagedEgress` bool.
- Validation: `peSubnet` required when `network:` is declared; `isolationMode` valid only for managed egress; all subnets share one VNet.
- Preserve `${VAR}` placeholders during `azd ai agent init --infra` eject; resolve only at provision time.
- Deterministic `networkInjections.subnetArmId` (an inter-module `reference()` is unresolved at the CognitiveServices RP preflight and what-if does not catch it).
- Live E2E harness under `cli/azd/extensions/azure.ai.agents/test/e2e/network/`.
Test coverage
Three tiers; only the live tier creates resources.
Scenarios in the coverage table:
- `create` · DNS create
- `create` · `AllowOnlyApprovedOutbound` (`RUN_MANAGED_ISO`)
- `reference` (existing subnet)
- `AllowInternetOutbound`
- `reference` (existing zones)
- `peSubnet` omitted while `network:` declared
- `endpoint:` brownfield + `network:`
The full 8-cell BYO/managed × create/reference × own/reference × isolation-mode matrix is exercised by ARM what-if (the template compiles and the CognitiveServices RP accepts the shape, but nothing is created) and by deterministic synthesizer unit tests. Two representative cells are live-provisioned end to end, and the real deploy/invoke data path is validated against the BYO cell. Reference-subnet mode, `AllowInternetOutbound`, and referenced-DNS are what-if + unit-test only.
E2E validation performed
Scenario 1: Provision a private-networked Foundry from `azure.yaml`
- BYO egress: `publicNetworkAccess: Disabled`, `networkAcls.defaultAction: Deny`; Foundry private endpoint (`account` group) in `peSubnet`; agent subnet delegated to `Microsoft.App/environments`; the account `networkInjections` references the customer agent subnet; the three `privatelink.*` DNS zones created and linked.
- Managed egress: provisioned with `AllowOnlyApprovedOutbound` into a dedicated VNet; verified the `managedNetworks/default` child resource was accepted with `isolationMode = AllowOnlyApprovedOutbound` (the one thing what-if cannot confirm, since the V2 managed network is created, not planned).
Live assertions read real resource state (account properties, private endpoint, subnet delegation, DNS zones, the account's network injection), not azd's own output variables.
Scenario 2: Eject preserves private-networking config
- `azd ai agent init --infra` ejects equivalent Bicep and preserves `${VAR}` placeholders (e.g. `${AZURE_VNET_ID}` resolves at provision time, not eject time).
- The ejected template what-ifs as no changes against the already-provisioned account (idempotent).
Scenario 3: Deploy and invoke a hosted agent over the private data plane
Because the data plane is private in every mode, deploy/invoke must run with line-of-sight to the private endpoint. The harness captures this automatically (`lib-jumpbox.sh`): it stands up a jumpbox VM (inside the foundry VNet, or a peered VNet in another region when the account region has no VM capacity) and exposes it as a local SOCKS5 proxy, so `azd deploy`/`invoke` run on the dev host with data-plane HTTPS tunneled into the VNet.
Validated against the live BYO account (`publicNetworkAccess: Disabled`):
- Project MI granted `Container Registry Repository Reader` on an ABAC-enabled ACR for the BYO image pull
- `azd deploy` → hosted agent reached `active`
- `azd ai agent invoke` → returned the expected echo response
A direct deploy/invoke from the public internet fails as expected with `403 Public access is disabled`.
Documentation
- `cli/azd/extensions/azure.ai.agents/docs/private-networking.md`
- `cli/azd/extensions/azure.ai.agents/test/e2e/network/README.md`
Known limitations
- Single VNet for `agentSubnet` and `peSubnet`; cross-VNet topologies are deferred (require customer-managed peering + DNS-zone links).
- A `dns`-create account links the VNet to the three AI `privatelink.*` zones, and a VNet allows only one link per namespace, so a second account (or brownfield hub) must use `dns:` reference mode.