
Foundry private networking for Azure AI agents #8708

Draft
m5i-work wants to merge 15 commits into huimiu/foundry-azure-yaml from m5i/foundry-private-network

m5i-work commented Jun 18, 2026

Member

Summary

Adds declarative, secure-by-default private networking for host: microsoft.foundry
services in the Azure AI agents extension. Declaring a network: block on a
Foundry-hosted service provisions a network-bound Foundry account/project from
azure.yaml with the data plane private in every mode (account
publicNetworkAccess: Disabled + customer private endpoint).

The config surface is flat and mirrors the natural Azure resource shape — two
orthogonal axes, no mode enum:

  • Ingress (required): peSubnet — the account private endpoint. Its presence
    is what makes the data plane private; omitting it while declaring network: is
    an error (never a silent public fallback).
  • Egress (optional): agentSubnet — present ⇒ the agent is injected into your
    subnet (BYO egress); absent ⇒ the Microsoft-managed network is used (managed
    egress), where isolationMode becomes valid.

# BYO egress (agent injected into your subnet)
services:
  agent:
    host: azure.ai.agent
    network:
      agentSubnet:
        vnet: ${AZURE_VNET_ID}
        name: agent-subnet
        prefix: 192.168.10.0/24   # omit prefix to reference an existing subnet
      peSubnet:
        vnet: ${AZURE_VNET_ID}
        name: pe-subnet
        prefix: 192.168.11.0/24
      dns:                         # optional; omit to let azd create + link zones
        resourceGroup: rg-private-dns

# Managed egress (no agentSubnet; isolationMode now valid)
services:
  agent:
    host: azure.ai.agent
    network:
      isolationMode: AllowOnlyApprovedOutbound
      peSubnet:
        vnet: ${AZURE_VNET_ID}
        name: pe-subnet
        prefix: 192.168.11.0/24
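
The two examples above differ only in whether agentSubnet is present. A minimal Python sketch of that derivation, assuming the network: block has already been parsed into a plain dict; the function name `derive_egress` and the dict shape are illustrative and are not the extension's actual Go code:

```python
def derive_egress(network: dict) -> dict:
    """Derive the egress axis from a parsed network: block (illustrative only)."""
    # agentSubnet present -> BYO egress; absent -> Microsoft-managed egress.
    use_managed_egress = network.get("agentSubnet") is None
    return {
        "useManagedEgress": use_managed_egress,
        # Declaring network: always disables public data-plane access.
        "publicNetworkAccess": "Disabled",
    }


byo = derive_egress({
    "agentSubnet": {"name": "agent-subnet"},
    "peSubnet": {"name": "pe-subnet"},
})
managed = derive_egress({
    "isolationMode": "AllowOnlyApprovedOutbound",
    "peSubnet": {"name": "pe-subnet"},
})
print(byo["useManagedEgress"], managed["useManagedEgress"])
```

Because the mode is derived rather than declared, there is no way to write a config whose stated mode disagrees with the subnets it lists.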

Changes

  • Add the flat network: schema for Foundry-hosted services (agentSubnet,
    peSubnet, dns, isolationMode); the VNet rides on each subnet.
  • Synthesize Bicep that is private in every network-bound mode:
    • account publicNetworkAccess: Disabled (disablePublicDataPlaneAccess = enableNetworkIsolation)
    • Foundry private endpoint in peSubnet (+ private DNS zones / VNet links, or
      referenced existing zones via dns:)
    • BYO egress: delegated hosted-agent subnet + account networkInjections
      pointing at that subnet
    • managed egress: useManagedEgress injection + a managedNetworks/default
      child resource carrying isolationMode
  • Derive egress from agentSubnet presence (useManagedEgress = agentSubnet == nil);
    replace the networkMode template param with a useManagedEgress bool.
  • Validate at synthesis time: peSubnet required when network: is declared;
    isolationMode valid only for managed egress; all subnets share one VNet.
  • Preserve ${VAR} placeholders during azd ai agent init --infra eject; resolve
    only at provision time.
  • Use a deterministic subnet ARM id string for networkInjections.subnetArmId
    (an inter-module reference() is unresolved at the CognitiveServices RP
    preflight and what-if does not catch it).
  • Add docs, schema, unit tests, and a live E2E harness under
    cli/azd/extensions/azure.ai.agents/test/e2e/network/.
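
The three synthesis-time validation rules listed above can be sketched as follows. This is a hedged Python illustration, not the synthesizer's Go implementation; `validate_network`, the error wording, and the case-insensitive VNet comparison are assumptions:

```python
def validate_network(network: dict) -> list[str]:
    """Return validation errors for a parsed network: block (illustrative only)."""
    errors = []
    pe = network.get("peSubnet")
    agent = network.get("agentSubnet")

    # Rule 1: ingress is mandatory once network: is declared.
    if pe is None:
        errors.append("peSubnet is required when network: is declared")

    # Rule 2: isolationMode only applies to managed egress (no agentSubnet).
    if agent is not None and "isolationMode" in network:
        errors.append("isolationMode is valid only for managed egress")

    # Rule 3: every declared subnet must live in the same VNet.
    # ARM resource ids are compared case-insensitively here (assumption).
    vnets = {s["vnet"].lower() for s in (pe, agent) if s and "vnet" in s}
    if len(vnets) > 1:
        errors.append("agentSubnet and peSubnet must share one VNet")

    return errors


print(validate_network({"isolationMode": "AllowOnlyApprovedOutbound"}))
```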

Note on the prior approach: an earlier revision re-enabled
publicNetworkAccess in managed mode. That was a security regression — the
service supports a private data plane in managed mode too (a customer private
endpoint with Microsoft-managed egress). Managed mode is now private by
default, consistent with BYO mode.

Test coverage

Three tiers; only the live tier creates resources.

Combination Live-provisioned ARM what-if (no creation) Synthesizer unit test
BYO egress · subnet create · DNS create ✅ (+ deploy/invoke)
Managed egress · create · AllowOnlyApprovedOutbound ✅ (gated RUN_MANAGED_ISO)
Subnet reference (existing subnet) ✅ (BYO + managed)
Managed AllowInternetOutbound
DNS reference (existing zones)
peSubnet omitted while network: declared ✅ (expected error)
endpoint: brownfield + network: ✅ (network ignored)

The full 8-cell BYO/managed × create/reference × own/reference × isolation-mode
matrix is exercised by ARM what-if (template compiles and the CognitiveServices
RP accepts the shape, but nothing is created) and by deterministic synthesizer
unit tests. Two representative cells are live-provisioned end to end, and the
real deploy/invoke data path is validated against the BYO cell. Reference-subnet
mode, AllowInternetOutbound, and referenced-DNS are what-if + unit-test only.
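
One way to read the 8-cell matrix is as the product of three two-valued axes, with isolation mode attached only to the managed-egress cells. The Python sketch below enumerates it under that assumption; the axis labels and the `matrix_cells` helper are illustrative, and the harness may factor the matrix differently:

```python
from itertools import product

EGRESS = ("byo", "managed")
SUBNET = ("create", "reference")
DNS = ("own", "reference")


def matrix_cells(isolation_mode: str = "AllowOnlyApprovedOutbound") -> list[dict]:
    """Enumerate egress x subnet x DNS combinations (illustrative only)."""
    cells = []
    for egress, subnet, dns in product(EGRESS, SUBNET, DNS):
        cell = {"egress": egress, "subnet": subnet, "dns": dns}
        if egress == "managed":
            # isolationMode is only meaningful without an agentSubnet.
            cell["isolationMode"] = isolation_mode
        cells.append(cell)
    return cells


cells = matrix_cells()
print(len(cells))  # 8
```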

E2E validation performed

Scenario 1 — Provision a private-networked Foundry from azure.yaml

  • BYO egress (live): account publicNetworkAccess: Disabled,
    networkAcls.defaultAction: Deny; Foundry private endpoint (account group)
    in peSubnet; agent subnet delegated to Microsoft.App/environments; the
    account networkInjections references the customer agent subnet; the three
    privatelink.* DNS zones created and linked.
  • Managed egress (live, gated): provisioned AllowOnlyApprovedOutbound into a
    dedicated VNet; verified the managedNetworks/default child resource was
    accepted with isolationMode = AllowOnlyApprovedOutbound (the one thing
    what-if cannot confirm, since the V2 managed network is created, not planned).

Live assertions read real resource state (account properties, private
endpoint, subnet delegation, DNS zones, the account's network injection), not
azd's own output variables.
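
The idea of asserting on real resource state can be sketched in Python as below. The harness itself is bash (assert-resources.sh); this sketch assumes the account has been fetched as ARM JSON with its settings under `properties`, and `check_private_account` is a hypothetical helper:

```python
def check_private_account(account: dict) -> list[str]:
    """Return failed assertions for an account's ARM JSON (illustrative only)."""
    props = account.get("properties") or {}
    failures = []

    if props.get("publicNetworkAccess") != "Disabled":
        failures.append("publicNetworkAccess is not Disabled")

    acls = props.get("networkAcls") or {}
    if acls.get("defaultAction") != "Deny":
        failures.append("networkAcls.defaultAction is not Deny")

    return failures


sample = {
    "properties": {
        "publicNetworkAccess": "Disabled",
        "networkAcls": {"defaultAction": "Deny"},
    }
}
print(check_private_account(sample))  # []
```

Checking the resource rather than azd's outputs means a template that reports the right mode but deploys the wrong properties still fails the run.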

Scenario 2 — Eject preserves private-networking config

azd ai agent init --infra ejects equivalent Bicep and preserves ${VAR}
placeholders (e.g. ${AZURE_VNET_ID} resolves at provision time, not eject time).
Running what-if on the ejected template against the already-provisioned
account reports no changes (idempotent).
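
The eject-versus-provision behavior amounts to one switch on placeholder expansion. A minimal Python sketch, assuming `${NAME}` placeholders and a flag mirroring the PreserveVarRefs input named in the commits; the `expand` helper and the error raised for an unset variable are assumptions, not the extension's behavior:

```python
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(value: str, env: dict, preserve_var_refs: bool) -> str:
    """Expand ${VAR} placeholders, or pass them through verbatim (illustrative)."""
    if preserve_var_refs:
        # Eject path: keep the template environment-portable.
        return value

    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"unresolved placeholder ${{{name}}}")
        return env[name]

    # Provision path: resolve from the azd environment.
    return _VAR.sub(_sub, value)


env = {"AZURE_VNET_ID": "/subscriptions/s/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/v"}
print(expand("${AZURE_VNET_ID}", env, preserve_var_refs=True))
print(expand("${AZURE_VNET_ID}", env, preserve_var_refs=False))
```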

Scenario 3 — Deploy and invoke a hosted agent over the private data plane

Because the data plane is private in every mode, deploy/invoke must run with
line-of-sight to the private endpoint. The harness handles this automatically
(lib-jumpbox.sh): it stands up a jumpbox VM (inside the foundry VNet, or a
peered VNet in another region when the account region has no VM capacity) and
exposes it as a local SOCKS5 proxy, so azd deploy/invoke run on the dev
host with data-plane HTTPS tunneled into the VNet.

Validated against the live BYO account (publicNetworkAccess: Disabled):

  • the Foundry project MI granted Container Registry Repository Reader on an
    ABAC-enabled ACR for the BYO image pull
  • azd deploy → hosted agent reached active
  • azd ai agent invoke → returned the expected echo response

A direct deploy/invoke from the public internet fails as expected with
403 Public access is disabled.

Documentation

Known limitations

  • v1 BYO supports a single VNet shared by agentSubnet and peSubnet; cross-VNet
    topologies are deferred (require customer-managed peering + DNS-zone links).
  • One default-DNS account per VNet: a dns-create account links the VNet to the
    three AI privatelink.* zones, and a VNet allows only one link per namespace —
    a second account (or brownfield hub) must use dns: reference mode.

m5i-work added 8 commits June 18, 2026 13:28
Add a declarative network: block to the Foundry service in azure.yaml and
teach the bicep-less synthesizer to provision a VNet-bound (network-secured)
account from it. Additive: an absent block yields today's public account.

- schema: network: surface (mode byo|managed, byo vnet/subnets tri-state,
  managed isolationMode, dns create-or-reference) on microsoft.foundry.json
- synthesizer: decode network:, resolve ${VAR}, validate (mode coherence,
  vnet ARM id, subnet tri-state/CIDR, DNS rg/sub), emit network params +
  NetworkMode for telemetry
- templates: new modules/network.bicep, subnet.bicep, private-endpoint-dns.bicep;
  resources.bicep/main.bicep guard the network path on enableNetworkIsolation
  (publicNetworkAccess Disabled, networkAcls Deny, agent networkInjections,
  account private endpoint + 3 AI DNS zones); main.arm.json regenerated
- provider: pass azd env for ${VAR} resolution, emit provision.network_mode,
  warn that network: is ignored when endpoint: (brownfield) is set
- docs/tests: synthesizer network tests, eject module assertions, extension
  README network section, telemetry-data.md provision.network_mode

The existing on-disk provision flow resolves ${VAR} in main.parameters.json
from the azd environment at provision time. Eject must therefore keep ${VAR}
references verbatim instead of resolving them eagerly from the process env and
freezing a literal into the ejected file.

- synthesis.Input gains PreserveVarRefs; when set, byo.vnet.id and
  dns.subscription pass through verbatim and the format checks that cannot run
  on an unexpanded placeholder are skipped (concrete-but-malformed still fails)
- eject (init --infra) sets PreserveVarRefs so the ejected main.parameters.json
  stays environment-portable; the provision path still resolves and validates
- tests: synthesizer preserve-mode (pass-through + concrete validation) and an
  eject e2e asserting ${AZURE_VNET_ID}/${AZURE_DNS_SUBSCRIPTION_ID} survive

Bash E2E harness validating host: microsoft.foundry private networking,
designed to minimize Azure resource-operation time:

- ONE real network account is provisioned (create+own matrix cell) with a BYO
  --image agent, then deploy + invoke prove the agent works under the VNet.
- Scenario 1 (bicep-less) and the other 3 matrix cells (subnet create/reference
  x DNS own/reference) are verified with 'azd provision --preview' (ARM what-if),
  which creates nothing.
- Scenario 2 (eject) is verified against the live account: eject -> what-if
  reports no changes (idempotent), proving the on-disk template + provision-time
  ${VAR} resolution reproduces the in-memory topology; a manual infra/ edit then
  surfaces as the only delta. Guards the ${VAR}-preservation fix end-to-end.
- A shared BYO VNet (+ reference subnets / external DNS zones) is created once
  and reused across cells.

Files: run-network-e2e.sh (phases 0-6 orchestrator), assert-resources.sh (live
az topology checks: publicNetworkAccess Disabled, account PE groupIds, 3 AI DNS
zones, agent-subnet delegation), lib.sh (logging/assert/azure.yaml mutation),
README.md (cost rationale, prerequisites, cleanup). Westus account region per
requirement; AcrPull granted to the project MI on the ABAC registry.

Decouple the private-networking E2E from the BYO-image init UX (PR 8689) so it
runs against the current branch today:

- Replace 'azd ai agent init --image' with a hand-authored azure.yaml fixture
  (foundry service + network: block + agent image:), created via 'azd env new'.
  image: yields includeAcr=false, matching BYO image, so no ACR at provision.
  Verified the fixture synthesizes: mode=byo, enableNetworkIsolation=true,
  includeAcr=false, ${VAR} resolved.
- Gate phase 5 (deploy + invoke) behind RUN_DEPLOY=true: the headless BYO-image
  deploy needs the AZD_AGENT_SKIP_ACR short-circuit from PR 8689, otherwise
  deploy defaults to build and fails. Phases 0-4 (local gates, shared VNet,
  what-if matrix, one real provision, eject idempotency) validate all the
  networking code without it.
- Fix the ABAC registry role: grant 'Container Registry Repository Reader'
  (ABAC-aware) instead of AcrPull; move the grant into the gated deploy phase.
- Drop the --image preflight; README updated (scenario table, prerequisites,
  RUN_DEPLOY usage, role).

…sing

Two product bugs surfaced by live E2E provisioning (ARM what-if does not catch
either; both require a real deployment):

1. networkInjections preflight failure. The account and the network module
   deploy in the same template, so subnetArmId: network!.outputs.agentSubnetId
   compiled to an unresolved reference() at the CognitiveServices RP preflight,
   which then failed to convert networkInjections to its typed contract
   (InvalidResourceProperties). Build the subnet ARM id as a deterministic
   string from the concrete vnetId param instead, and add an explicit dependsOn
   so the subnet still exists before injection. Recompiled main.arm.json.

2. AZURE_FOUNDRY_NETWORK_MODE missing from canonicalOutputNames. ARM mangles
   output-name casing (AZURE_..._MODE -> azurE_..._MODE); without the canonical
   remapping the env key was stored mis-cased and azd env get-value
   AZURE_FOUNDRY_NETWORK_MODE returned empty. Added it to the restore list and a
   regression case to TestArmOutputsToProto_RepairsMangledKeyCase.
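
Both fixes above are simple string operations. The Python sketch below shows the idea under stated assumptions: `subnet_arm_id` joins a concrete VNet id with a subnet name using the standard ARM child-resource path, and `repair_output_keys` restores canonical casing through a case-insensitive lookup. Both helper names and the mis-cased sample key are illustrative; the actual fixes live in Bicep and Go:

```python
def subnet_arm_id(vnet_id: str, subnet_name: str) -> str:
    """Build a subnet ARM id as a plain string, with no deployment-time reference."""
    return f"{vnet_id.rstrip('/')}/subnets/{subnet_name}"


# Names whose exact casing must survive the round trip through ARM outputs.
CANONICAL_OUTPUT_NAMES = ("AZURE_FOUNDRY_NETWORK_MODE",)


def repair_output_keys(outputs: dict) -> dict:
    """Map mis-cased output keys back to their canonical names (illustrative)."""
    canonical = {name.lower(): name for name in CANONICAL_OUTPUT_NAMES}
    return {canonical.get(key.lower(), key): value for key, value in outputs.items()}


vnet = "/subscriptions/s/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/v"
print(subnet_arm_id(vnet, "agent-subnet"))
print(repair_output_keys({"azure_Foundry_network_mode": "byo"}))
```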

Validated end-to-end: real westus network-isolated Foundry account provisions
green with all topology assertions passing (publicNetworkAccess Disabled,
networkAcls Deny, private endpoint, agent-subnet delegation, 3 AI DNS zones,
network mode byo), across the full subnet create/reference x DNS own/reference
matrix, plus eject idempotency (what-if reports no changes).

Fixes found while running the harness against live Azure (phases 0-4):

- Hand-authored project must include an agent.yaml (kind: hosted + image:)
  alongside azure.yaml; the foundry provider requires an agent definition file.
- setup_project now sets AZURE_RESOURCE_GROUP (the subscription-scoped template
  creates the RG but the provider needs the name) and AZD_AGENT_SKIP_ACR=true
  (BYO-image deploy signal).
- Phase 0 refreshes the dev extension from current source
  (build -> pack -> publish -> install) so the run tests local code, registering
  the provisioning-provider capability + microsoft.foundry provider. Gated by
  SKIP_EXT_REFRESH.
- What-if matrix gates on a successful ARM what-if (exit 0) rather than grepping
  a summary-only preview; this still validates reference-mode subnet/zone
  existence and delegation against the real VNet.
- Idempotent private-dns zone creation (reruns no longer fail on existing zones).
- Add MAX_PHASE to stop early while iterating.
- ACR grant uses the ABAC-aware Container Registry Repository Reader role.
- Fix set -u unbound-variable crash in the phase-4 assert message.
- .gitignore the transient per-run log directories.

Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision +
topology assertions, eject idempotency) pass green. Phase 5 (deploy + invoke)
stays gated behind RUN_DEPLOY=true and needs a reachable BYO agent image.

Update the Foundry private-network E2E harness so phase 5 can build the
~/agents/echo-dual image itself instead of requiring a prebuilt external image.

- Add BUILD_IMAGE=true, ECHO_DUAL_DIR, ACR_NAME/ACR_RG, IMAGE_REPO/IMAGE_TAG.
- Create the target ACR with --role-assignment-mode rbac-abac and reject reuse
  of non-ABAC registries.
- Grant the caller Container Registry Repository Writer before the ACR Task push.
  Resolve the caller object id from the ARM token oid claim to avoid Microsoft
  Graph / Conditional Access failures.
- Build with the required `az acr build --source-acr-auth-id [caller]` form.
- Keep the project MI grant on the ABAC-aware Container Registry Repository
  Reader role for image pull.
- Add TARGET_RG support so investigation runs can keep VNet, DNS, ACR, and the
  real Foundry env in a single RG.
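
Reading the caller object id from the token's oid claim needs no Graph call, because the claim sits in the JWT payload. A minimal Python sketch of that decoding, assuming a standard three-segment JWT; `oid_from_jwt` is a hypothetical helper, and it deliberately skips signature verification because it only inspects the caller's own token:

```python
import base64
import json


def oid_from_jwt(token: str) -> str:
    """Extract the oid claim from a JWT payload without verifying it (illustrative)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["oid"]


# Fabricated token for demonstration only; header and signature are dummies.
_payload = base64.urlsafe_b64encode(
    json.dumps({"oid": "00000000-0000-0000-0000-000000000000"}).encode()
).decode().rstrip("=")
demo_token = f"header.{_payload}.signature"
print(oid_from_jwt(demo_token))
```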

Live validation: the harness created an ABAC ACR, granted caller writer, built
and pushed ~/agents/echo-dual with --source-acr-auth-id [caller], provisioned a
private-networked Foundry account, and granted the project MI Repository Reader.
The subsequent deploy failed from this public runner with the expected private
endpoint 403, which is documented.

Live phase-5 validation showed hosted-agent image pull uses the Foundry project
managed identity, not the parent account identity. Update the network E2E
harness to resolve AZURE_AI_PROJECT_ID via ARM and grant the project MI the
ABAC-aware Container Registry Repository Reader role on the BYO ACR, falling
back to the account MI only for older API shapes.

Also persist AZURE_TENANT_ID in the azd env so postdeploy hooks do not fail on
VM/managed-identity runners after deploy succeeds.

github-actions bot added the ext-agents (azure.ai.agents extension) and ext-foundry (azure.ai.{agents,connections,inspector,projects,routines,skills,toolboxes}, microsoft.foundry) labels Jun 18, 2026

m5i-work added 7 commits June 18, 2026 17:58
Add a concise README cheatsheet for initializing, provisioning, deploying, and
invoking a hosted Foundry agent with a BYO container image under VNet private
networking. Include ACR requirements for ABAC and private-only registries.

Keep the extension README concise by moving the detailed Foundry private
networking schema, requirements, and BYO-image VNet cheatsheet into
`docs/private-networking.md`, with a short README pointer.

Live managed-network provisioning showed that the resources module emitted
AZURE_FOUNDRY_MANAGED_ISOLATION_MODE but the subscription-scoped main template
never forwarded it, so azd env only received AZURE_FOUNDRY_NETWORK_MODE.

Forward the output from main.bicep, add it to the provider canonical output-name
restore list, and cover ARM casing repair with a regression test. Also document
the managed VNet provisioning scenario in the private-networking guide.

Live validation: provisioned network.mode=managed in westus and verified the
account had publicNetworkAccess Disabled, networkAcls Deny, networkInjections
with useMicrosoftManagedNetwork=true, AZURE_FOUNDRY_NETWORK_MODE=managed, and
AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound.

Live managed-network deploy validation showed that managed mode configures the
hosted-agent runtime to use a Microsoft-managed network but does not create a
customer private endpoint for the Foundry data plane. Disabling public access in
that mode makes azd deploy/invoke fail with `403 Public access is disabled`.

Keep public data-plane access enabled for managed mode while preserving BYO
mode behavior (public access disabled + private endpoint). Update the private
networking guide with managed deploy/invoke guidance.

Live validation: provisioned managed mode, converted the test ACR to ABAC,
built the echo-dual image with `az acr build --source-acr-auth-id [caller]`,
granted the Foundry project MI `Container Registry Repository Reader`, deployed
successfully, and invoked the hosted agent successfully.

Realign the azure.yaml `network:` surface to the natural Azure resource
shape and make a network-bound Foundry account private in every mode.

Reverses the prior managed-mode regression that flipped the account's
publicNetworkAccess back to Enabled. Service sample 18 confirms managed
mode supports a private data plane (customer private endpoint + the
Microsoft-managed egress network), so declaring `network:` now always
disables public data-plane access.

Config: flat `network:` block with two orthogonal axes, no `mode` enum.
- peSubnet (required) -> account private endpoint; omitting it while
  `network:` is declared is an error, never a silent public fallback.
- agentSubnet (optional) -> present injects the agent into a customer
  subnet (BYO egress); absent uses the Microsoft-managed network
  (managed egress), where isolationMode becomes valid.

Synthesizer/templates:
- derive egress from agentSubnet presence (useManagedEgress); replace the
  networkMode param with a useManagedEgress bool.
- disablePublicDataPlaneAccess = enableNetworkIsolation (always private).
- add a managedNetworks/default child resource carrying isolationMode for
  managed egress.
- validate peSubnet-required, isolationMode-managed-only, and single-VNet.

Docs/tests/e2e:
- rewrite docs/private-networking.md (host: azure.ai.agent, the value the
  provision provider actually accepts on this branch).
- add synthesizer unit tests + a compiled-ARM regression guard.
- add a live E2E harness (8-cell what-if matrix, BYO + managed-iso real
  provisions, eject idempotency) with an automatic jumpbox SOCKS tunnel so
  deploy/invoke can reach the private data plane; assert real account
  network-injection state rather than azd's echoed output.

…n private-networking cheatsheet

Managed-egress cheatsheet now scaffolds the agent via 'azd ai agent init
--image' (writes agent.yaml) instead of assuming a hand-authored manifest,
matching the BYO cheatsheet.

Replace the env-output 'Expected outputs' block (azd echoing its own
AZURE_FOUNDRY_* classification) with real resource-state validation:
account publicNetworkAccess=Disabled and the managedNetworks/default
isolationMode, with the invoke echo response as the end-to-end proof.

'azd ai agent init --image' scaffolds azure.yaml/agent.yaml, so it must
come before the network: block the reader adds to the generated service.
Matches the actual timing and the BYO cheatsheet ordering.