What happened:
In multi-SLB mode, after manually flipping a Service from external to internal, the service did not stay on its current LB configuration on the next reconcile.
The service was originally associated with one LB config (for example `lb1`). After the external->internal flip, the controller correctly reconciled the service onto the internal sibling LB (`lb1-internal`). But a later reconcile no longer considered the service to be "currently on `lb1`" and instead tried to migrate it to another eligible LB that did not even exist yet (for example `lb2-internal`).
This looks like `ActiveServices` state for the current LB config being removed during the opposite-LB cleanup reconcile.
What you expected to happen:
After flipping a service from external to internal in multi-SLB mode, the service should stay on the same LB configuration (`lb1` -> `lb1-internal`) unless placement rules actually require it to move.
A follow-up reconcile should keep the service on its current LB configuration, not treat it as unplaced and attempt migration to another eligible LB.
How to reproduce it (as minimally and precisely as possible):
- Enable multi-SLB with at least two eligible LB configs, for example `lb1` and `lb2`.
- Create an external `type: LoadBalancer` service that lands on `lb1`.
- Manually flip the service to internal by setting `service.beta.kubernetes.io/azure-load-balancer-internal: "true"`.
- Let the controller reconcile once so the service is moved onto `lb1-internal`.
- Trigger another reconciliation for the same service.
- Observe that the controller may now try to move the service to another eligible LB, especially one whose internal sibling resource does not exist yet (for example `lb2-internal`), instead of keeping the service on `lb1-internal`.
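For completeness, a minimal Service manifest matching the repro steps above (the name, selector, and ports are placeholders; the annotation is the one flipped in step 3):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-svc                # placeholder name
  annotations:
    # flipped from absent/"false" to "true" after the service landed on lb1
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: demo                   # placeholder selector
  ports:
    - port: 80
      targetPort: 8080
```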
Anything else we need to know?:
I traced the failure path to `ActiveServices` bookkeeping in `pkg/provider/azure_loadbalancer.go`.
The problem appears to be:
- `reconcileService()` first reconciles the desired service, then always runs a cleanup reconcile for the flipped service to remove the opposite LB flavor (`pkg/provider/azure_loadbalancer.go`, lines 174 to 179 at `4b5dcc8`):

```go
updateService := updateServiceLoadBalancerIPs(service, lbIPsPrimaryPIPs)

flippedService := flipServiceInternalAnnotation(updateService)
if _, _, err := az.reconcileLoadBalancer(ctx, clusterName, flippedService, nil, false /* wantLb */); err != nil {
    logger.Error(err, "Failed to reconcile flipped LoadBalancer")
    return nil, err
}
```
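For readers unfamiliar with the flip helper, here is a minimal, hypothetical sketch of what the flip amounts to; `flipInternal` and its map-based signature are simplifications for illustration, not the real `flipServiceInternalAnnotation`:

```go
package main

import "fmt"

const internalAnnotation = "service.beta.kubernetes.io/azure-load-balancer-internal"

// flipInternal inverts the internal annotation on a copy of the service's
// annotations, so that the follow-up cleanup reconcile (wantLb=false)
// targets the opposite LB flavor. Simplified stand-in, not the real helper.
func flipInternal(annotations map[string]string) map[string]string {
    out := make(map[string]string, len(annotations))
    for k, v := range annotations {
        out[k] = v
    }
    if out[internalAnnotation] == "true" {
        out[internalAnnotation] = "false"
    } else {
        out[internalAnnotation] = "true"
    }
    return out
}

func main() {
    svc := map[string]string{internalAnnotation: "true"}
    // The cleanup reconcile sees the external flavor of the service.
    fmt.Println(flipInternal(svc)[internalAnnotation]) // false
}
```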
- During that cleanup reconcile, if the old frontend IP config is removed, `fipChanged` becomes true and the controller updates multi-SLB status with `wantLb=false` (`pkg/provider/azure_loadbalancer.go`, lines 2077 to 2078 at `4b5dcc8`):

```go
if fipChanged {
    az.reconcileMultipleStandardLoadBalancerConfigurationStatus(wantLb, serviceName, lbName)
```

(`pkg/provider/azure_loadbalancer.go`, lines 2582 to 2608 at `4b5dcc8`)

```go
if !wantLb {
    for i := len(newConfigs) - 1; i >= 0; i-- {
        config := newConfigs[i]
        isServiceOwnsFrontendIP, _, _ := az.serviceOwnsFrontendIP(ctx, config, service)
        if isServiceOwnsFrontendIP {
            unsafe, err := az.isFrontendIPConfigUnsafeToDelete(lb, service, config.ID)
            if err != nil {
                return nil, toDeleteConfigs, false, err
            }

            // If the frontend IP configuration is not being referenced by:
            // 1. loadBalancing rules of other services with different ports;
            // 2. outbound rules;
            // 3. inbound NAT rules;
            // 4. inbound NAT pools,
            // do the deletion, or skip it.
            if !unsafe {
                var configNameToBeDeleted string
                if newConfigs[i].Name != nil {
                    configNameToBeDeleted = *newConfigs[i].Name
                    logger.V(2).Info("dropping", "service", serviceName, "wantLb", wantLb, "configNameToBeDeleted", configNameToBeDeleted)
                } else {
                    logger.V(2).Info("nil name", "service", serviceName, "wantLb", wantLb)
                }

                toDeleteConfigs = append(toDeleteConfigs, newConfigs[i])
                newConfigs = append(newConfigs[:i], newConfigs[i+1:]...)
```
- `reconcileMultipleStandardLoadBalancerConfigurationStatus()` trims the `-internal` suffix before updating `ActiveServices`, so external and internal sibling LBs share the same config-level status bucket (`pkg/provider/azure_loadbalancer.go`, lines 2446 to 2459 at `4b5dcc8`):

```go
func (az *Cloud) reconcileMultipleStandardLoadBalancerConfigurationStatus(wantLb bool, svcName, lbName string) {
    logger := log.Background().WithName("reconcileMultipleStandardLoadBalancerConfigurationStatus")
    lbName = trimSuffixIgnoreCase(lbName, consts.InternalLoadBalancerNameSuffix)
    for i := range az.MultipleStandardLoadBalancerConfigurations {
        if strings.EqualFold(lbName, az.MultipleStandardLoadBalancerConfigurations[i].Name) {
            az.multipleStandardLoadBalancersActiveServicesLock.Lock()

            if wantLb {
                logger.V(4).Info("service is active on lb", "service", svcName, "lb", lbName)
                az.MultipleStandardLoadBalancerConfigurations[i].ActiveServices = utilsets.SafeInsert(az.MultipleStandardLoadBalancerConfigurations[i].ActiveServices, svcName)
            } else {
                logger.V(4).Info("service is not active on lb any more", "service", svcName, "lb", lbName)
                az.MultipleStandardLoadBalancerConfigurations[i].ActiveServices.Delete(svcName)
            }
```
- That means cleanup of the old external LB can delete the same `ActiveServices["lb1"]` entry that is needed to remember the service is now active on `lb1-internal`.
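A small standalone sketch of the shared-bucket behavior, using a simplified `trimInternalSuffix` and a plain map in place of the provider's `trimSuffixIgnoreCase` and `ActiveServices` set (names here are illustrative, not the real API):

```go
package main

import (
    "fmt"
    "strings"
)

// trimInternalSuffix is a simplified stand-in for
// trimSuffixIgnoreCase(lbName, consts.InternalLoadBalancerNameSuffix):
// it maps an internal LB name back to its config-level name.
func trimInternalSuffix(lbName string) string {
    const suffix = "-internal"
    if strings.HasSuffix(strings.ToLower(lbName), suffix) {
        return lbName[:len(lbName)-len(suffix)]
    }
    return lbName
}

func main() {
    // ActiveServices modeled as config name -> set of service keys.
    activeServices := map[string]map[string]bool{}

    record := func(lbName, svc string, wantLb bool) {
        key := trimInternalSuffix(lbName)
        if wantLb {
            if activeServices[key] == nil {
                activeServices[key] = map[string]bool{}
            }
            activeServices[key][svc] = true
        } else {
            delete(activeServices[key], svc)
        }
    }

    // Flip reconcile: the service lands on lb1-internal (wantLb=true)...
    record("lb1-internal", "ns/svc", true)
    // ...then the opposite-flavor cleanup on lb1 (wantLb=false) deletes the
    // very same entry, because both names trim to the "lb1" bucket.
    record("lb1", "ns/svc", false)

    fmt.Println(activeServices["lb1"]["ns/svc"]) // false: the service is forgotten
}
```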
- On the next reconcile, current LB selection relies only on that config-level `ActiveServices` state (`pkg/provider/azure_loadbalancer.go`, lines 4115 to 4116 at `4b5dcc8`):

```go
currentLBName := az.getServiceCurrentLoadBalancerName(service)
lbNamePrefix = getMostEligibleLBForService(currentLBName, eligibleLBs, existingLBs, requiresInternalLoadBalancer(service))
```

(`pkg/provider/azure_loadbalancer.go`, lines 4180 to 4186 at `4b5dcc8`)

```go
func (az *Cloud) getServiceCurrentLoadBalancerName(service *v1.Service) string {
    for _, multiSLBConfig := range az.MultipleStandardLoadBalancerConfigurations {
        if az.isLoadBalancerInUseByService(service, multiSLBConfig) {
            return multiSLBConfig.Name
        }
    }
    return ""
}
```
- Once the current LB name is empty, `getMostEligibleLBForService()` explicitly prefers the first eligible LB that does not exist yet (`pkg/provider/azure_loadbalancer.go`, lines 4125 to 4152 at `4b5dcc8`):

```go
func getMostEligibleLBForService(
    currentLBName string,
    eligibleLBs []string,
    existingLBs []*armnetwork.LoadBalancer,
    isInternal bool,
) string {
    logger := log.Background().WithName("getMostEligibleLBForService")
    // 1. If the LB is eligible and being used, choose it.
    if StringInSlice(currentLBName, eligibleLBs) {
        logger.V(4).Info("choose LB as it is eligible and being used", "currentLBName", currentLBName)
        return currentLBName
    }

    // 2. If the LB is eligible and not created yet, choose it because it has the fewest rules.
    for _, eligibleLB := range eligibleLBs {
        var found bool
        for i := range existingLBs {
            existingLB := (existingLBs)[i]
            if strings.EqualFold(trimSuffixIgnoreCase(ptr.Deref(existingLB.Name, ""), consts.InternalLoadBalancerNameSuffix), eligibleLB) &&
                isInternalLoadBalancer(existingLB) == isInternal {
                found = true
                break
            }
        }

        if !found {
            logger.V(4).Info("choose LB as it is eligible and not existing", "eligibleLB", eligibleLB)
            return eligibleLB
```
So the issue is not just the later selection logic. The underlying bug is that opposite-LB cleanup removes the config-level `ActiveServices` entry even when the service is still active on the sibling LB of the same config.
This also looks symmetric for internal->external flips.
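To make the selection consequence concrete, here is a simplified, standalone model of steps 1 and 2 of `getMostEligibleLBForService`; the internal-flavor matching, suffix trimming, and rule-count tie-breaks of the real function are deliberately omitted:

```go
package main

import (
    "fmt"
    "strings"
)

// mostEligibleLB models only steps 1 and 2 of the real function:
// keep the current LB if it is eligible, otherwise prefer the first
// eligible LB that does not exist yet.
func mostEligibleLB(currentLBName string, eligibleLBs, existingLBs []string) string {
    // 1. If the LB is eligible and being used, choose it.
    for _, lb := range eligibleLBs {
        if strings.EqualFold(lb, currentLBName) {
            return currentLBName
        }
    }
    // 2. Prefer the first eligible LB that is not created yet.
    for _, eligible := range eligibleLBs {
        exists := false
        for _, existing := range existingLBs {
            if strings.EqualFold(existing, eligible) {
                exists = true
                break
            }
        }
        if !exists {
            return eligible
        }
    }
    return ""
}

func main() {
    eligible := []string{"lb1", "lb2"}
    existing := []string{"lb1"} // lb2 was never created

    // With intact bookkeeping the service stays put:
    fmt.Println(mostEligibleLB("lb1", eligible, existing)) // lb1

    // Once ActiveServices is clobbered, currentLBName comes back empty and
    // the service is steered to the not-yet-existing lb2:
    fmt.Println(mostEligibleLB("", eligible, existing)) // lb2
}
```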
I also ran nearby provider tests locally:

```shell
go test -run 'TestGetMostEligibleLBName|TestGetServiceLoadBalancerMultiSLB|TestEnsureLoadBalancerDeleted' ./pkg/provider
```

Those passed, which suggests this path is missing regression coverage rather than already being covered by a failing test.
Environment:
- Kubernetes version (use `kubectl version`): <fill in>
- Cloud provider or hardware configuration: Azure, Standard LoadBalancer SKU, multi-SLB enabled
- OS (e.g: `cat /etc/os-release`): <fill in>
- Kernel (e.g. `uname -a`): <fill in>
- Install tools: <fill in>
- Network plugin and version (if this is a network-related bug): <fill in>