
[Multi-SLB] external->internal transition can lose current-LB hint and target sibling LB #10050

@nilo19

Description


What happened:

In multi-SLB mode, after manually flipping a Service from external to internal, the service did not stay on its current LB configuration on the next reconcile.

The service was originally associated with one LB config (for example lb1). After the external->internal flip, the controller correctly reconciled the service onto the internal sibling LB (lb1-internal). But a
later reconcile no longer considered the service to be “currently on lb1” and instead tried to migrate it to another eligible LB that did not even exist yet (for example lb2-internal).

This looks like activeServices state for the current LB config is being removed during the opposite-LB cleanup reconcile.
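If that hypothesis is right, the mechanism is the suffix trim in the multi-SLB status update (shown in the analysis further down): both LB flavors collapse to the same config-level key. A toy mirror of that trim, assuming case-insensitive suffix handling (this is a simplified stand-in, not the real helper):

```go
package main

import (
	"fmt"
	"strings"
)

// trimInternal is a toy stand-in for trimSuffixIgnoreCase with the
// "-internal" suffix; its behavior is assumed from the excerpts below.
func trimInternal(name string) string {
	if strings.HasSuffix(strings.ToLower(name), "-internal") {
		return name[:len(name)-len("-internal")]
	}
	return name
}

func main() {
	// Both flavors map to the same ActiveServices bucket key, so a
	// wantLb=false update for "lb1" also mutates state that
	// "lb1-internal" relies on.
	fmt.Println(trimInternal("lb1"))          // lb1
	fmt.Println(trimInternal("lb1-Internal")) // lb1
}
```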

What you expected to happen:

After flipping a service from external to internal in multi-SLB mode, the service should stay on the same LB configuration (lb1 -> lb1-internal) unless placement rules actually require it to move.

A follow-up reconcile should keep the service on its current LB configuration, not treat it as unplaced and attempt migration to another eligible LB.

How to reproduce it (as minimally and precisely as possible):

  1. Enable multi-SLB with at least two eligible LB configs, for example lb1 and lb2.
  2. Create an external Service of type LoadBalancer that lands on lb1.
  3. Manually flip the service to internal by setting service.beta.kubernetes.io/azure-load-balancer-internal: "true".
  4. Let the controller reconcile once so the service is moved onto lb1-internal.
  5. Trigger another reconciliation for the same service.
  6. Observe that the controller may now try to move the service to another eligible LB, especially one whose internal sibling resource does not exist yet (for example lb2-internal), instead of keeping the service on lb1-internal.

Anything else we need to know?:

I traced the failure path to ActiveServices bookkeeping in pkg/provider/azure_loadbalancer.go.

The problem appears to be:

  • reconcileService() first reconciles the desired service, then always runs a cleanup reconcile for the flipped service to remove the opposite LB flavor:

```go
updateService := updateServiceLoadBalancerIPs(service, lbIPsPrimaryPIPs)
flippedService := flipServiceInternalAnnotation(updateService)
if _, _, err := az.reconcileLoadBalancer(ctx, clusterName, flippedService, nil, false /* wantLb */); err != nil {
    logger.Error(err, "Failed to reconcile flipped LoadBalancer")
    return nil, err
}
```
  • During that cleanup reconcile, if the old frontend IP config is removed, fipChanged becomes true and the controller updates multi-SLB status with wantLb=false:

```go
if fipChanged {
    az.reconcileMultipleStandardLoadBalancerConfigurationStatus(wantLb, serviceName, lbName)
}
```

    The frontend IP config removal itself happens in this wantLb=false cleanup path:

```go
if !wantLb {
    for i := len(newConfigs) - 1; i >= 0; i-- {
        config := newConfigs[i]
        isServiceOwnsFrontendIP, _, _ := az.serviceOwnsFrontendIP(ctx, config, service)
        if isServiceOwnsFrontendIP {
            unsafe, err := az.isFrontendIPConfigUnsafeToDelete(lb, service, config.ID)
            if err != nil {
                return nil, toDeleteConfigs, false, err
            }
            // If the frontend IP configuration is not being referenced by:
            // 1. loadBalancing rules of other services with different ports;
            // 2. outbound rules;
            // 3. inbound NAT rules;
            // 4. inbound NAT pools,
            // do the deletion, or skip it.
            if !unsafe {
                var configNameToBeDeleted string
                if newConfigs[i].Name != nil {
                    configNameToBeDeleted = *newConfigs[i].Name
                    logger.V(2).Info("dropping", "service", serviceName, "wantLb", wantLb, "configNameToBeDeleted", configNameToBeDeleted)
                } else {
                    logger.V(2).Info("nil name", "service", serviceName, "wantLb", wantLb)
                }
                toDeleteConfigs = append(toDeleteConfigs, newConfigs[i])
                newConfigs = append(newConfigs[:i], newConfigs[i+1:]...)
                // ...
```
  • reconcileMultipleStandardLoadBalancerConfigurationStatus() trims the -internal suffix before updating ActiveServices, so external and internal sibling LBs share the same config-level status bucket:

```go
func (az *Cloud) reconcileMultipleStandardLoadBalancerConfigurationStatus(wantLb bool, svcName, lbName string) {
    logger := log.Background().WithName("reconcileMultipleStandardLoadBalancerConfigurationStatus")
    lbName = trimSuffixIgnoreCase(lbName, consts.InternalLoadBalancerNameSuffix)
    for i := range az.MultipleStandardLoadBalancerConfigurations {
        if strings.EqualFold(lbName, az.MultipleStandardLoadBalancerConfigurations[i].Name) {
            az.multipleStandardLoadBalancersActiveServicesLock.Lock()
            if wantLb {
                logger.V(4).Info("service is active on lb", "service", svcName, "lb", lbName)
                az.MultipleStandardLoadBalancerConfigurations[i].ActiveServices = utilsets.SafeInsert(az.MultipleStandardLoadBalancerConfigurations[i].ActiveServices, svcName)
            } else {
                logger.V(4).Info("service is not active on lb any more", "service", svcName, "lb", lbName)
                az.MultipleStandardLoadBalancerConfigurations[i].ActiveServices.Delete(svcName)
            }
            // ...
```
  • That means cleanup of the old external LB can delete the same ActiveServices["lb1"] entry that is needed to remember the service is now active on lb1-internal.
  • On the next reconcile, current LB selection relies only on that config-level ActiveServices state:

```go
currentLBName := az.getServiceCurrentLoadBalancerName(service)
lbNamePrefix = getMostEligibleLBForService(currentLBName, eligibleLBs, existingLBs, requiresInternalLoadBalancer(service))
```

```go
func (az *Cloud) getServiceCurrentLoadBalancerName(service *v1.Service) string {
    for _, multiSLBConfig := range az.MultipleStandardLoadBalancerConfigurations {
        if az.isLoadBalancerInUseByService(service, multiSLBConfig) {
            return multiSLBConfig.Name
        }
    }
    return ""
}
```
  • Once the current LB name is empty, getMostEligibleLBForService() explicitly prefers the first eligible LB that does not exist yet:

```go
func getMostEligibleLBForService(
    currentLBName string,
    eligibleLBs []string,
    existingLBs []*armnetwork.LoadBalancer,
    isInternal bool,
) string {
    logger := log.Background().WithName("getMostEligibleLBForService")

    // 1. If the LB is eligible and being used, choose it.
    if StringInSlice(currentLBName, eligibleLBs) {
        logger.V(4).Info("choose LB as it is eligible and being used", "currentLBName", currentLBName)
        return currentLBName
    }

    // 2. If the LB is eligible and not created yet, choose it because it has the fewest rules.
    for _, eligibleLB := range eligibleLBs {
        var found bool
        for i := range existingLBs {
            existingLB := existingLBs[i]
            if strings.EqualFold(trimSuffixIgnoreCase(ptr.Deref(existingLB.Name, ""), consts.InternalLoadBalancerNameSuffix), eligibleLB) &&
                isInternalLoadBalancer(existingLB) == isInternal {
                found = true
                break
            }
        }
        if !found {
            logger.V(4).Info("choose LB as it is eligible and not existing", "eligibleLB", eligibleLB)
            return eligibleLB
        }
    }
    // ...
```
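To make the failure sequence concrete, here is a self-contained sketch. The helpers are simplified mirrors of the excerpts above (trimInternal, markActive, currentLB, and mostEligibleLB are toy names, and the isInternal flavor check is folded into the existing-name list), not the real provider code:

```go
package main

import (
	"fmt"
	"strings"
)

// trimInternal is a simplified stand-in for
// trimSuffixIgnoreCase(lbName, consts.InternalLoadBalancerNameSuffix).
func trimInternal(name string) string {
	if strings.HasSuffix(strings.ToLower(name), "-internal") {
		return name[:len(name)-len("-internal")]
	}
	return name
}

// active maps config name -> set of active services, mirroring the
// config-level ActiveServices bookkeeping.
var active = map[string]map[string]bool{"lb1": {}, "lb2": {}}

// markActive mirrors reconcileMultipleStandardLoadBalancerConfigurationStatus:
// the trim makes external and internal sibling LBs share one bucket.
func markActive(wantLb bool, svc, lbName string) {
	set, ok := active[trimInternal(lbName)]
	if !ok {
		return
	}
	if wantLb {
		set[svc] = true
	} else {
		delete(set, svc)
	}
}

// currentLB mirrors getServiceCurrentLoadBalancerName.
func currentLB(svc string) string {
	for _, cfg := range []string{"lb1", "lb2"} {
		if active[cfg][svc] {
			return cfg
		}
	}
	return ""
}

// mostEligibleLB mirrors steps 1-2 of getMostEligibleLBForService.
func mostEligibleLB(current string, eligible, existing []string) string {
	for _, lb := range eligible {
		if lb == current {
			return current // eligible and in use
		}
	}
	for _, lb := range eligible {
		found := false
		for _, ex := range existing {
			if strings.EqualFold(trimInternal(ex), lb) {
				found = true
				break
			}
		}
		if !found {
			return lb // eligible and not created yet
		}
	}
	return ""
}

func main() {
	svc := "default/svc"
	markActive(true, svc, "lb1")          // external service lands on lb1
	markActive(true, svc, "lb1-internal") // flip: internal reconcile, wantLb=true
	markActive(false, svc, "lb1")         // cleanup of external flavor, wantLb=false
	// The shared bucket lost the entry, so the service looks unplaced
	// and the picker falls through to the not-yet-existing lb2.
	fmt.Printf("currentLB=%q\n", currentLB(svc))
	fmt.Println("next pick:", mostEligibleLB(currentLB(svc), []string{"lb1", "lb2"}, []string{"lb1-internal"}))
}
```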

So the issue is not just the later selection logic. The underlying bug is that opposite-LB cleanup removes the config-level ActiveServices entry even when the service is still active on the sibling LB of the same
config.

This also looks symmetric for internal->external flips.
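One possible direction for a fix (an assumption on my part, not a proposed patch, and the names below are illustrative rather than the provider's API): track activity per exact LB name instead of per trimmed config name, and answer the config-level question by checking both flavors, so cleanup of one flavor cannot erase the sibling's claim:

```go
package main

import "fmt"

// Hypothetical alternative bookkeeping: one bucket per exact LB name.
var activeByLB = map[string]map[string]bool{}

func markActive(wantLb bool, svc, lbName string) {
	if activeByLB[lbName] == nil {
		activeByLB[lbName] = map[string]bool{}
	}
	if wantLb {
		activeByLB[lbName][svc] = true
	} else {
		delete(activeByLB[lbName], svc)
	}
}

// activeOnConfig reports whether svc is active on either flavor of cfg.
func activeOnConfig(svc, cfg string) bool {
	return activeByLB[cfg][svc] || activeByLB[cfg+"-internal"][svc]
}

func main() {
	svc := "default/svc"
	markActive(true, svc, "lb1")            // external placement
	markActive(true, svc, "lb1-internal")   // flip to internal
	markActive(false, svc, "lb1")           // cleanup of the external flavor
	fmt.Println(activeOnConfig(svc, "lb1")) // true: sibling keeps the config active
}
```

An equivalent minimal change might be to skip the ActiveServices deletion during the opposite-flavor cleanup reconcile when the service is still active on the sibling LB of the same config.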

I also ran nearby provider tests locally:
go test -run 'TestGetMostEligibleLBName|TestGetServiceLoadBalancerMultiSLB|TestEnsureLoadBalancerDeleted' ./pkg/provider
Those passed, which suggests this path is missing regression coverage rather than already being covered by a failing test.

Environment:

  • Kubernetes version (use kubectl version): <fill in>
  • Cloud provider or hardware configuration: Azure, Standard LoadBalancer SKU, multi-SLB enabled
  • OS (e.g: cat /etc/os-release): <fill in>
  • Kernel (e.g. uname -a): <fill in>
  • Install tools: <fill in>
  • Network plugin and version (if this is a network-related bug): <fill in>

Labels: kind/bug