
Upgrade stuck when upgrading from 0.20.0 to 0.24.0 with OCI based components #966

@a13x5


What steps did you take and what happened:

Repro steps
  1. Use at least two providers, core and control-plane, both installed with OCI-based components (i.e. .spec.fetchConfig.oci).
  2. Upgrade the controller from version 0.20.0 to 0.24.0.

Note

The issue does not reproduce reliably. Sometimes the upgrade succeeds in exactly the same environment.

Actual result

The upgrade gets stuck indefinitely with the following condition in the CoreProvider status:

status:
  conditions:
  - lastTransitionTime: "2025-12-16T09:30:20Z"
    message: ""
    reason: MinimumReplicasAvailable
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-12-19T15:54:08Z"
    message: All preflight checks passed
    observedGeneration: 4
    reason: PreflightChecksPassed
    status: "True"
    type: PreflightCheckPassed
  - lastTransitionTime: "2025-12-19T15:54:15Z"
    message: Provider installed successfully
    observedGeneration: 3
    reason: ProviderInstalled
    status: "True"
    type: ProviderInstalled
  - lastTransitionTime: "2025-12-18T15:36:25Z"
    message: config map not found
    observedGeneration: 4
    reason: ComponentsUpgradeError
    status: "False"
    type: ProviderUpgraded

The same error can be found in the logs:

E1219 16:27:06.703347       1 controller.go:353] "Reconciler error" err="config map not found" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="kcm-system/cluster-api" namespace="kcm-system" name="cluster-api" reconcileID="8c4c156a-5454-4298-89ec-7a5464c4c4ab"

However, both ConfigMaps (the old and the new one) are present in the cluster:

core-cluster-api-v1.10.3                                     2      91d
core-cluster-api-v1.11.2                                     2      2d20h
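
For triage, the status above can be summarized mechanically. The following is a minimal sketch, not part of the operator: it takes the conditions exactly as reported and lists which ones are failing and which were observed at an older generation than the newest one in the list. The helper names `failing` and `lagging` are hypothetical, and whether a lagging `observedGeneration` is related to this bug is not established by this report.

```python
# Conditions copied from the CoreProvider status in this report.
CONDITIONS = [
    {"type": "Ready", "status": "True", "reason": "MinimumReplicasAvailable"},
    {"type": "PreflightCheckPassed", "status": "True",
     "reason": "PreflightChecksPassed", "observedGeneration": 4},
    {"type": "ProviderInstalled", "status": "True",
     "reason": "ProviderInstalled", "observedGeneration": 3},
    {"type": "ProviderUpgraded", "status": "False",
     "reason": "ComponentsUpgradeError", "observedGeneration": 4,
     "message": "config map not found"},
]


def failing(conditions):
    """Return the types of conditions whose status is not "True"."""
    return [c["type"] for c in conditions if c["status"] != "True"]


def lagging(conditions):
    """Return the types of conditions observed at an older generation
    than the newest generation that appears in the list."""
    gens = [c["observedGeneration"] for c in conditions
            if "observedGeneration" in c]
    if not gens:
        return []
    newest = max(gens)
    return [c["type"] for c in conditions
            if "observedGeneration" in c and c["observedGeneration"] < newest]


if __name__ == "__main__":
    print("failing:", failing(CONDITIONS))
    print("lagging:", lagging(CONDITIONS))
```

On the reported status this flags ProviderUpgraded as failing and ProviderInstalled as lagging one generation behind.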

What did you expect to happen:

The upgrade succeeds without any issues.

Anything else you would like to add:

Additional investigation

Apart from the issue above, I noticed an additional error in the logs:

E0112 12:23:37.826685       1 controller.go:353] "Reconciler error" err="collected artifact needs to provide components as control-plane-k0sproject-k0smotron-v1.8.0-components.yaml or control-plane-components.yaml or components.yaml file" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="kcm-system/k0sproject-k0smotron" namespace="kcm-system" name="k0sproject-k0smotron" reconcileID="8f60fc6c-aad8-41af-a594-32fd55a289a7"

Upon further investigation, the cause is the rename of the control plane components file (ref: #858): the old version expects controlplane-components.yaml in the OCI registry, while the new one expects control-plane-components.yaml. This makes the components incompatible across versions, specifically 0.20.0 and 0.24.0.
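
The file name mismatch can be illustrated with a small sketch. The candidate pattern below is inferred only from the error message above (three names, most specific first). It is not taken from the operator source, and applying the same pattern to the old `controlplane` prefix is an assumption. The function names are hypothetical.

```python
def candidate_file_names(type_prefix, provider, version):
    """Candidate components file names, most specific first.

    The pattern is inferred from the operator's error message and is an
    assumption, not a copy of the operator's lookup logic.
    """
    return [
        f"{type_prefix}-{provider}-{version}-components.yaml",
        f"{type_prefix}-components.yaml",
        "components.yaml",
    ]


def find_components(artifact_files, type_prefix, provider, version):
    """Return the first candidate name present in the artifact, or None."""
    present = set(artifact_files)
    for name in candidate_file_names(type_prefix, provider, version):
        if name in present:
            return name
    return None


if __name__ == "__main__":
    # An artifact published for the old naming scheme.
    artifact = ["controlplane-components.yaml"]
    for prefix in ("controlplane", "control-plane"):
        found = find_components(artifact, prefix, "k0sproject-k0smotron", "v1.8.0")
        print(prefix, "->", found)
```

With an artifact that only contains controlplane-components.yaml, the lookup succeeds for the old prefix and returns nothing for the new one, which matches the "collected artifact needs to provide components as ..." error.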

It also looks like the new version tries to reconcile the old components and can't find the ConfigMap, since the ConfigMap name depends on the file name in the OCI repository. For example, we have controlplane-k0sproject-k0smotron-v1.6.0, but control-plane-k0sproject-k0smotron-v1.6.0 is expected.
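
A minimal sketch of why the lookup fails, assuming the ConfigMap name has the form `<prefix>-<provider>-<version>`. That form is read off the ConfigMap names listed in this report, not taken from the operator source, and `config_map_name` and `lookup` are hypothetical helpers.

```python
def config_map_name(type_prefix, provider, version):
    """Build a ConfigMap name as <prefix>-<provider>-<version> (assumed form)."""
    return f"{type_prefix}-{provider}-{version}"


def lookup(existing, type_prefix, provider, version):
    """Return the expected name if it exists, otherwise raise LookupError."""
    name = config_map_name(type_prefix, provider, version)
    if name not in existing:
        raise LookupError(f"config map not found: {name}")
    return name


if __name__ == "__main__":
    # ConfigMap created by the old version, as seen in the cluster.
    existing = {"controlplane-k0sproject-k0smotron-v1.6.0"}
    try:
        lookup(existing, "control-plane", "k0sproject-k0smotron", "v1.6.0")
    except LookupError as exc:
        print(exc)
```

The ConfigMap exists, but under the old prefix, so a lookup that uses the new prefix cannot see it.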

To fix the issue, the control plane components were uploaded under both names (control-plane-components.yaml and controlplane-components.yaml). After that, the CAPI operator could fetch all of them, which resulted in the following ConfigMaps:

control-plane-k0sproject-k0smotron-v1.10.0                   2      88s
control-plane-k0sproject-k0smotron-v1.6.0                    2      114s
controlplane-k0sproject-k0smotron-v1.6.0                     2      3h55m
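
The workaround can be prepared locally before publishing. This is a sketch under stated assumptions: it only stages one components file under both names in a directory, and leaves the actual push to the OCI registry to whatever tool is already in use. `stage_both_names` and the paths passed to it are hypothetical.

```python
import shutil
from pathlib import Path

# The two file names discussed in this report.
OLD_NAME = "controlplane-components.yaml"
NEW_NAME = "control-plane-components.yaml"


def stage_both_names(source, staging_dir):
    """Copy one components file into staging_dir under both names.

    Returns the list of staged paths, old name first.
    """
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for name in (OLD_NAME, NEW_NAME):
        target = staging / name
        shutil.copyfile(source, target)
        staged.append(target)
    return staged
```

Publishing the contents of the staging directory as the artifact then gives both operator versions a file name they recognize.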

Summary

The workaround for the issue is simple, but what I find most concerning is that conditions are probably getting mixed up between different providers somehow: config map not found makes sense in the scope of the control plane provider, since the ConfigMap with the new naming doesn't actually exist, yet it is reported in the CoreProvider status, where both ConfigMaps exist.
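
One way to check which controller each error is attributed to is to group the "Reconciler error" log lines by their `controller` field. The sketch below parses the `key="value"` pairs with a regular expression and assumes the values contain no escaped quotes, which holds for the two lines quoted in this report. `parse_fields` and `errors_by_controller` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Matches key="value" pairs; assumes values contain no escaped quotes.
KV = re.compile(r'(\w+)="([^"]*)"')

# The two log lines quoted in this report.
LOG_LINES = [
    'E1219 16:27:06.703347       1 controller.go:353] "Reconciler error" '
    'err="config map not found" controller="coreprovider" '
    'controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" '
    'CoreProvider="kcm-system/cluster-api" namespace="kcm-system" '
    'name="cluster-api" reconcileID="8c4c156a-5454-4298-89ec-7a5464c4c4ab"',
    'E0112 12:23:37.826685       1 controller.go:353] "Reconciler error" '
    'err="collected artifact needs to provide components as '
    'control-plane-k0sproject-k0smotron-v1.8.0-components.yaml or '
    'control-plane-components.yaml or components.yaml file" '
    'controller="controlplaneprovider" '
    'controllerGroup="operator.cluster.x-k8s.io" '
    'controllerKind="ControlPlaneProvider" '
    'ControlPlaneProvider="kcm-system/k0sproject-k0smotron" '
    'namespace="kcm-system" name="k0sproject-k0smotron" '
    'reconcileID="8f60fc6c-aad8-41af-a594-32fd55a289a7"',
]


def parse_fields(line):
    """Extract the key="value" pairs of one structured log line."""
    return dict(KV.findall(line))


def errors_by_controller(lines):
    """Group error messages by the controller that logged them."""
    grouped = defaultdict(list)
    for line in lines:
        fields = parse_fields(line)
        if "err" in fields and "controller" in fields:
            grouped[fields["controller"]].append(fields["err"])
    return dict(grouped)


if __name__ == "__main__":
    for controller, errors in errors_by_controller(LOG_LINES).items():
        print(controller, "->", errors)
```

For the two lines above, config map not found is logged by the coreprovider controller, while the artifact error is logged by controlplaneprovider.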

An additional observation is that phases may not run sequentially: I see installation errors before the components are even fetched (and the ConfigMaps created), literally seconds after the start of a reconcile, which is not something I would expect.
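
To make the expectation concrete, here is a minimal sketch of strictly sequential phases, where a later phase never starts before the earlier one has succeeded. It illustrates the behavior I would expect, not how the operator is implemented. The phase names `fetch` and `install` and the `run_phases` helper are hypothetical.

```python
def run_phases(phases):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns (names of completed phases, error string or None).
    """
    completed = []
    for name, phase in phases:
        try:
            phase()
        except Exception as exc:
            # Later phases are not attempted after a failure.
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    def fetch():
        raise RuntimeError("components not available yet")

    def install():
        print("installing")  # never reached while fetch fails

    print(run_phases([("fetch", fetch), ("install", install)]))
```

Under this model an installation error can only appear after the fetch phase has completed, which is not what the logs show.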

It is probably connected to the recent phases refactoring (#826), but I'm not familiar enough with the intention or the code base to claim anything. Evaluation from maintainers is needed.

Environment:

  • Cluster-api-operator version: 0.20.0 and 0.24.0
  • Cluster-api version: 1.10.3 and 1.11.2
  • Minikube/KIND version: N/A
  • Kubernetes version: (use kubectl version): 1.32
  • OS (e.g. from /etc/os-release): Rocky 9 and Ubuntu 24.04

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api-operator/labels?q=area for the list of labels]

Labels: kind/bug, needs-triage
