Skip to content

fix: allow capacity http errors to requeue#1326

Merged
britaniar merged 12 commits into
Azure:mainfrom
britaniar:fixCapacityErrors
Jun 26, 2026
Merged

fix: allow capacity http errors to requeue#1326
britaniar merged 12 commits into
Azure:mainfrom
britaniar:fixCapacityErrors

Conversation

@britaniar

@britaniar britaniar commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Description of your changes

When a querying the vm size recommender client, there may be times the client can run into transient client errors that will resolve on retry. At the moment the scheduler only runs once and sees all errors as unexpected. Now the possibility of transient http errors is possible with the addition of using azure property checker as we query an azure client. Therefore, these should not be considered unexpected, and the scheduler should be able to requeue on these errors.

Fixes #

I have:

  • Updated the scheduler to requeue on http errors.

  • Simplified http error logging to be one line for easier debugging.

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

Simplified logging:

## BEFORE
E0612 02:18:53.167602 1 compute/vmsizerecommenderclient.go:122] "Failed to generate VM size re...
request failed with status 503: POST http://127.0.0.1:5859/compute/subscriptions/a5191316-d253-43e3...

## AFTER
E0623 12:38:21.725116  158733 vmsizerecommenderclient.go:121] "Failed to generate VM size recommendations" err="request failed with status 503: POST http://127.0.0.1:35911/subscriptions/sub-123/providers/Microsoft.Compute/locations/eastus/vmSizeRecommendations/vmAttributeBased/generate" subscriptionID="sub-123" location="eastus" clientRequestID="c68fbda3-fc22-400f-abd0-e15b4f7af56f" latencyMs=2

With this change the scheduler would be pending until the errors return other errors/status codes that we don't deem transient while it retries.

status:
  conditions:
  - lastTransitionTime: "2026-06-24T22:49:33Z"
    message: Scheduling has not completed
    observedGeneration: 4
    reason: SchedulePending
    status: Unknown
    type: ClusterResourcePlacementScheduled

Open Question: Do we want to keep the status as not complete/pending?

  • I think we should update the status for these specific scenarios (http errors from trying to query the capacity) so that the user has more of an idea of what is happening without having to sift through the logs.

Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
…detect transient HTTP error

Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
@britaniar britaniar marked this pull request as ready for review June 24, 2026 23:58
Comment thread logs.txt Outdated
Comment thread pkg/scheduler/framework/framework.go Outdated
@michaelawyu

Copy link
Copy Markdown
Contributor

For the open question, I did a brief check on the flow and it might require some changes to do so as the loop would end as it is right now given the error before it gets to update the (scheduling policy snapshot) status (i.e., if the plugins fail, the scheduling cycle ends right away with an error). We either have to update the status as part of the error handling logic, or refactor the flow so that status update would always happen despite errors at the end of each scheduling cycle. An alternative would be to emit an event, which is slightly easier to do.

I agree 100% that it is better UX to let user know about the errors of course -> Britania, if you don't mind me asking, which way do you think works best?

britaniar and others added 3 commits June 25, 2026 10:17
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
@britaniar

Copy link
Copy Markdown
Contributor Author

I think at the moment emitting an event is good. I will add that. If we decide to update the status, we can do that at a later time probably as we get more usage.

Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
michaelawyu
michaelawyu previously approved these changes Jun 25, 2026

@michaelawyu michaelawyu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a naming nit; otherwise LGTM

Comment thread pkg/utils/errors/errors.go Outdated
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>

@michaelawyu michaelawyu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Britania 🙏

@britaniar britaniar merged commit 1b60c9b into Azure:main Jun 26, 2026
23 of 25 checks passed
@britaniar britaniar deleted the fixCapacityErrors branch June 26, 2026 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants