WIP: NRT: cache: fix assumedResource drift #958

Draft
ffromani wants to merge 4 commits into kubernetes-sigs:master from ffromani:assumed-res-unbounded

Conversation

@ffromani
Contributor

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

Prevent entries in the cache's assumedResources from accumulating indefinitely, and remove an interaction between Resync and Reserve that can likewise cause assumedResources to be initialized and grow, and to drift without ever being resynced correctly.

Which issue(s) this PR fixes:

Fixes #N/A

Special notes for your reviewer:

TBD

Does this PR introduce a user-facing change?

Fix noderesourcetopology cache drift caused by TOCTOU race with its resync loop

Previously, when a node was marked dirty, a successful Reserve()
would mark it clean again. The reasoning was that if a node gets
selected, and thus Reserve() is called for it, then the suspected
overallocation was evidently wrong and the node could run more workload.

The problem with this approach is that it compounds badly
with Resync(). Resync() is a long, complex operation which needs to
perform many steps and call multiple APIs (usually served by the
client-go cache, but APIs nonetheless). For that reason, Resync() only
holds the OverReserve cache lock for short intervals, not continuously:
holding it for the whole cycle would be both dangerous and critically slow.

It's possible that Reserve() runs in the middle of a Resync() cycle,
because Resync() is, and must be, asynchronous with respect to the
scheduling cycles. So Reserve() can clear a node's dirty flag and
corrupt the internal state. If that happens, unflushed nodes become
effectively invisible to future resync cycles, and their
assumedResources become stuck (never flushed).

The safest fix is to remove the dirty-node-clearing shortcut in
Reserve(). This causes a node to stay dirty longer, but makes the
scheduler plugin operate more safely and predictably.

With this change, a node may stay dirty for one extra resync cycle,
but it ensures the dirty flag and the stale data are always resolved together.

Signed-off-by: Francesco Romani <fromani@redhat.com>
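The race and the fix described above can be sketched with a minimal dirty-set cache. The types and method names below (overReserve, MarkDirty, and so on) are illustrative stand-ins, not the plugin's actual code; the point is only that Reserve no longer touches the dirty set, so a node marked dirty stays visible to Resync even if it gets reserved mid-cycle:

```go
package main

import (
	"fmt"
	"sync"
)

// overReserve is a hypothetical, minimal model of the dirty-node
// bookkeeping: a set of nodes whose cached NRT data is suspect.
type overReserve struct {
	mu    sync.Mutex
	dirty map[string]bool
}

func newOverReserve() *overReserve {
	return &overReserve{dirty: map[string]bool{}}
}

// MarkDirty flags a node for the next resync cycle.
func (c *overReserve) MarkDirty(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dirty[node] = true
}

// Reserve records an assumed placement. Crucially, it no longer
// clears the dirty flag: only Resync, which also flushes the stale
// data, may do that. Clearing it here could race with an in-flight
// Resync that already snapshotted the dirty set.
func (c *overReserve) Reserve(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// track assumed resources for node (elided in this sketch)
	_ = node
}

// Resync flushes the nodes that were dirty when the cycle started.
// The lock is held only to snapshot and later to clear, never across
// the (potentially slow) API work in between.
func (c *overReserve) Resync() []string {
	c.mu.Lock()
	snapshot := make([]string, 0, len(c.dirty))
	for node := range c.dirty {
		snapshot = append(snapshot, node)
	}
	c.mu.Unlock()

	// ... slow work: fetch fresh NRT objects, rebuild state ...

	c.mu.Lock()
	for _, node := range snapshot {
		delete(c.dirty, node)
	}
	c.mu.Unlock()
	return snapshot
}

func main() {
	c := newOverReserve()
	c.MarkDirty("node-1")
	c.Reserve("node-1") // does NOT clear the dirty flag anymore
	fmt.Println(c.Resync())
}
```

With the old shortcut, the Reserve call between snapshot and clear could have wiped the flag, leaving node-1 dirty-in-fact but never flushed.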
ReserveNodeResources() was unconditionally tracking assumedResources,
regardless of whether a node actually had NRT data.
This caused `assumedResources` to grow unbounded in these cases.

The reason the data is never cleared is that all the paths that
mark a node for resync (NodeMaybeOverReserved, NodeHasForeignPods,
nodesWithAttrUpdate) require the NRT object to already be in the store;
otherwise the assumed resources are never flushed.

Nodes are not expected to lack cached NRT data, but this can
happen in practice, and the filter code already accounts for
this case and always allows such nodes to be used.

Furthermore, if NRT objects are *created* after the scheduler starts,
which is not recommended but can happen, there is a time window in
which nodes don't have corresponding NRT data.

The fix is simple: exit early in ReserveNodeResources() when
processing a node that doesn't have NRT data.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
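The early-exit guard can be sketched as follows. The nrtCache type and its fields are hypothetical stand-ins for the plugin's actual structures; the guard simply refuses to track a node that has no NRT object to reconcile against, since no resync path would ever flush that entry:

```go
package main

import "fmt"

// nrtCache is an illustrative stand-in: a set of nodes with cached
// NRT data, and a per-node count of assumed reservations.
type nrtCache struct {
	nrtObjects       map[string]bool
	assumedResources map[string]int
}

// ReserveNodeResources tracks assumed resources for a node, but only
// if the node has NRT data in the store. Without NRT data there is
// nothing to resync against, so the entry would leak forever.
func (c *nrtCache) ReserveNodeResources(node string) {
	if !c.nrtObjects[node] {
		return // no NRT data: exit early instead of leaking an entry
	}
	c.assumedResources[node]++
}

func main() {
	c := &nrtCache{
		nrtObjects:       map[string]bool{"node-1": true},
		assumedResources: map[string]int{},
	}
	c.ReserveNodeResources("node-1") // tracked: NRT data is present
	c.ReserveNodeResources("node-2") // ignored: no NRT data yet
	fmt.Println(len(c.assumedResources)) // prints 1
}
```

This matches the filter-side behavior described above: nodes without NRT data are still schedulable, they just aren't tracked by the cache.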
Add thread-safe test helpers. Don't expose, or allow unguarded
access to, internal data. Even though we're *probably* safe in
the controlled test environment, it's not worth the intrinsic risk.

Signed-off-by: Francesco Romani <fromani@redhat.com>
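A thread-safe test helper of the kind described above might look like this. The cache type and the helper name are illustrative, not the PR's actual code: the idea is that tests get a copy of the internal map taken under the lock, rather than a reference to the live, unguarded data:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in holding the internal map tests
// want to inspect.
type cache struct {
	mu               sync.Mutex
	assumedResources map[string]int
}

// AssumedResourcesSnapshot is a test-only accessor: it copies the map
// under the lock, so callers can never observe (or cause) a
// concurrent mutation of the internal state.
func (c *cache) AssumedResourcesSnapshot() map[string]int {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]int, len(c.assumedResources))
	for k, v := range c.assumedResources {
		out[k] = v
	}
	return out
}

func main() {
	c := &cache{assumedResources: map[string]int{"node-1": 2}}
	snap := c.AssumedResourcesSnapshot()
	snap["node-1"] = 99                                 // mutating the copy...
	fmt.Println(c.AssumedResourcesSnapshot()["node-1"]) // ...leaves the cache untouched: prints 2
}
```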
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 17, 2026
@netlify

netlify bot commented Apr 17, 2026

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Name Link
🔨 Latest commit 7474d68
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-scheduler-plugins/deploys/69e24c3f2a3e2100083aba0f

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 17, 2026

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
kind/bug: Categorizes issue or PR as related to a bug.
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

2 participants