[Proposal] Spatio-temporally-aware Scheduler #955

@ccnn-pcl

Issues

As artificial intelligence becomes a dominant cloud workload and the cloud native paradigm sees pervasive adoption, computing power is poised to become a ubiquitous utility. However, because they treat jobs as black boxes, existing schedulers fail to adapt to the distinct traffic patterns of periodic jobs, of which distributed training is a typical example. In distributed training, each iteration consists of a local computation phase (e.g., forward pass) and a synchronization phase that generates a large amount of traffic between workers. Because each phase depends on the other, the job exhibits a well-defined, periodic “on-off” communication pattern.

Image

A concrete example of this mismatch is shown in the left subfigure of the figure above, where four distributed training jobs share a link without spatio-temporally aware scheduling and therefore compete for bandwidth. The resulting contention delays flows, and the delayed flows stall the computations that depend on them, which slows down the entire training process and reduces link utilization. Current network-aware schedulers guarantee training performance by allocating bandwidth exclusively (middle subfigure of the figure above). This model strictly constrains the allocatable bandwidth on a link: the allocatable amount is derived solely by subtracting the bandwidth consumed by existing jobs from the link’s predefined total capacity. As a consequence, these approaches reject a job whenever the remaining bandwidth is insufficient.
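The exclusive allocation model described above can be illustrated with a minimal sketch. The function names, the 10 Gbps link, and the 4 Gbps per-job peak are assumptions made for illustration; they are not taken from any existing scheduler.

```python
# Illustrative sketch of exclusive (static) bandwidth admission on one link.
# All names and numbers here are hypothetical.

def allocatable_bandwidth(capacity_gbps, admitted_peaks_gbps):
    """Capacity left after subtracting the bandwidth of every admitted job."""
    return capacity_gbps - sum(admitted_peaks_gbps)


def admit_exclusively(capacity_gbps, requested_peaks_gbps):
    """Admit jobs in arrival order; reject any job whose request does not fit."""
    admitted, rejected = [], []
    for peak in requested_peaks_gbps:
        if peak <= allocatable_bandwidth(capacity_gbps, admitted):
            admitted.append(peak)
        else:
            rejected.append(peak)
    return admitted, rejected


# Four jobs that each burst to 4 Gbps on a 10 Gbps link.
admitted, rejected = admit_exclusively(10, [4, 4, 4, 4])
print(admitted, rejected)  # [4, 4] [4, 4]
```

Under these assumed numbers, two of the four jobs are rejected even though each job only transmits during its synchronization phase.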

Possible solutions

We address this problem with a spatio-temporally aware scheduling mechanism designed to support jobs that exhibit periodic traffic patterns. The mechanism enables time-division multiplexing of network bandwidth (right subfigure of the figure above), which overcomes the limitations of static allocation.
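To make the time-division multiplexing idea concrete, here is a small sketch that models each job's traffic as one period of discrete time slots. The slot-based representation, the helper names, and all numbers are assumptions for illustration, not the data model used by the proposal.

```python
# Hypothetical slot-based model of periodic "on-off" traffic on one link.

def on_off_profile(period_slots, burst_slots, peak_gbps, offset=0):
    """One period of demand: peak_gbps for burst_slots slots starting at offset, else 0."""
    profile = [0] * period_slots
    for i in range(burst_slots):
        profile[(offset + i) % period_slots] = peak_gbps
    return profile


def aggregate(profiles):
    """Per-slot total demand across all jobs sharing the link."""
    return [sum(slot) for slot in zip(*profiles)]


CAPACITY = 10  # assumed link capacity in Gbps

# Four jobs, each bursting at 4 Gbps for 2 of every 8 slots.
aligned = aggregate([on_off_profile(8, 2, 4, offset=0) for _ in range(4)])
staggered = aggregate([on_off_profile(8, 2, 4, offset=2 * j) for j in range(4)])

print(max(aligned), max(staggered))  # 16 4
```

With all bursts aligned, the peak demand of 16 Gbps exceeds the assumed 10 Gbps link. With the bursts staggered, the same four jobs never exceed 4 Gbps in any slot.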

Scopes

The proposal introduces four custom resources (CRs), each defined through a CustomResourceDefinition (CRD). Two of them, AppGroup and NetworkTopology, reuse the design from the network-aware scheduler plugin to manage application dependencies and infrastructure costs. The other two, STBandwidthCapacity and STBandwidthClaim, are new and define the spatio-temporal profiles for node bandwidth capacity and pod bandwidth requirements, respectively.

As an initial design, we plan to implement a multi-point plugin named Metronome (implemented as PeriodicTrafficFit). It hooks into the following extension points to optimize the scheduling of periodic traffic:

  • PreFilter: Performs metadata pre-fetching by retrieving STBandwidthClaim and STBandwidthCapacity specifications. These are cached to minimize redundant API overhead in subsequent phases.
  • Filter: Implements a feasibility check that filters out nodes whose bandwidth capacity cannot support the pod’s required bandwidth magnitude.
  • Score: Executes a contention-minimizing algorithm that simulates various temporal offsets (rotations). It calculates the optimal phase alignment between the new pod and existing periodic workloads to minimize bandwidth peak overlaps and resource contention.
  • NormalizeScore: Acts as a secondary refinement layer. If the bandwidth fitting algorithm in the Score phase yields identical top scores for multiple candidate nodes, this phase applies a network-aware tie-breaking mechanism. It reuses the architectural design and logic of the network-aware plugin, while incorporating significant logic enhancements.
  • Reserve & Unreserve: These phases maintain resource integrity and consistency through CR management.
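The Filter and Score steps listed above can be sketched as follows. This is a simplified illustration, not the Metronome implementation: it assumes the pod and the existing load share the same period, represents demand as per-slot lists, and uses "total demand above capacity" as the contention metric. The function names and the metric are hypothetical.

```python
# Hypothetical sketch of a feasibility check and a rotation search for one node.

def rotate(profile, offset):
    """Delay a periodic profile by offset slots (cyclic shift to the right)."""
    offset %= len(profile)
    return profile[-offset:] + profile[:-offset] if offset else list(profile)


def is_feasible(capacity, claim):
    """Filter-style check: the node can carry the pod's peak demand at all."""
    return max(claim) <= capacity


def best_rotation(capacity, existing_load, claim):
    """Score-style search: try every offset and keep the least contended one.

    Returns (overload, peak, offset). Overload is the demand above capacity
    summed over all slots; ties go to the lower peak, then the smaller offset.
    """
    best = None
    for offset in range(len(claim)):
        shifted = rotate(claim, offset)
        total = [e + c for e, c in zip(existing_load, shifted)]
        overload = sum(max(0, t - capacity) for t in total)
        candidate = (overload, max(total), offset)
        if best is None or candidate < best:
            best = candidate
    return best


existing = [8, 8, 0, 0, 0, 0, 0, 0]  # load already placed on the node
claim = [6, 6, 0, 0, 0, 0, 0, 0]     # new pod's periodic demand

print(is_feasible(10, claim))              # True
print(best_rotation(10, existing, claim))  # (0, 8, 2)
```

In this example the new pod fits without contention once its bursts are delayed by two slots, whereas placing it at offset 0 would push the first two slots to 14 Gbps on an assumed 10 Gbps node.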

Image

Details

For the comprehensive design details, as well as extensive experimental evaluations, please refer to our full paper available on arXiv.

Additionally, the project is open-source and hosted on GitHub.

A detailed description of the plugin implementation can also be found in our Google Doc.

We welcome community feedback and suggestions; please feel free to leave comments in the Google Doc or join the discussion in this GitHub issue.

Contact Team
