Commit cb1831a

rostedt authored and gregkh committed
sched/rt: Simplify the IPI based RT balancing logic
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this, and if
so it will request to pull one of those tasks to itself if the non running
RT task is of higher priority than the new priority of the next task to run
on the current CPU.

When dealing with large numbers of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only one CPU having overloaded
RT tasks while a bunch of other CPUs were lowering their priority. To solve
this issue, commit:

  b6366f0 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of RT
tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. Since all CPUs with overloaded tasks must be scanned
regardless of whether one or many CPUs are lowering their priority (there is
currently no way to find the CPU with the highest priority task that can
schedule to one of these CPUs), there really only needs to be one IPI being
sent around at a time. This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work  - the irq work to process each CPU set in rto_mask
  rto_lock       - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock;
                   the first CPU scheduling in a lower priority task
                   is the one to kick off the process
  rto_loop_next  - an atomic that gets incremented for each CPU that
                   schedules in a lower priority task
  rto_loop       - a variable protected by rto_lock that is used to
                   compare against rto_loop_next
  rto_cpu        - the CPU to send the next IPI to, also protected by
                   the rto_lock

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it, it increments rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that
are queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU. Then it takes the rto_lock and looks for the
next CPU in the rto_mask. If it finds one, it simply sends an IPI to that
CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.

This change removes the duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
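[Editor's note] As a quick illustration of the loop-start protocol described
above, here is a minimal userspace model using C11 atomics. It is a sketch,
not the kernel code: the real helpers operate on the kernel's atomic_t and
appear in the kernel/sched/rt.c diff below, and the printf stands in for
queueing the irq work on the first overloaded CPU.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int rto_loop_start;       /* 0 = no IPI loop in flight */
static atomic_int rto_loop_next;        /* bumped on every priority lowering */

static bool rto_start_trylock(atomic_int *v)
{
        int zero = 0;
        /* Succeeds only for the caller that moves 0 -> 1 */
        return atomic_compare_exchange_strong_explicit(v, &zero, 1,
                                                       memory_order_acquire,
                                                       memory_order_relaxed);
}

static void rto_start_unlock(atomic_int *v)
{
        atomic_store_explicit(v, 0, memory_order_release);
}

/* Model of tell_cpu_to_push(): called when a CPU lowers its priority */
static void tell_cpu_to_push(int this_cpu)
{
        /* Always record the event; a running iterator will rescan */
        atomic_fetch_add(&rto_loop_next, 1);

        /* Only one CPU kicks off the loop; the rest just return */
        if (!rto_start_trylock(&rto_loop_start))
                return;

        printf("CPU %d starts the IPI loop (loop_next=%d)\n",
               this_cpu, atomic_load(&rto_loop_next));

        rto_start_unlock(&rto_loop_start);
}

int main(void)
{
        tell_cpu_to_push(3);    /* first caller starts the loop */
        tell_cpu_to_push(5);    /* sequential here, so it starts one too;
                                   racing against a live loop it would only
                                   bump loop_next and return */
        return 0;
}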
1 parent 5a11b84 commit cb1831a

3 files changed, 138 additions and 127 deletions

kernel/sched/core.c

Lines changed: 6 additions & 0 deletions
@@ -5907,6 +5907,12 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+#ifdef HAVE_RT_PUSH_IPI
+	rd->rto_cpu = -1;
+	raw_spin_lock_init(&rd->rto_lock);
+	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+#endif
+
 	init_dl_bw(&rd->dl_bw);
 	if (cpudl_init(&rd->cpudl) != 0)
 		goto free_dlo_mask;

kernel/sched/rt.c

Lines changed: 115 additions & 120 deletions
@@ -64,10 +64,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
 void init_rt_rq(struct rt_rq *rt_rq)
 {
 	struct rt_prio_array *array;
@@ -87,13 +83,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
 	rt_rq->rt_nr_migratory = 0;
 	rt_rq->overloaded = 0;
 	plist_head_init(&rt_rq->pushable_tasks);
-
-#ifdef HAVE_RT_PUSH_IPI
-	rt_rq->push_flags = 0;
-	rt_rq->push_cpu = nr_cpu_ids;
-	raw_spin_lock_init(&rt_rq->push_lock);
-	init_irq_work(&rt_rq->push_work, push_irq_work_func);
-#endif
 #endif /* CONFIG_SMP */
 	/* We start is dequeued state, because no RT tasks are queued */
 	rt_rq->rt_queued = 0;
@@ -1802,160 +1791,166 @@ static void push_rt_tasks(struct rq *rq)
 }
 
 #ifdef HAVE_RT_PUSH_IPI
+
 /*
- * The search for the next cpu always starts at rq->cpu and ends
- * when we reach rq->cpu again. It will never return rq->cpu.
- * This returns the next cpu to check, or nr_cpu_ids if the loop
- * is complete.
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * All CPUs with overloaded RT tasks need to be notified as there is currently
+ * no way to know which of these CPUs have the highest priority task waiting
+ * to run. Instead of trying to take a spinlock on each of these CPUs,
+ * which has shown to cause large latency when done on machines with many
+ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
+ * RT tasks waiting to run.
+ *
+ * Just sending an IPI to each of the CPUs is also an issue, as on large
+ * count CPU machines, this can cause an IPI storm on a CPU, especially
+ * if its the only CPU with multiple RT tasks queued, and a large number
+ * of CPUs scheduling a lower priority task at the same time.
+ *
+ * Each root domain has its own irq work function that can iterate over
+ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
+ * tassk must be checked if there's one or many CPUs that are lowering
+ * their priority, there's a single irq work iterator that will try to
+ * push off RT tasks that are waiting to run.
+ *
+ * When a CPU schedules a lower priority task, it will kick off the
+ * irq work iterator that will jump to each CPU with overloaded RT tasks.
+ * As it only takes the first CPU that schedules a lower priority task
+ * to start the process, the rto_start variable is incremented and if
+ * the atomic result is one, then that CPU will try to take the rto_lock.
+ * This prevents high contention on the lock as the process handles all
+ * CPUs scheduling lower priority tasks.
+ *
+ * All CPUs that are scheduling a lower priority task will increment the
+ * rt_loop_next variable. This will make sure that the irq work iterator
+ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
+ * priority task, even if the iterator is in the middle of a scan. Incrementing
+ * the rt_loop_next will cause the iterator to perform another scan.
  *
- * rq->rt.push_cpu holds the last cpu returned by this function,
- * or if this is the first instance, it must hold rq->cpu.
  */
 static int rto_next_cpu(struct rq *rq)
 {
-	int prev_cpu = rq->rt.push_cpu;
+	struct root_domain *rd = rq->rd;
+	int next;
 	int cpu;
 
-	cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
-
 	/*
-	 * If the previous cpu is less than the rq's CPU, then it already
-	 * passed the end of the mask, and has started from the beginning.
-	 * We end if the next CPU is greater or equal to rq's CPU.
+	 * When starting the IPI RT pushing, the rto_cpu is set to -1,
+	 * rt_next_cpu() will simply return the first CPU found in
+	 * the rto_mask.
+	 *
+	 * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
+	 * will return the next CPU found in the rto_mask.
+	 *
+	 * If there are no more CPUs left in the rto_mask, then a check is made
+	 * against rto_loop and rto_loop_next. rto_loop is only updated with
+	 * the rto_lock held, but any CPU may increment the rto_loop_next
+	 * without any locking.
 	 */
-	if (prev_cpu < rq->cpu) {
-		if (cpu >= rq->cpu)
-			return nr_cpu_ids;
+	for (;;) {
 
-	} else if (cpu >= nr_cpu_ids) {
-		/*
-		 * We passed the end of the mask, start at the beginning.
-		 * If the result is greater or equal to the rq's CPU, then
-		 * the loop is finished.
-		 */
-		cpu = cpumask_first(rq->rd->rto_mask);
-		if (cpu >= rq->cpu)
-			return nr_cpu_ids;
-	}
-	rq->rt.push_cpu = cpu;
+		/* When rto_cpu is -1 this acts like cpumask_first() */
+		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
 
-	/* Return cpu to let the caller know if the loop is finished or not */
-	return cpu;
-}
+		rd->rto_cpu = cpu;
 
-static int find_next_push_cpu(struct rq *rq)
-{
-	struct rq *next_rq;
-	int cpu;
+		if (cpu < nr_cpu_ids)
+			return cpu;
 
-	while (1) {
-		cpu = rto_next_cpu(rq);
-		if (cpu >= nr_cpu_ids)
-			break;
-		next_rq = cpu_rq(cpu);
+		rd->rto_cpu = -1;
 
-		/* Make sure the next rq can push to this rq */
-		if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+		/*
+		 * ACQUIRE ensures we see the @rto_mask changes
+		 * made prior to the @next value observed.
+		 *
+		 * Matches WMB in rt_set_overload().
+		 */
+		next = atomic_read_acquire(&rd->rto_loop_next);
+
+		if (rd->rto_loop == next)
 			break;
+
+		rd->rto_loop = next;
 	}
 
-	return cpu;
+	return -1;
+}
+
+static inline bool rto_start_trylock(atomic_t *v)
+{
+	return !atomic_cmpxchg_acquire(v, 0, 1);
 }
 
-#define RT_PUSH_IPI_EXECUTING		1
-#define RT_PUSH_IPI_RESTART		2
+static inline void rto_start_unlock(atomic_t *v)
+{
+	atomic_set_release(v, 0);
+}
 
 static void tell_cpu_to_push(struct rq *rq)
 {
-	int cpu;
+	int cpu = -1;
 
-	if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
-		raw_spin_lock(&rq->rt.push_lock);
-		/* Make sure it's still executing */
-		if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
-			/*
-			 * Tell the IPI to restart the loop as things have
-			 * changed since it started.
-			 */
-			rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
-			raw_spin_unlock(&rq->rt.push_lock);
-			return;
-		}
-		raw_spin_unlock(&rq->rt.push_lock);
-	}
+	/* Keep the loop going if the IPI is currently active */
+	atomic_inc(&rq->rd->rto_loop_next);
 
-	/* When here, there's no IPI going around */
-
-	rq->rt.push_cpu = rq->cpu;
-	cpu = find_next_push_cpu(rq);
-	if (cpu >= nr_cpu_ids)
+	/* Only one CPU can initiate a loop at a time */
+	if (!rto_start_trylock(&rq->rd->rto_loop_start))
 		return;
 
-	rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+	raw_spin_lock(&rq->rd->rto_lock);
+
+	/*
+	 * The rto_cpu is updated under the lock, if it has a valid cpu
+	 * then the IPI is still running and will continue due to the
+	 * update to loop_next, and nothing needs to be done here.
+	 * Otherwise it is finishing up and an ipi needs to be sent.
+	 */
+	if (rq->rd->rto_cpu < 0)
+		cpu = rto_next_cpu(rq);
+
+	raw_spin_unlock(&rq->rd->rto_lock);
 
-	irq_work_queue_on(&rq->rt.push_work, cpu);
+	rto_start_unlock(&rq->rd->rto_loop_start);
+
+	if (cpu >= 0)
+		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
 }
 
 /* Called from hardirq context */
-static void try_to_push_tasks(void *arg)
+void rto_push_irq_work_func(struct irq_work *work)
 {
-	struct rt_rq *rt_rq = arg;
-	struct rq *rq, *src_rq;
-	int this_cpu;
+	struct rq *rq;
 	int cpu;
 
-	this_cpu = rt_rq->push_cpu;
+	rq = this_rq();
 
-	/* Paranoid check */
-	BUG_ON(this_cpu != smp_processor_id());
-
-	rq = cpu_rq(this_cpu);
-	src_rq = rq_of_rt_rq(rt_rq);
-
-again:
+	/*
+	 * We do not need to grab the lock to check for has_pushable_tasks.
+	 * When it gets updated, a check is made if a push is possible.
	 */
 	if (has_pushable_tasks(rq)) {
 		raw_spin_lock(&rq->lock);
-		push_rt_task(rq);
+		push_rt_tasks(rq);
 		raw_spin_unlock(&rq->lock);
 	}
 
-	/* Pass the IPI to the next rt overloaded queue */
-	raw_spin_lock(&rt_rq->push_lock);
-	/*
-	 * If the source queue changed since the IPI went out,
-	 * we need to restart the search from that CPU again.
-	 */
-	if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
-		rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
-		rt_rq->push_cpu = src_rq->cpu;
-	}
+	raw_spin_lock(&rq->rd->rto_lock);
 
-	cpu = find_next_push_cpu(src_rq);
+	/* Pass the IPI to the next rt overloaded queue */
+	cpu = rto_next_cpu(rq);
 
-	if (cpu >= nr_cpu_ids)
-		rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
-	raw_spin_unlock(&rt_rq->push_lock);
+	raw_spin_unlock(&rq->rd->rto_lock);
 
-	if (cpu >= nr_cpu_ids)
+	if (cpu < 0)
 		return;
 
-	/*
-	 * It is possible that a restart caused this CPU to be
-	 * chosen again. Don't bother with an IPI, just see if we
-	 * have more to push.
-	 */
-	if (unlikely(cpu == rq->cpu))
-		goto again;
-
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rt_rq->push_work, cpu);
-}
-
-static void push_irq_work_func(struct irq_work *work)
-{
-	struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
-
-	try_to_push_tasks(rt_rq);
+	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */

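[Editor's note] To make the termination rule of rto_next_cpu() concrete,
here is a hedged userspace model of the iterator above: it walks a toy
8-bit CPU mask and, on wrap-around, stops only if rto_loop_next has not
moved since the last pass. NR_CPUS, mask_next() and the bitmask are
illustrative stand-ins, not kernel API; the rto_lock that normally protects
rto_cpu and rto_loop is elided.

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 8

static unsigned int rto_mask;           /* stand-in for rd->rto_mask */
static int rto_cpu = -1;                /* stand-in for rd->rto_cpu */
static int rto_loop;                    /* only touched under rto_lock */
static atomic_int rto_loop_next;        /* bumped without the lock */

/* Next set bit after 'prev', or NR_CPUS if none (like cpumask_next()) */
static int mask_next(int prev, unsigned int mask)
{
        for (int cpu = prev + 1; cpu < NR_CPUS; cpu++)
                if (mask & (1u << cpu))
                        return cpu;
        return NR_CPUS;
}

/* Caller is assumed to hold rto_lock, as in the kernel */
static int rto_next_cpu(void)
{
        for (;;) {
                /* rto_cpu == -1 makes this act like "first set bit" */
                int cpu = mask_next(rto_cpu, rto_mask);

                rto_cpu = cpu;
                if (cpu < NR_CPUS)
                        return cpu;     /* the kernel sends the IPI here */

                rto_cpu = -1;

                /* Wrapped around: stop only if nothing changed meanwhile */
                int next = atomic_load_explicit(&rto_loop_next,
                                                memory_order_acquire);
                if (rto_loop == next)
                        break;
                rto_loop = next;        /* an event came in: rescan */
        }
        return -1;
}

int main(void)
{
        rto_mask = 0x14;                 /* CPUs 2 and 4 are overloaded */
        atomic_store(&rto_loop_next, 1); /* one pending lowering event */

        /* Prints 2, 4, rescans once (loop != loop_next), prints 2, 4 */
        for (int cpu; (cpu = rto_next_cpu()) >= 0; )
                printf("IPI -> CPU %d\n", cpu);
        return 0;
}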
kernel/sched/sched.h

Lines changed: 17 additions & 7 deletions
@@ -429,7 +429,7 @@ static inline int rt_bandwidth_enabled(void)
 }
 
 /* RT IPI pull logic requires IRQ_WORK */
-#ifdef CONFIG_IRQ_WORK
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
 # define HAVE_RT_PUSH_IPI
 #endif
 
@@ -450,12 +450,6 @@ struct rt_rq {
 	unsigned long rt_nr_total;
 	int overloaded;
 	struct plist_head pushable_tasks;
-#ifdef HAVE_RT_PUSH_IPI
-	int push_flags;
-	int push_cpu;
-	struct irq_work push_work;
-	raw_spinlock_t push_lock;
-#endif
 #endif /* CONFIG_SMP */
 	int rt_queued;
 
@@ -537,6 +531,19 @@ struct root_domain {
 	struct dl_bw dl_bw;
 	struct cpudl cpudl;
 
+#ifdef HAVE_RT_PUSH_IPI
+	/*
+	 * For IPI pull requests, loop across the rto_mask.
+	 */
+	struct irq_work rto_push_work;
+	raw_spinlock_t rto_lock;
+	/* These are only updated and read within rto_lock */
+	int rto_loop;
+	int rto_cpu;
+	/* These atomics are updated outside of a lock */
+	atomic_t rto_loop_next;
+	atomic_t rto_loop_start;
+#endif
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
@@ -547,6 +554,9 @@ struct root_domain {
 
 extern struct root_domain def_root_domain;
 
+#ifdef HAVE_RT_PUSH_IPI
+extern void rto_push_irq_work_func(struct irq_work *work);
+#endif
 #endif /* CONFIG_SMP */
 
 /*
