Skip to content

Commit 3144d81

Browse files
htejungregkh
authored andcommitted
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream. Creation of a kthread goes through a couple interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator then can further configure the kthread and then let it start doing its job by waking it up. In this configuration-by-creator stage, the creator is the only one that can wake it up but the kthread is visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups as that can affect the CPU affinity and many other things. Unfortunately, the cgroup side of protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation. This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrected CPU affinity breaking concurrency management and sometimes stalling workqueue execution. This patch adds task->no_cgroup_migration which disallows the task to be migrated by userland. kthreadd starts with the flag set making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup. It'd be better to wait for the initialization instead of failing but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from waiting side. Even if userland depends on changing cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything. v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-debugged-by: Chris Mason <clm@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1 parent a80c068 commit 3144d81

4 files changed

Lines changed: 33 additions & 4 deletions

File tree

include/linux/cgroup.h

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -528,6 +528,25 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
528528
pr_cont_kernfs_path(cgrp->kn);
529529
}
530530

531+
static inline void cgroup_init_kthreadd(void)
532+
{
533+
/*
534+
* kthreadd is inherited by all kthreads, keep it in the root so
535+
* that the new kthreads are guaranteed to stay in the root until
536+
* initialization is finished.
537+
*/
538+
current->no_cgroup_migration = 1;
539+
}
540+
541+
static inline void cgroup_kthread_ready(void)
542+
{
543+
/*
544+
* This kthread finished initialization. The creator should have
545+
* set PF_NO_SETAFFINITY if this kthread should stay in the root.
546+
*/
547+
current->no_cgroup_migration = 0;
548+
}
549+
531550
#else /* !CONFIG_CGROUPS */
532551

533552
struct cgroup_subsys_state;
@@ -551,6 +570,8 @@ static inline void cgroup_free(struct task_struct *p) {}
551570

552571
static inline int cgroup_init_early(void) { return 0; }
553572
static inline int cgroup_init(void) { return 0; }
573+
static inline void cgroup_init_kthreadd(void) {}
574+
static inline void cgroup_kthread_ready(void) {}
554575

555576
#endif /* !CONFIG_CGROUPS */
556577

include/linux/sched.h

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1475,6 +1475,10 @@ struct task_struct {
14751475
#ifdef CONFIG_COMPAT_BRK
14761476
unsigned brk_randomized:1;
14771477
#endif
1478+
#ifdef CONFIG_CGROUPS
1479+
/* disallow userland-initiated cgroup migration */
1480+
unsigned no_cgroup_migration:1;
1481+
#endif
14781482

14791483
unsigned long atomic_flags; /* Flags needing atomic access. */
14801484

kernel/cgroup.c

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2752,11 +2752,12 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
27522752
tsk = tsk->group_leader;
27532753

27542754
/*
2755-
* Workqueue threads may acquire PF_NO_SETAFFINITY and become
2756-
* trapped in a cpuset, or RT worker may be born in a cgroup
2757-
* with no rt_runtime allocated. Just say no.
2755+
* kthreads may acquire PF_NO_SETAFFINITY during initialization.
2756+
* If userland migrates such a kthread to a non-root cgroup, it can
2757+
* become trapped in a cpuset, or RT kthread may be born in a
2758+
* cgroup with no rt_runtime allocated. Just say no.
27582759
*/
2759-
if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
2760+
if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
27602761
ret = -EINVAL;
27612762
goto out_unlock_rcu;
27622763
}

kernel/kthread.c

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
#include <linux/freezer.h>
1919
#include <linux/ptrace.h>
2020
#include <linux/uaccess.h>
21+
#include <linux/cgroup.h>
2122
#include <trace/events/sched.h>
2223

2324
static DEFINE_SPINLOCK(kthread_create_lock);
@@ -205,6 +206,7 @@ static int kthread(void *_create)
205206
ret = -EINTR;
206207

207208
if (!test_bit(KTHREAD_SHOULD_STOP, &self.flags)) {
209+
cgroup_kthread_ready();
208210
__kthread_parkme(&self);
209211
ret = threadfn(data);
210212
}
@@ -510,6 +512,7 @@ int kthreadd(void *unused)
510512
set_mems_allowed(node_states[N_MEMORY]);
511513

512514
current->flags |= PF_NOFREEZE;
515+
cgroup_init_kthreadd();
513516

514517
for (;;) {
515518
set_current_state(TASK_INTERRUPTIBLE);

0 commit comments

Comments
 (0)