Commit Graph

1320 Commits

Author SHA1 Message Date
Balbir Singh
a37a538e82 sched/core: Fix a race between try_to_wake_up() and a woken up task
commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.

The origin of the issue I've seen is related to
a missing memory barrier between the check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before CPU1 can get the wait_lock and set
its current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue. CPU1 is spinning on wait_lock, but CPU3
got the lock immediately after CPU2.

CPU3 checks the state of p on CPU1: it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly, although p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but since p->on_rq is changed under the runqueue lock, rq_lock, the
p->on_rq check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote(), in which case the p->on_rq change is not
done under the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads, with available memory tweaked to be very low, while running the
stress-ng mmapfork memory test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but that's work in progress. I am still testing the
changes on my system in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:42 +02:00
Gavin Guo
f2b8424f35 sched/numa: Fix use-after-free bug in the task_numa_compare
[ Upstream commit 1dff76b92f69051e579bdc131e01500da9fa2a91 ]

The following message can be observed on Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [<ffffffff81a6ce35>] dump_stack+0x45/0x56
   [<ffffffff81244aed>] print_trailer+0xfd/0x170
   [<ffffffff8124ac36>] object_err+0x36/0x40
   [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
   [<ffffffff8124d260>] kasan_report+0x40/0x50
   [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
   [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
   [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
   [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
   [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
   [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
   [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
   [<ffffffff8120ef02>] do_numa_page+0x192/0x200
   [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
   [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
   [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
   [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
   [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
   [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
   [<ffffffff81a772e8>] page_fault+0x28/0x30
   [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
   [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
   [<ffffffff810233c9>] ? sched_clock+0x9/0x10
   [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
   [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
   [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
   [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
   [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
   [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
   [<ffffffff81022c89>] ? read_tsc+0x9/0x20
   [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
   [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
   [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
   [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f

As commit 1effd9f193 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, rcu_read_lock() cannot protect the
task_struct from being freed in finish_task_switch(). The bug
happens while calculating imp, which requires access to
p->numa_faults, and that is freed in the following path:

do_exit()
        current->flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq->curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here is to call get_task_struct() early, before dst_rq->lock is
released, to protect the calculation, and to call put_task_struct() at the
corresponding point if dst_rq->curr ends up not being
assigned.
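
The pin-early pattern can be sketched with a toy reference count. This is
an illustration only: task_sketch, rq_sketch, get_task_sketch(),
put_task_sketch() and compare_sketch() are hypothetical names, the lock is
elided, and a simple threshold stands in for the real imp comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and the destination runqueue. */
struct task_sketch {
	int usage;		/* reference count */
	bool numa_faults_live;	/* false once the last reference is dropped */
	long numa_faults;
};

struct rq_sketch {
	struct task_sketch *curr;
};

void get_task_sketch(struct task_sketch *t)
{
	t->usage++;
}

void put_task_sketch(struct task_sketch *t)
{
	assert(t->usage > 0);
	if (--t->usage == 0)
		t->numa_faults_live = false;	/* models task_numa_free() */
}

/*
 * Take the reference while the runqueue lock would still be held, use
 * numa_faults only while pinned, and drop the reference again if the task
 * is not selected. Returns 1 and hands the reference to *best on success.
 */
int compare_sketch(struct rq_sketch *dst_rq, long threshold,
		   struct task_sketch **best)
{
	/* lock(dst_rq->lock) would go here */
	struct task_sketch *cur = dst_rq->curr;
	if (cur)
		get_task_sketch(cur);
	/* unlock(dst_rq->lock) would go here */

	if (!cur)
		return 0;

	assert(cur->numa_faults_live);	/* safe: we hold a reference */
	if (cur->numa_faults > threshold) {
		*best = cur;		/* caller now owns the reference */
		return 1;
	}

	put_task_sketch(cur);		/* not assigned: drop the pin */
	return 0;
}
```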

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() where it was missing.

Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-15 08:27:45 +02:00
Wanpeng Li
15abaa07a2 sched/nohz: Fix affine unpinned timers mess
commit 444969223c81c7d0a95136b7b4cfdcfbc96ac5bd upstream.

The following commit:

  9642d18eee ("nohz: Affine unpinned timers to housekeepers")

intended to affine unpinned timers to housekeepers:

  unpinned timers(full dynticks, idle)    =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(full dynticks, busy)    =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(housekeepers, idle)     =>   nearest busy housekeepers(otherwise, fallback to itself)

However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
intention to:

  unpinned timers(full dynticks, idle)    =>   any housekeepers(no matter the cpu topology)
  unpinned timers(full dynticks, busy)    =>   any housekeepers(no matter the cpu topology)
  unpinned timers(housekeepers, idle)     =>   any busy cpus(otherwise, fallback to any housekeepers)

This patch fixes it by checking if there are busy housekeepers nearby,
otherwise falling back to any housekeepers/itself. After the patch:

  unpinned timers(full dynticks, idle)    =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(full dynticks, busy)    =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(housekeepers, idle)     =>   nearest busy housekeepers(otherwise, fallback to itself)

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Fixed the changelog. ]
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 'commit 9642d18eee ("nohz: Affine unpinned timers to housekeepers")'
Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07 08:32:41 +02:00
Peter Zijlstra
c3cf68ec55 sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression
commit 173be9a14f7b2e901cf77c18b1aafd4d672e9d9e upstream.

Mike reports:

 Roughly 10% of the time, ltp testcase getrusage04 fails:
 getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
 getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
 getrusage04    0  TINFO  :  utime:           0us; stime:         179us
 getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
 getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:

And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.
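
A minimal sketch of that rule, assuming a simplified tick-ratio split:
split_sketch and split_rtime_sketch() are hypothetical names, and the
symmetric "no utime ticks" branch is an assumption added for completeness,
not something this changelog states.

```c
#include <assert.h>
#include <stdint.h>

struct split_sketch {
	uint64_t utime;
	uint64_t stime;
};

/*
 * Split the precise runtime (rtime) into user and system time using the
 * sampled tick counts as a ratio. With no stime ticks (including the case
 * of no ticks at all), everything is accounted as utime, so a task that
 * never saw a tick still gets its time accounted.
 */
struct split_sketch split_rtime_sketch(uint64_t rtime, uint64_t utime_ticks,
				       uint64_t stime_ticks)
{
	struct split_sketch out;

	if (stime_ticks == 0) {
		out.utime = rtime;
		out.stime = 0;
	} else if (utime_ticks == 0) {
		/* Assumption: mirror case, all runtime is system time. */
		out.utime = 0;
		out.stime = rtime;
	} else {
		/* Plain ratio; may overflow for very large inputs. */
		out.stime = rtime * stime_ticks / (stime_ticks + utime_ticks);
		out.utime = rtime - out.stime;
	}

	assert(out.utime + out.stime == rtime);
	return out;
}
```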

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 9d7fb04276 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07 08:32:41 +02:00
Peter Zijlstra
34bf12312b sched/fair: Fix effective_load() to consistently use smoothed load
commit 7dd4912594daf769a46744848b05bd5bc6d62469 upstream.

Starting with the following commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.
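
The explicit computation amounts to the following arithmetic. This is a
sketch only: tg_weight_sketch() is a hypothetical helper whose parameters
stand in for tg->load_avg, cfs_rq->tg_load_avg_contrib and
cfs_rq->avg.load_avg.

```c
#include <assert.h>

/*
 * Replace this cfs_rq's stale contribution to the group sum with its
 * current smoothed load, so the group total never lags the local value.
 */
long tg_weight_sketch(long tg_load_avg, long tg_load_avg_contrib,
		      long cfs_rq_load_avg)
{
	/* Assumption for the sketch: load averages are non-negative. */
	assert(tg_load_avg >= 0 && tg_load_avg_contrib >= 0 &&
	       cfs_rq_load_avg >= 0);

	return tg_load_avg - tg_load_avg_contrib + cfs_rq_load_avg;
}
```

For example, with a group sum of 1000, a stale contribution of 300 and a
current local load of 450, the corrected total is 1150.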

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:28 +02:00
Andrey Ryabinin
dc20f3244a kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
commit 57675cb976eff977aefb428e68e4e0236d48a9ff upstream.

Lengthy output of sysrq-w may take a lot of time on a slow serial console.

Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So, reset the watchdogs on all CPUs earlier, in the for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:25 +02:00
Peter Zijlstra
43b1bfec0e sched/fair: Fix cfs_rq avg tracking underflow
commit 8974189222159154c55f24ddad33e3613960521a upstream.

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW it's
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.
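
The approach can be sketched in portable C as follows. The function name
sub_positive_u64() is hypothetical and a volatile access is assumed as a
user-space stand-in for a single explicit load and store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Subtract val from *ptr, clamping at zero, without relying on the sign
 * bit: unsigned wrap-around is detected by the result being larger than
 * the value we started from. Only the final result is ever stored.
 */
void sub_positive_u64(volatile uint64_t *ptr, uint64_t val)
{
	assert(ptr != NULL);

	uint64_t var = *ptr;		/* one LOAD */
	uint64_t res = var - val;	/* unsigned wrap is well defined */

	if (res > var)			/* wrapped, i.e. val > var */
		res = 0;

	*ptr = res;			/* one STORE of the final value */
}
```

Because the test does not use the top bit, values above INT64_MAX remain
usable; for instance 0x8000000000000001 minus 1 yields 0x8000000000000000
rather than being clamped.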

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:31 -07:00
Jann Horn
c08b1a593a sched: panic on corrupted stack end
commit 29d6455178a09e1dc340380c582b13356227e8df upstream.

Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g.  via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).

Just panic directly.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:20 -07:00
Vik Heyndrickx
1df73f1884 sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream.

Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
have no load at all.

Uptime and /proc/loadavg on all systems with kernels released during the
last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
idle systems, but the way the kernel calculates this value prevents it
from getting lower than the mentioned values.

Likewise but not as obviously noticeable, a fully loaded system with no
processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
(multiplied by number of cores).

Once the (old) load becomes 93 or higher, it mathematically can never
get lower than 93, even when the active (load) remains 0 forever.
This results in the strange 0.00, 0.01, 0.05 uptime values on idle
systems.  Note: 93/2048 = 0.0454..., which rounds up to 0.05.

It is not correct to add a 0.5 rounding (=1024/2048) here, since the
result from this function is fed back into the next iteration again,
so the result of that +0.5 rounding value then gets multiplied by
(2048-2037), and then rounded again, so there is a virtual "ghost"
load created, next to the old and active load terms.

By changing the way the internally kept value is rounded, that internal
value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon
increasing load, the internally kept load value is rounded up, when the
load is decreasing, the load value is rounded down.
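
The two rounding schemes can be compared with a small fixed-point sketch
reconstructed from the description above. The 2048 scale and the 2037
decay factor come from the text; the function names, the FSHIFT/FIXED_1/
EXP_15 macro names and the settle() helper are assumptions made for
illustration.

```c
#include <assert.h>

#define FSHIFT	11			/* 2048 == 1.00 in fixed point */
#define FIXED_1	(1UL << FSHIFT)
#define EXP_15	2037UL			/* 15-minute decay factor */

typedef unsigned long (*calc_load_fn)(unsigned long load, unsigned long exp,
				      unsigned long active);

/* Old scheme: always add 0.5 before truncating. */
unsigned long calc_load_old(unsigned long load, unsigned long exp,
			    unsigned long active)
{
	assert(exp <= FIXED_1);
	load = load * exp + active * (FIXED_1 - exp);
	load += 1UL << (FSHIFT - 1);
	return load >> FSHIFT;
}

/* New scheme: round up while rising, round down while falling. */
unsigned long calc_load_new(unsigned long load, unsigned long exp,
			    unsigned long active)
{
	assert(exp <= FIXED_1);
	unsigned long newload = load * exp + active * (FIXED_1 - exp);

	if (active >= load)
		newload += FIXED_1 - 1;
	return newload / FIXED_1;
}

/* Iterate one scheme with a constant active load until it stops moving. */
unsigned long settle(calc_load_fn fn, unsigned long load, unsigned long active)
{
	for (int i = 0; i < 100000; i++) {
		unsigned long next = fn(load, EXP_15, active);

		if (next == load)
			break;
		load = next;
	}
	return load;
}
```

Starting from 1.00 with no active load, the old scheme stalls at 93 while
the new one decays to 0; starting from 0 under full load, the old scheme
stalls at 1955 (about 0.95) while the new one reaches 2048.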

The modified code was tested on nohz=off and nohz kernels. It was tested
on vanilla kernel 4.6-rc5 and on CentOS 7.1 kernel 3.10.0-327. It was
tested on single-, dual-, and octa-core systems. It was tested on virtual
hosts and bare hardware. No unwanted effects have been observed, and the
problems that the patch intended to fix were indeed gone.

Tested-by: Damien Wyart <damien.wyart@free.fr>
Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Doug Smythies <dsmythies@telus.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0f004f5a69 ("sched: Cure more NO_HZ load average woes")
Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:49 -07:00
Peter Zijlstra
c0944355a7 sched/cgroup: Fix/cleanup cgroup teardown/init
commit 2f5177f0fd7e531b26d54633be62d1d4cb94621c upstream.

The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:

  2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")

caused it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Suggested-by: Tejun Heo <tj@kernel.org>
Reported-by: Kazuki Yamaguchi <k@rhe.jp>
Reported-by: Niklas Cassel <niklas.cassel@axis.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04 14:48:42 -07:00
Thomas Gleixner
2a8225ef46 sched/cputime: Fix steal time accounting vs. CPU hotplug
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.

On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
value over CPU down and up. So after the CPU comes up again the delta
calculation in steal_account_process_tick() wrecks itself due to the
unsigned math:

	 u64 steal = paravirt_steal_clock(smp_processor_id());

	 steal -= this_rq()->prev_steal_time;

So if steal is smaller than rq->prev_steal_time we end up with an insanely large
value which then gets added to rq->prev_steal_time, resulting in a permanent
wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
become stale.

Nice trick to tell the world how idle the system is (100%) while the CPU is
100% busy running tasks. Though we prefer realistic numbers.

None of the accounting values which use a previous value to account for
fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
deals with clock warps and limits the /proc/stat visible wreckage. The
prev_time values are still wrong.

Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
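
The unsigned wrap and the effect of the reset can be sketched as follows.
rq_sketch, steal_delta_sketch() and rq_reset_on_online_sketch() are
hypothetical names, and resetting to zero is an assumption about what
"reset" means here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue bookkeeping. */
struct rq_sketch {
	uint64_t prev_steal_time;
};

/* Delta as computed with unsigned math: wraps if the clock went backwards. */
uint64_t steal_delta_sketch(const struct rq_sketch *rq, uint64_t steal_clock)
{
	assert(rq != NULL);
	return steal_clock - rq->prev_steal_time;
}

/* Assumed fix shape: forget the stale value when the CPU comes online. */
void rq_reset_on_online_sketch(struct rq_sketch *rq)
{
	assert(rq != NULL);
	rq->prev_steal_time = 0;
}
```

With a stale previous value of 1000 and a fresh clock reading of 5, the
delta is 2^64 - 995; after the reset it is 5.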

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: commit 095c0aa83e "sched: adjust scheduler cpu power for stolen time"
Fixes: commit aa48380851 "sched: Remove irq time from available CPU power"
Fixes: commit e6e6685acc "KVM guest: Steal time accounting"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:09:05 -07:00
Chris Friesen
af080e5802 sched/cputime: Fix steal_account_process_tick() to always return jiffies
commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d upstream.

The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.

Currently the return value of steal_account_process_tick() is in
units of cputime, which varies between jiffies and nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stat to show invalid data.

The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.
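
A minimal sketch of the accumulate-then-account idea, assuming HZ=250 (so
one jiffy is 4,000,000 ns): rq_sketch, NSEC_PER_JIFFY_SKETCH and
steal_account_tick_sketch() are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_JIFFY_SKETCH 4000000ULL	/* assumption: HZ=250 */

struct rq_sketch {
	uint64_t prev_steal_time;	/* ns already accounted as stolen */
};

/*
 * Return the number of whole jiffies stolen since the last accounted
 * point. Only whole jiffies advance prev_steal_time, so any sub-jiffy
 * remainder stays pending and is picked up by a later call.
 */
uint64_t steal_account_tick_sketch(struct rq_sketch *rq, uint64_t steal_clock_ns)
{
	assert(steal_clock_ns >= rq->prev_steal_time);

	uint64_t delta = steal_clock_ns - rq->prev_steal_time;
	uint64_t jiffies = delta / NSEC_PER_JIFFY_SKETCH;

	rq->prev_steal_time += jiffies * NSEC_PER_JIFFY_SKETCH;
	return jiffies;
}
```

A reading of 300 ns therefore returns 0 and leaves the tick to be accounted
as user, system or idle time; the 300 ns are not lost, they count toward
the next full jiffy.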

(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)

Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:35 -07:00
Raghavendra K T
a800cbddbc sched: Fix crash in sched_init_numa()
commit 9c03ee147193645be4c186d3688232fa438c57c7 upstream.

The following PowerPC commit:

  c118baf802 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for nonexistent nodes.

But when DEBUG_PER_CPU_MAPS=y is enabled, my powerNV system failed to boot
because in sched_init_numa(), the cpumask_or() operation was done on
unallocated nodes.

Fix that by performing the cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: <gkurz@linux.vnet.ibm.com>
Cc: <grant.likely@linaro.org>
Cc: <nikunj@linux.vnet.ibm.com>
Cc: <vdavydov@parallels.com>
Cc: <linuxppc-dev@lists.ozlabs.org>
Cc: <linux-mm@kvack.org>
Cc: <peterz@infradead.org>
Cc: <benh@kernel.crashing.org>
Cc: <paulus@samba.org>
Cc: <mpe@ellerman.id.au>
Cc: <anton@samba.org>
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17 12:31:02 -08:00
Andrey Ryabinin
9e0e83a1ec sched/fair: Fix multiplication overflow on 32-bit systems
Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
on 32-bit systems:

	UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
	signed integer overflow:
	87950 * 47742 cannot be represented in type 'int'

The most likely effects of this bug are bad load average numbers
resulting in weird scheduling. It's also likely that this can
persist for a longer time - until the system goes idle for
a long time so that all load avg numbers get reset.
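
The overflow from the UBSAN report can be reproduced with plain arithmetic.
This sketch uses the two operands quoted above; removed_load_sum_sketch()
and the LOAD_AVG_MAX_SKETCH macro name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_MAX_SKETCH 47742	/* second operand in the report */

/*
 * Widen to 64 bits before multiplying, so the product is computed in a
 * type that can hold it even where int and long are 32 bits wide.
 */
int64_t removed_load_sum_sketch(int32_t r)
{
	assert(r >= 0);
	return (int64_t)r * LOAD_AVG_MAX_SKETCH;
}
```

For r = 87950 the product is 4,198,908,900, which is larger than INT32_MAX
(2,147,483,647) and so cannot be represented in a 32-bit int.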

[ This is the CFS load average metric, not the procfs output, which
  is separate. ]

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/1450097243-30137-1-git-send-email-aryabinin@virtuozzo.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:05 +01:00
Peter Zijlstra
dfd01f0260 sched/wait: Fix the signal handling fix
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because it can be
changed by the instruction right after the store to that variable.  We must
instead pass the initial state along and use that.
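
The idea of passing the initial state along can be sketched like this.
task_sketch and bit_wait_sketch() are hypothetical stand-ins, the state
constants are illustrative, and the real schedule() call is only modelled
by resetting the state.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	TASK_RUNNING_SKETCH		= 0,
	TASK_INTERRUPTIBLE_SKETCH	= 1,
	TASK_UNINTERRUPTIBLE_SKETCH	= 2,
};

struct task_sketch {
	int state;
	bool signal_pending;
};

/*
 * The wait helper receives the mode the caller originally slept in.
 * cur->state cannot be consulted after waking, because it is already
 * TASK_RUNNING by then.
 */
int bit_wait_sketch(struct task_sketch *cur, int mode)
{
	assert(cur != NULL);

	/* schedule() would run here; on return the task is running. */
	cur->state = TASK_RUNNING_SKETCH;

	if ((mode & TASK_INTERRUPTIBLE_SKETCH) && cur->signal_pending)
		return -EINTR;
	return 0;
}
```

With a signal pending, an uninterruptible waiter gets 0 and an
interruptible one gets -EINTR, which matches the behaviour the changelog
asks for.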

Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-13 14:30:59 -08:00
Linus Torvalds
5406812e59 Merge branch 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "More change than I'd have liked at this stage.  The pids controller
  and the changes made to cgroup core to support it introduced and
  revealed several important issues.

   - Assigning membership to a newly created task and migrating it can
     race leading to incorrect accounting.  Oleg fixed it by widening
     threadgroup synchronization.  It looks like we'll be able to merge
     it with a different percpu rwsem which is used in fork path making
     things simpler and cheaper.

   - The recent change to extend cgroup membership to zombies (so that
     pid accounting can extend till the pid is actually released) missed
     pinning the underlying data structures leading to use-after-free.
     Fixed.

   - v2 hierarchy was calling subsystem callbacks with the wrong target
     cgroup_subsys_state based on the incorrect assumption that they
     share the same target.  pids is the first controller affected by
     this.  Subsys callbacks updated so that they can deal with
     multi-target migrations"

* 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup_pids: don't account for the root cgroup
  cgroup: fix handling of multi-destination migration from subtree_control enabling
  cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
  cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
  cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
  cgroup: make css_set pin its css's to avoid use-afer-free
  cgroup: fix cftype->file_offset handling
2015-12-08 13:35:52 -08:00
Linus Torvalds
51825c8a86 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree includes four core perf fixes for misc bugs, three fixes to
  x86 PMU drivers, and two updates to old email addresses"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Do not send exit event twice
  perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
  perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
  perf: Fix PERF_EVENT_IOC_PERIOD deadlock
  treewide: Remove old email address
  perf/x86: Fix LBR call stack save/restore
  perf: Update email address in MAINTAINERS
  perf/core: Robustify the perf_cgroup_from_task() RCU checks
  perf/core: Fix RCU problem with cgroup context switching code
2015-12-08 13:01:23 -08:00
Tejun Heo
0b98f0c042 Merge branch 'master' into for-4.4-fixes
The following commit which went into mainline through networking tree

  3b13758f51 ("cgroups: Allow dynamically changing net_classid")

conflicts in net/core/netclassid_cgroup.c with the following pending
fix in cgroup/for-4.4-fixes.

  1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

The former separates out update_classid() from cgrp_attach() and
updates it to walk all fds of all tasks in the target css so that it
can be used from both migration and config change paths.  The latter
drops @css from cgrp_attach().

Resolve the conflict by making cgrp_attach() call update_classid()
with the css from the first task.  We can revive @tset walking in
cgrp_attach() but given that net_cls is v1 only where there always is
only one target css during migration, this is fine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Nina Schiff <ninasc@fb.com>
2015-12-07 10:09:03 -05:00
Peter Zijlstra
ecf7d01c22 sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited (the task is in the middle of
being scheduled out), it could still result in corruption of the
scheduler data structures.

        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
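
The required ordering can be sketched in user space with C11 atomics. This is
only an illustration under stated assumptions, not the kernel code: struct task,
finish_switch() and waker_may_proceed() are hypothetical names, the acquire
fence stands in for a read barrier such as smp_rmb(), and the release store
stands in for the store_release() shown in the diagram.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two task fields involved in the race. */
struct task {
    atomic_int on_rq;   /* nonzero while the task is queued on a runqueue */
    atomic_int on_cpu;  /* nonzero while the task is executing on a CPU */
};

/* Scheduler side: publish that the task has completely left its CPU. */
static void finish_switch(struct task *t)
{
    atomic_store_explicit(&t->on_cpu, 0, memory_order_release);
}

/*
 * Waker side: returns false if the task is still queued (the real code
 * takes a different path there), otherwise waits until the task is off
 * its CPU and returns true.
 */
static bool waker_may_proceed(struct task *t)
{
    if (atomic_load_explicit(&t->on_rq, memory_order_relaxed))
        return false;

    /* Keep the on_cpu load from being hoisted above the on_rq load. */
    atomic_thread_fence(memory_order_acquire);

    /* Pairs with the release store in finish_switch(). */
    while (atomic_load_explicit(&t->on_cpu, memory_order_acquire))
        ; /* a real implementation would relax the CPU here */

    return true;
}
```

Without the fence, the on_cpu load could be satisfied early, which corresponds
to the stale "t = X->on_cpu; // 0" read in the diagram above.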

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:26:43 +01:00
Peter Zijlstra
b75a225315 sched/core: Better document the try_to_wake_up() barriers
Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:26:42 +01:00
Hiroshi Shimamoto
2541117b0c sched/cputime: Fix invalid gtime in proc
/proc/<pid>/stat shows an invalid gtime when the thread is running in a guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated as now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is enabled at runtime.

This patch makes task_gtime() just return gtime, without computing the
bogus tickless delta, when vtime accounting is not enabled.

Use context_tracking_is_enabled() to check whether vtime accounting is
active on some CPU; only in that case do we need to add the tickless
delta. This fixes the gtime value regression on machines not running
nohz full.
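
The shape of the fix can be sketched as follows. This is a simplified model
under stated assumptions, not the kernel function: struct task_times, its
fields and task_gtime_sketch() are hypothetical, and the vtime_enabled flag
stands in for the context_tracking_is_enabled() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task accounting state. */
struct task_times {
    uint64_t gtime;       /* guest time accumulated so far */
    uint64_t vtime_snap;  /* last snapshot; stale unless vtime is enabled */
    bool     in_guest;    /* task is currently running guest code */
};

/*
 * Return guest time. The tickless delta (now - vtime_snap) is added only
 * when vtime accounting is enabled, because vtime_snap is maintained only
 * in that case.
 */
static uint64_t task_gtime_sketch(const struct task_times *t, uint64_t now,
                                  bool vtime_enabled)
{
    uint64_t gtime = t->gtime;

    if (!vtime_enabled)
        return gtime;   /* the early return is the essence of the fix */

    if (t->in_guest) {
        assert(now >= t->vtime_snap);
        gtime += now - t->vtime_snap;
    }
    return gtime;
}
```

With vtime disabled, a stale vtime_snap can no longer inflate the result, which
is the symptom shown in the measurements below.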

The kernel was built with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n, and booted without nohz_full.

I started and then stopped a busy loop in a VM while watching gtime on the host.
Dump the 43rd field, which shows gtime, once every second:

	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
	S 4348
	R 7064566
	R 7064766
	R 7064967
	R 7065168
	S 4759
	S 4759

While the busy loop is running, gtime jumps to a huge value.

After applying this patch, gtime is reported correctly.

	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
	S 5338
	R 5365
	R 5465
	R 5566
	R 5666
	S 5726
	S 5726

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:18:49 +01:00
Xunlei Pang
8295c69925 sched/core: Clear the root_domain cpumasks in init_rootdomain()
root_domain::rto_mask, allocated through alloc_cpumask_var(),
contains garbage data, which may cause problems. For instance,
pull_rt_task() may do useless iterations if rto_mask retains
some extra garbage bits. Worse still, this violates the isolated
domain rule for clustered scheduling using cpuset, because tasks
(with all the CPUs allowed) that belong to one root domain can be
pulled away into another root domain.

The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.

Do the same thing for root_domain's other cpumask members:
dlo_mask, span, and online.
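
A user-space analogue of the difference between the two allocators, as a sketch
only: struct mask and the mask_*() helpers are hypothetical, with calloc()
playing the role of the zeroing allocation and malloc() the role of the
non-zeroing one.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical dynamically sized bitmask. */
struct mask {
    unsigned long *bits;
    size_t nbits;
};

/* Zeroing allocation: every bit starts out clear. Returns 0 on success. */
static int mask_zalloc(struct mask *m, size_t nbits)
{
    size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

    m->bits = calloc(words, sizeof(*m->bits));
    if (!m->bits)
        return -1;
    m->nbits = nbits;
    return 0;
}

static void mask_set(struct mask *m, size_t bit)
{
    assert(bit < m->nbits);
    m->bits[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Count the set bits; garbage bits would be counted here as well. */
static size_t mask_weight(const struct mask *m)
{
    size_t count = 0;

    for (size_t bit = 0; bit < m->nbits; bit++)
        if (m->bits[bit / BITS_PER_WORD] & (1UL << (bit % BITS_PER_WORD)))
            count++;
    return count;
}

static void mask_free(struct mask *m)
{
    free(m->bits);
    m->bits = NULL;
}
```

Had mask_zalloc() used malloc() instead, mask_weight() on a fresh mask could
return any value, which is the user-space counterpart of the extra iterations
and cross-domain pulls described above.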

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:16:21 +01:00
Sasha Levin
119d6f6a3b sched/core: Remove false-positive warning from wake_up_process()
Because wakeups can (fundamentally) be late, a task might not be in
the expected state. Therefore testing against a task's state is racy,
and can yield false positives.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: oleg@redhat.com
Fixes: 9067ac85d5 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:10:16 +01:00
Peter Zijlstra
68985633bc sched/wait: Fix signal handling in bit wait helpers
Vladimir reported getting RCU stall warnings and bisected it back to
commit:

  743162013d ("sched: Remove proliferation of wait_on_bit() action functions")

That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where a signal arrives while we sleep.
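
Why the order matters can be shown with a small model. Everything here is a
hypothetical stand-in: fake_schedule() models a sleep during which a signal may
arrive, and the two wait functions model the broken and the corrected ordering
rather than reproducing the kernel helpers.

```c
#include <errno.h>
#include <stdbool.h>

static bool sig_pending;                  /* models signal_pending() */
static bool signal_arrives_during_sleep;  /* test knob for the model */

/* Models schedule(): a signal may be delivered while we are asleep. */
static void fake_schedule(void)
{
    if (signal_arrives_during_sleep)
        sig_pending = true;
}

/* Broken order: the check runs before the sleep, so the signal is missed. */
static int wait_check_first(void)
{
    if (sig_pending)
        return -EINTR;
    fake_schedule();
    return 0;
}

/* Corrected order: sleep first, then report a signal that arrived meanwhile. */
static int wait_sleep_first(void)
{
    fake_schedule();
    if (sig_pending)
        return -EINTR;
    return 0;
}
```

In the broken order the caller sees 0 and goes back to waiting even though a
signal is pending, so it can loop without ever returning to handle it.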

Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mark.rutland@arm.com
Cc: neilb@suse.de
Cc: oleg@redhat.com
Fixes: 743162013d ("sched: Remove proliferation of wait_on_bit() action functions")
Fixes: cbbce82209 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:10:15 +01:00
Tejun Heo
1f7dd3e5a6 cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing the @css parameter from the three
migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
updating the cgroup_taskset iteration helpers to also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's a single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accommodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.
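
The idea of an iteration helper that also returns the destination can be
sketched as below. This is a toy model, not the cgroup API: struct css,
struct task, struct taskset, taskset_first(), taskset_next() and
attach_sketch() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel types are far richer. */
struct css  { long charge; };      /* per-cgroup controller state */
struct task { struct css *cur; };  /* css the task currently belongs to */

struct move    { struct task *task; struct css *dst; };
struct taskset { struct move *moves; size_t n, pos; };

/* Return the next task and report its own destination css via @dstp. */
static struct task *taskset_next(struct taskset *ts, struct css **dstp)
{
    if (ts->pos >= ts->n)
        return NULL;
    *dstp = ts->moves[ts->pos].dst;
    return ts->moves[ts->pos++].task;
}

static struct task *taskset_first(struct taskset *ts, struct css **dstp)
{
    ts->pos = 0;
    return taskset_next(ts, dstp);
}

/* Move one charge per task from its current css to its own destination. */
static void attach_sketch(struct taskset *ts)
{
    struct css *dst;
    struct task *t;

    for (t = taskset_first(ts, &dst); t; t = taskset_next(ts, &dst)) {
        assert(t->cur != NULL);
        dst->charge++;
        t->cur->charge--;
        t->cur = dst;
    }
}
```

Because each task carries its own destination, a single migration can charge A
and B separately instead of attributing everything to one css passed in by the
caller.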

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00