This patch touches the RT group scheduling case.
Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because a priority change at a child level does not
guarantee that the priority is the highest over the whole rq. So this
bug makes RT balancing unusable.
A short example: the task with the highest priority among all of the rq's
RT tasks (no other task has the same priority) wakes up on a throttled
rt_rq. The rq's cpupri is set to the equivalent of the task's priority,
but the real rq->rt.highest_prio.curr is lower.
The patch below fixes the problem.
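A minimal sketch of the kind of guard involved (illustrative, close to but
not necessarily identical to the actual diff): only propagate a priority
change to the rq-wide cpupri when the rt_rq being updated is the rq's
top-level rt_rq.

  static void
  inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
  {
  	struct rq *rq = rq_of_rt_rq(rt_rq);

  #ifdef CONFIG_RT_GROUP_SCHED
  	/*
  	 * Change rq's cpupri only if rt_rq is the top queue; a child
  	 * rt_rq's priority says nothing about the rq as a whole.
  	 */
  	if (&rq->rt != rt_rq)
  		return;
  #endif
  	if (rq->online && prio < prev_prio)
  		cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
  }

dec_rt_prio_smp() gets the same early return.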
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
dereference on sd_busy but the fix also altered what scheduling domain it
used for the 'sd_llc' percpu variable.
One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks are then running on CPUs that are not cache hot.
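To illustrate the distinction, here is roughly the shape of the corrected
helper from that era (a simplified sketch from memory; names and details
may differ slightly from the actual patch): sd_llc must keep pointing at
the SD_SHARE_PKG_RESOURCES domain, while sd_busy wants that domain's parent.

  static void update_top_cache_domain(int cpu)
  {
  	struct sched_domain *sd, *busy_sd = NULL;
  	int id = cpu, size = 1;

  	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
  	if (sd) {
  		id = cpumask_first(sched_domain_span(sd));
  		size = cpumask_weight(sched_domain_span(sd));
  		busy_sd = sd->parent;	/* sd_busy is one level up */
  	}
  	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);

  	/* sd_llc stays on the cache-sharing domain itself */
  	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
  	per_cpu(sd_llc_size, cpu) = size;
  	per_cpu(sd_llc_id, cpu) = id;
  }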
This was found through bisection where ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested
implying there may be an additional bug or a common root cause. Here are
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.
4-core machine
3.13.0-rc3 3.13.0-rc3
vanilla fixsd-v3r3
Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%)
Mean 2 0.34 ( 0.00%) 0.10 ( 70.59%)
Mean 3 1.29 ( 0.00%) 0.93 ( 27.91%)
Mean 4 7.08 ( 0.00%) 0.77 ( 89.12%)
Mean 5 193.54 ( 0.00%) 2.14 ( 98.89%)
Mean 6 151.12 ( 0.00%) 2.06 ( 98.64%)
Mean 7 115.38 ( 0.00%) 2.04 ( 98.23%)
Mean 8 108.65 ( 0.00%) 1.92 ( 98.23%)
8-core machine
3.13.0-rc3 3.13.0-rc3
vanilla fixsd-v3r3
Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%)
Mean 2 0.40 ( 0.00%) 0.21 ( 47.50%)
Mean 3 23.73 ( 0.00%) 0.89 ( 96.25%)
Mean 4 12.79 ( 0.00%) 1.04 ( 91.87%)
Mean 5 13.08 ( 0.00%) 2.42 ( 81.50%)
Mean 6 23.21 ( 0.00%) 69.46 (-199.27%)
Mean 7 15.85 ( 0.00%) 101.72 (-541.77%)
Mean 8 109.37 ( 0.00%) 19.13 ( 82.51%)
Mean 12 124.84 ( 0.00%) 28.62 ( 77.07%)
Mean 16 113.50 ( 0.00%) 24.16 ( 78.71%)
It's eliminated for one machine and reduced for another.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
results in him occasionally seeing time going backwards - which
crashes the scheduler ...
Most of our time accounting can actually handle that, except for the most
common one: the tick time update of sched_fair.
There is a further problem with that code: previously we assumed that,
because we get a tick every TICK_NSEC, our time delta could never exceed
32 bits, which kept the math simple.
However, ever since Frederic managed to get NO_HZ_FULL merged, this is no
longer the case: a task can now run for a long time indeed without getting
a tick. It only takes about 4.2 seconds to overflow a u32 holding
nanoseconds.
This means we not only need to deal better with time going backwards, but
also that we need to be able to handle large deltas.
This patch reworks the entire code and uses mul_u64_u32_shr() as
proposed by Andy a long while ago.
We express our virtual time scale factor as a u32 multiplier and shift
right; the 32-bit mul_u64_u32_shr() implementation then reduces to a
single 32x32->64 multiply if the time delta is still short (the common
case).
On 64-bit, a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128 is set.
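For reference, the generic (non-INT128) fallback looks roughly like this
(a sketch of the math64.h helper this rework relies on); the scheduler
then scales a delta as mul_u64_u32_shr(delta_exec, fact, shift):

  static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
  {
  	u32 ah, al;
  	u64 ret;

  	al = a;		/* low 32 bits */
  	ah = a >> 32;	/* high 32 bits; zero for short deltas */

  	ret = ((u64)al * mul) >> shift;
  	if (ah)
  		ret += ((u64)ah * mul) << (32 - shift);

  	return ret;
  }

With ARCH_SUPPORTS_INT128 the whole thing collapses to
((unsigned __int128)a * mul) >> shift.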
Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Paul Turner <pjt@google.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
Make sure to always initialize power_orig now that we actually use it.
Ideally build_sched_domains() -> init_sched_groups_power() would also
initialize this; but for some yet unexplained reason some setups seem
to miss updates there.
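A minimal sketch of the kind of initialization involved (the exact
placement, e.g. in build_overlap_sched_groups(), is assumed here rather
than quoted from the patch):

  	/*
  	 * Seed power_orig together with power so that sg_capacity(),
  	 * which divides by power_orig, never sees a stale zero.
  	 */
  	sg->sgp->power = SCHED_POWER_SCALE * cpumask_weight(sched_group_cpus(sg));
  	sg->sgp->power_orig = sg->sgp->power;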
Reported-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull tracing fix from Steven Rostedt:
"A regression showed up that there's a large delay when enabling all
events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched()
performed when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system
call tracepoint is a bit overboard. A single synchronize_sched()
before the deletion of the instance is sufficient"
* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Only run synchronize_sched() at instance deletion time
It has been reported that boot up with FTRACE_SELFTEST enabled can take a
very long time. There can be stalls of over a minute.
This was tracked down to the synchronize_sched() called when a system call
event is disabled. As the self tests enable and disable thousands of events,
this makes the synchronize_sched() get called thousands of times.
The synchronize_sched() was added with d562aff93b "tracing: Add support
for SOFT_DISABLE to syscall events", which caused this regression (added
in 3.13-rc1).
The synchronize_sched() is to protect against the events being accessed
when a tracer instance is being deleted. When an instance is being deleted
all the events associated to it are unregistered. The synchronize_sched()
makes sure that no more users are running when it finishes.
Instead of calling synchronize_sched() for all syscall events, we only
need to call it once, after the events are unregistered and before the
instance is deleted. The event_mutex is held during this action to
prevent new users from enabling events.
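Illustrative sketch of the intended ordering (the helper names below are
placeholders, not the actual tracing functions):

  static void instance_delete(struct trace_array *tr)
  {
  	mutex_lock(&event_mutex);	/* no new users can enable events */

  	unregister_all_events(tr);	/* placeholder: unregister this
  					 * instance's events, syscalls included */

  	/*
  	 * A single synchronize_sched() here is enough: once it returns,
  	 * no tracepoint probe can still be touching this instance, so
  	 * its buffers can be freed safely.
  	 */
  	synchronize_sched();

  	free_instance_buffers(tr);	/* placeholder */

  	mutex_unlock(&event_mutex);
  }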
Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull timer fixes from Thomas Gleixner:
- timekeeping: Cure a subtle drift issue on GENERIC_TIME_VSYSCALL_OLD
- nohz: Make CONFIG_NO_HZ=n and nohz=off command line option behave the
same way. Fixes a long standing load accounting wreckage.
- clocksource/ARM: Kconfig update to avoid ARM=n wreckage
- clocksource/ARM: Fixlets for the AT91 and SH clocksource/clockevents
- Trivial documentation update and kzalloc conversion from akpm's pile
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off
time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLD
clocksource: arm_arch_timer: Hide eventstream Kconfig on non-ARM
clocksource: sh_tmu: Add clk_prepare/unprepare support
clocksource: sh_tmu: Release clock when sh_tmu_register() fails
clocksource: sh_mtu2: Add clk_prepare/unprepare support
clocksource: sh_mtu2: Release clock when sh_mtu2_register() fails
ARM: at91: rm9200: switch back to clockevents_config_and_register
tick: Document tick_do_timer_cpu
timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
NOHZ: Check for nohz active instead of nohz enabled
Pull irq fixes from Thomas Gleixner:
- Correction of fuzzy and fragile IRQ_RETVAL macro
- IRQ related resume fix affecting only XEN
- ARM/GIC fix for chained GIC controllers
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: Gic: fix boot for chained gics
irq: Enable all irqs unconditionally in irq_resume
genirq: Correct fuzzy and fragile IRQ_RETVAL() definition
Pull scheduler fixes from Ingo Molnar:
"Various smaller fixlets, all over the place"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/doc: Fix generation of device-drivers
sched: Expose preempt_schedule_irq()
sched: Fix a trivial typo in comments
sched: Remove unused variable in 'struct sched_domain'
sched: Avoid NULL dereference on sd_busy
sched: Check sched_domain before computing group power
MAINTAINERS: Update file patterns in the lockdep and scheduler entries
Pull perf fixes from Ingo Molnar:
"Misc kernel and tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools lib traceevent: Fix conversion of pointer to integer of different size
perf/trace: Properly use u64 to hold event_id
perf: Remove fragile swevent hlist optimization
ftrace, perf: Avoid infinite event generation loop
tools lib traceevent: Fix use of multiple options in processing field
perf header: Fix possible memory leaks in process_group_desc()
perf header: Fix bogus group name
perf tools: Tag thread comm as overriden
Pull workqueue fixes from Tejun Heo:
"This contains one important fix. The NUMA support added a while back
broke ordering guarantees on ordered workqueues. It was enforced by
having a single frontend interface with @max_active == 1, but the NUMA
support puts multiple interfaces on unbound workqueues on NUMA
machines, thus breaking the ordering guarantee. This is fixed by
disabling NUMA support on ordered workqueues.
The above and a couple other patches were sitting in for-3.12-fixes
but I forgot to push that out, so they ended up waiting a bit too
long. My apologies.
Other fixes are minor"
* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues
workqueue: fix comment typo for __queue_work()
workqueue: fix ordered workqueues in NUMA setups
workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY
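For reference, the ordering guarantee discussed in the message above is
the one callers rely on when they create a queue like this (illustrative
snippet; the queue and function names are made up):

  #include <linux/workqueue.h>

  static struct workqueue_struct *my_ordered_wq;	/* hypothetical */

  static int __init my_module_init(void)
  {
  	/*
  	 * An ordered workqueue must execute at most one work item at a
  	 * time, in queueing order, regardless of how many CPUs or NUMA
  	 * nodes the machine has.
  	 */
  	my_ordered_wq = alloc_ordered_workqueue("my_ordered", 0);
  	if (!my_ordered_wq)
  		return -ENOMEM;
  	return 0;
  }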
Pull cgroup fixes from Tejun Heo:
"Fixes for three issues.
- cgroup destruction path could swamp system_wq possibly leading to
deadlock. This actually seems to happen in the wild with memcg
because memcg destruction path adds nested dependency on system_wq.
Resolved by isolating cgroup destruction work items on its
dedicated workqueue.
- Possible locking context deadlock through seqcount reported by
lockdep
- Memory leak under certain conditions"
* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix cgroup_subsys_state leak for seq_files
cpuset: Fix memory allocator deadlock
cgroup: use a dedicated workqueue for cgroup destruction
If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.
If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
command line option "nohz=off" or not enabled due to missing hardware
support, then tick_nohz_get_sleep_length() returns 0. That happens
because ts->sleep_length is never set in that case.
Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.
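Sketch of the fix, with the placement (the "nohz inactive" early return in
can_stop_idle_tick()) assumed from memory of the 3.13-era code:

  	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
  		/*
  		 * Report one tick period instead of leaving
  		 * ts->sleep_length at 0 when NOHZ never got armed.
  		 */
  		ts->sleep_length = (ktime_t) { .tv64 = NSEC_PER_SEC/HZ };
  		return false;
  	}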
Reported-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The init_kernel_text() and core_kernel_text() functions should not
include the labels _einittext and _etext when checking if an address is
inside the .text or .init sections.
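In other words, the end-of-section labels are exclusive bounds. A sketch
of the corrected check (close to the kernel/extable.c code of that time):

  static inline int init_kernel_text(unsigned long addr)
  {
  	/*
  	 * _einittext is the first byte *after* .init.text, so it must
  	 * not itself be treated as part of the section.
  	 */
  	if (addr >= (unsigned long)_sinittext &&
  	    addr < (unsigned long)_einittext)
  		return 1;
  	return 0;
  }

core_kernel_text() gets the analogous change for _stext/_etext.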
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a cgroup file implements either read_map() or read_seq_string(),
such a file is served using seq_file by overriding file->f_op with
cgroup_seqfile_operations, which also overrides the release method to
single_release() instead of cgroup_file_release().
Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release(). The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.
Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.
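Roughly, the shared release path ends up looking like this (a sketch from
memory of the 3.12-era code; field and helper names may differ):

  static int cgroup_file_release(struct inode *inode, struct file *file)
  {
  	struct cfent *cfe = __d_cfe(file->f_dentry);
  	struct cftype *cft = __d_cft(file->f_dentry);
  	struct cgroup_subsys_state *css = cfe->css;
  	int ret = 0;

  	if (cft->read_map || cft->read_seq_string)
  		ret = single_release(inode, file);	/* seq_file case */
  	else if (cft->release)
  		ret = cft->release(inode, file);

  	if (css->ss)
  		css_put(css);	/* drop the reference open() took */
  	return ret;
  }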
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.12
Juri hit the below lockdep report:
[ 4.303391] ======================================================
[ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted
[ 4.303395] ------------------------------------------------------
[ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
[ 4.303417]
[ 4.303417] and this task is already holding:
[ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
[ 4.303431] which would create a new lock dependency:
[ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
[ 4.303436]
[ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
[ 4.303922] HARDIRQ-ON-W at:
[ 4.303923] [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
[ 4.303926] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[ 4.303929] [<ffffffff81063dd6>] kthreadd+0x86/0x180
[ 4.303931] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[ 4.303933] SOFTIRQ-ON-W at:
[ 4.303933] [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
[ 4.303935] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[ 4.303940] [<ffffffff81063dd6>] kthreadd+0x86/0x180
[ 4.303955] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[ 4.303959] INITIAL USE at:
[ 4.303960] [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
[ 4.303963] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[ 4.303966] [<ffffffff81063dd6>] kthreadd+0x86/0x180
[ 4.303969] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[ 4.303972] }
This reports that we take mems_allowed_seq with interrupts enabled. A
little digging found that this can only come from
cpuset_change_task_nodemask().
This is an actual deadlock because an interrupt doing an allocation will
hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
forever waiting for the write side to complete.
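A minimal sketch of the fix on the write side (placement in
cpuset_change_task_nodemask() assumed; the surrounding nodemask and
mempolicy updates are elided):

  	unsigned long flags;

  	task_lock(tsk);
  	/*
  	 * An interrupt that allocates memory spins in
  	 * read_seqcount_begin() until the write below completes, so the
  	 * write section must not be interrupted on this CPU.
  	 */
  	local_irq_save(flags);
  	write_seqcount_begin(&tsk->mems_allowed_seq);

  	/* ... update tsk->mems_allowed and rebind the mempolicy ... */

  	write_seqcount_end(&tsk->mems_allowed_seq);
  	local_irq_restore(flags);
  	task_unlock(tsk);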
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Reported-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org