Two watchdog changes that came through different trees had a non
conflicting conflict, that is, one changed the semantics of a variable
but no actual code conflict happened. So the merge appeared fine, but
the resulting code did not behave as expected.
Commit 195daf665a ("watchdog: enable the new user interface of the
watchdog mechanism") changes the semantics of watchdog_user_enabled,
which thereafter is only used by the functions introduced by
b3738d2932 ("watchdog: Add watchdog enable/disable all functions").
There further appears to be a distinct lack of serialization between
setting and using watchdog_enabled, so perhaps we should wrap the
{en,dis}able_all() things in watchdog_proc_mutex.
This patch fixes a s2r failure reported by Michal; which I cannot
readily explain. But this does make the code internally consistent
again.
Reported-and-tested-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
static code checking was unhappy with:
./kernel/sched/fair.c:162 WARNING: return of wrong type
int != unsigned int
get_update_sysctl_factor() is declared to return int but is
currently returning an unsigned int. The first few preprocessed
lines are:
static int get_update_sysctl_factor(void)
{
unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask));
int __min2 = (8); __min1 < __min2 ? __min1: __min2; });
unsigned int factor;
The type used by min_t() should be 'unsigned int' and the return type
of get_update_sysctl_factor() should also be 'unsigned int' as its
call-site update_sysctl() is expecting 'unsigned int' and the values
utilizing:
'factor'
'sysctl_sched_min_granularity'
'sched_nr_latency'
'sysctl_sched_wakeup_granularity'
... are also all 'unsigned int', plus cpumask_weight() is also
returning 'unsigned int'.
So the natural type to use around here is 'unsigned int'.
( Patch was compile tested with x86_64_defconfig +
CONFIG_SCHED_DEBUG=y and the changed sections in
kernel/sched/fair.i were reviewed. )
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
[ Improved the changelog a bit. ]
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a suspend/resume related regression fix, and an RT priority
boosting fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix regression in cpuset_cpu_inactive() for suspend
sched: Handle priority boosted tasks proper in setscheduler()
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also a lockdep annotation fix, a PMU event
list fix and a new model addition"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/liblockdep: Fix compilation error
tools/liblockdep: Fix linker error in case of cross compile
perf tools: Use getconf to determine number of online CPUs
tools: Fix tools/vm build
perf/x86/rapl: Enable Broadwell-U RAPL support
perf/x86/intel: Fix SLM cache event list
perf: Annotate inherited event ctx->mutex recursion
Pull irq updates from Thomas Gleixner:
"Two patches from the irq departement:
- a simple fix to make dummy_irq_chip usable for wakeup scenarios
- removal of the gic arch_extn hackery. Now that all users are
converted we really want to get rid of the interface so people wont
come up with new use cases"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: gic: Drop support for gic_arch_extn
genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip
Pull timer fix from Thomas Gleixner:
"A simple fix to actually shut down a detached device instead of
keeping it active"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevents: Shutdown detached clockevent device
Pull tracing fix from Steven Rostedt:
"The newly added ftrace_print_array_seq() function had a bug in it.
Luckily, the only user of it didn't make the 4.1 merge window.
But the helper function should be fixed before 4.2 when the users
start coming in"
* tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Make ftrace_print_array_seq compute buf_len
Given the overall futex architecture, any chance of reducing
hb->lock contention is welcome. In this particular case, using
wake-queues to enable lockless wakeups addresses very much real
world performance concerns, even cases of soft-lockups in cases
of large amounts of blocked tasks (which is not hard to find in
large boxes, using but just a handful of futex).
At the lowest level, this patch can reduce latency of a single thread
attempting to acquire hb->lock in highly contended scenarios by a
up to 2x. At lower counts of nr_wake there are no regressions,
confirming, of course, that the wake_q handling overhead is practically
non existent. For instance, while a fair amount of variation,
the extended pef-bench wakeup benchmark shows for a 20 core machine
the following avg per-thread time to wakeup its share of tasks:
nr_thr ms-before ms-after
16 0.0590 0.0215
32 0.0396 0.0220
48 0.0417 0.0182
64 0.0536 0.0236
80 0.0414 0.0097
96 0.0672 0.0152
Naturally, this can cause spurious wakeups. However there is no core code
that cannot handle them afaict, and furthermore tglx does have the point
that other events can already trigger them anyway.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: George Spelvin <linux@horizon.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430494072-30283-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is useful for locking primitives that can effect multiple
wakeups per operation and want to avoid lock internal lock contention
by delaying the wakeups until we've released the lock internal locks.
Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.
Properties of a wake_q are:
- Lockless, as queue head must reside on the stack.
- Being a queue, maintains wakeup order passed by the callers. This can
be important for otherwise, in scenarios where highly contended locks
could affect any reliance on lock fairness.
- A queued task cannot be added again until it is woken up.
This patch adds the needed infrastructure into the scheduler code
and uses the new wake_list to delay the futex wakeups until
after we've released the hash bucket locks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[tweaks, adjustments, comments, etc.]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: George Spelvin <linux@horizon.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While running a database workload, we found a scalability issue with itimers.
Much of the problem was caused by the thread_group_cputimer spinlock.
Each time we account for group system/user time, we need to obtain a
thread_group_cputimer's spinlock to update the timers. On larger systems
(such as a 16 socket machine), this caused more than 30% of total time
spent trying to obtain this kernel lock to update these group timer stats.
This patch converts the timers to 64-bit atomic variables and use
atomic add to update them without a lock. With this patch, the percent
of total time spent updating thread group cputimer timers was reduced
from 30% down to less than 1%.
Note: On 32-bit systems using the generic 64-bit atomics, this causes
sample_group_cputimer() to take locks 3 times instead of just 1 time.
However, we tested this patch on a 32-bit system ARM system using the
generic atomics and did not find the overhead to be much of an issue.
An explanation for why this isn't an issue is that 32-bit systems usually
have small numbers of CPUs, and cacheline contention from extra spinlocks
called periodically is not really apparent on smaller systems.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I could not find the loadavg code.. turns out it was hidden in a file
called proc.c. It further got mingled up with the cruft per rq load
indexes (which we really want to get rid of).
Move the per rq load indexes into the fair.c load-balance code (that's
the only thing that uses them) and rename proc.c to loadavg.c so we
can find it again.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Did minor cleanups to the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 3c18d447b3 ("sched/core: Check for available DL bandwidth in
cpuset_cpu_inactive()"), a SCHED_DEADLINE bugfix, had a logic error that
caused a regression in setting a CPU inactive during suspend. I ran into
this when a program was failing pthread_setaffinity_np() with EINVAL after
a suspend+wake up.
A simple reproducer:
$ ./a.out
sched_setaffinity: Success
$ systemctl suspend
$ ./a.out
sched_setaffinity: Invalid argument
... where ./a.out is:
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
long num_cores;
cpu_set_t cpu_set;
int ret;
num_cores = sysconf(_SC_NPROCESSORS_ONLN);
CPU_ZERO(&cpu_set);
CPU_SET(num_cores - 1, &cpu_set);
errno = 0;
ret = sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
perror("sched_setaffinity");
return ret ? EXIT_FAILURE : EXIT_SUCCESS;
}
The mistake is that suspend is handled in the action ==
CPU_DOWN_PREPARE_FROZEN case of the switch statement in
cpuset_cpu_inactive().
However, the commit in question masked out CPU_TASKS_FROZEN
from the action, making this case dead.
The fix is straightforward.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3c18d447b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()")
Link: http://lkml.kernel.org/r/1cb5ecb3d6543c38cce5790387f336f54ec8e2bc.1430733960.git.osandov@osandov.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ronny reported that the following scenario is not handled correctly:
T1 (prio = 10)
lock(rtmutex);
T2 (prio = 20)
lock(rtmutex)
boost T1
T1 (prio = 20)
sys_set_scheduler(prio = 30)
T1 prio = 30
....
sys_set_scheduler(prio = 10)
T1 prio = 30
The last step is wrong as T1 should now be back at prio 20.
Commit c365c292d0 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted tasks tries to lower its
priority.
Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.
Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292d0 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>