There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
__migrate_task() doesn't report an error code, so a task can be left on its
runqueue if its cpus_allowed mask changed so that dest_cpu is no longer a
possible target. Also, changing the cpus_allowed mask requires rq->lock to be
held.
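As a rough sketch of the fix (not the literal patch), __migrate_task() can
report failure and re-check the mask with both runqueue locks held, so a task
whose cpus_allowed changed under it is not silently left behind:

/*
 * Sketch only: return 0 instead of silently doing nothing when the
 * destination is no longer allowed; the caller can then retry or give up.
 */
static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
{
	struct runqueue *rq_src = cpu_rq(src_cpu);
	struct runqueue *rq_dest = cpu_rq(dest_cpu);
	int ret = 0;

	double_rq_lock(rq_src, rq_dest);
	/* dest_cpu may have been removed from cpus_allowed in the meantime */
	if (!cpu_isset(dest_cpu, p->cpus_allowed))
		goto out;
	/* ... deactivate on rq_src, activate on rq_dest ... */
	ret = 1;
out:
	double_rq_unlock(rq_src, rq_dest);
	return ret;
}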
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
as a long long here.
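A before/after sketch of the kind of declaration being changed (the variable
name is all that matters here):

-	long long i;	/* 64 bits to count CPUs is overkill */
+	int i;		/* fine unless NR_CPUS ever exceeds 2^31 */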
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The relationship between INTERACTIVE_SLEEP and the ceiling is neither exact
nor explicit enough. The sleep boost is not supposed to be any larger than it
would be without this code, and the comment only explains why the code exists,
not what it actually does. Fix both.
There is a priority ceiling that tasks which only ever sleep for very long
periods cannot surpass. Fix its handling.
Prevent the on-runqueue bonus logic from defeating the idle sleep logic.
Take the opportunity to micro-optimise.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial report and lock contention fix from Chris Mason:
Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5. We tracked down one of the regressions to lock contention in
schedule-heavy workloads (~70,000 context switches per second).
kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks. The patch below is more of a
discussion point than a suggested fix (although it does reduce lock
contention significantly). The dependent_sleeper code looks very expensive
to me, especially for using a spinlock to bounce control between two
different siblings on the same cpu.
It is further optimized (see the trylock sketch after this list):
* perform dependent_sleeper check after next task is determined
* convert wake_sleeping_dependent to use trylock
* skip smt runqueue check if trylock fails
* optimize double_rq_lock now that smt nice is converted to trylock
* early exit in searching first SD_SHARE_CPUPOWER domain
* speedup fast path of dependent_sleeper
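A hedged sketch of the wake_sleeping_dependent() trylock conversion
(simplified; the real function walks the sibling runqueues of the SMT domain):

static void wake_sleeping_dependent(int this_cpu)
{
	struct sched_domain *tmp, *sd = NULL;
	int i;

	/* early exit: stop at the first SD_SHARE_CPUPOWER domain */
	for_each_domain(this_cpu, tmp)
		if (tmp->flags & SD_SHARE_CPUPOWER) {
			sd = tmp;
			break;
		}
	if (!sd)
		return;

	for_each_cpu_mask(i, sd->span) {
		struct runqueue *smt_rq = cpu_rq(i);

		if (i == this_cpu)
			continue;
		if (unlikely(!spin_trylock(&smt_rq->lock)))
			continue;	/* contended: skip the SMT check */

		/* ... kick the sibling if it is running the idle task ... */
		spin_unlock(&smt_rq->lock);
	}
}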
[akpm@osdl.org: cleanup]
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make notifier_calls associated with cpu_notifier as __cpuinit.
__cpuinit makes sure that the function is init time only unless
CONFIG_HOTPLUG_CPU is defined.
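For illustration, a hypothetical hotplug callback annotated this way (all
names are made up):

#include <linux/cpu.h>
#include <linux/notifier.h>

/* discarded after boot unless CONFIG_HOTPLUG_CPU is enabled */
static int __cpuinit foo_cpu_callback(struct notifier_block *nfb,
				      unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_UP_PREPARE:
		/* allocate per-cpu resources for cpu (long)hcpu */
		break;
	case CPU_UP_CANCELED:
	case CPU_DEAD:
		/* release them again */
		break;
	}
	return NOTIFY_OK;
}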
[akpm@osdl.org: section fix]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself; other CPUs only read it. So I moved it into ti->status.
Converted i386/x86-64/ia64 for now because that was the easiest
way to fix ACPI, which also manipulates these flags in its idle
function.
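Roughly, the change amounts to the following (TS_POLLING is the new status
bit; treat the snippet as a sketch of the idle loop):

/* before: atomic bit operations on ti->flags */
set_thread_flag(TIF_POLLING_NRFLAG);
...
clear_thread_flag(TIF_POLLING_NRFLAG);

/* after: plain read-modify-write on ti->status; only the idle thread
 * writes its own flag, other CPUs merely read it */
current_thread_info()->status |= TS_POLLING;
...
current_thread_info()->status &= ~TS_POLLING;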
Cc: Nick Piggin <npiggin@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The introduction of the SCHED_BATCH scheduling class, with a value of 3, means
that the expression (p->policy & SCHED_FIFO) will return true if the policy
is SCHED_BATCH or SCHED_FIFO.
Unfortunately, this expression is used in sys_sched_rr_get_interval(), and in
the absence of a comment to say that this is intentional I presume that it is
unintentional and erroneous.
The fix is to change the expression to (p->policy == SCHED_FIFO).
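With the policy constants of the time (SCHED_NORMAL=0, SCHED_FIFO=1,
SCHED_RR=2, SCHED_BATCH=3) the bitwise test matches too much:

/* buggy: also true for SCHED_BATCH, since 3 & 1 == 1 */
if (p->policy & SCHED_FIFO)
	/* treat the task as having no timeslice */;

/* fixed: true only for SCHED_FIFO */
if (p->policy == SCHED_FIFO)
	/* ... */;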
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be
called with CPU_UP_CANCELED. A few of these callbacks assume that on
CPU_UP_PREPARE a pointer to a task has been stored in a percpu array. This
assumption does not hold if CPU_UP_PREPARE fails, and the subsequent calls to
kthread_bind() in CPU_UP_CANCELED will then cause an addressing exception
because they are passed a NULL pointer.
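The usual shape of the fix, using ksoftirqd as an illustrative example of such
a per-cpu task pointer (sketch, not the exact hunks):

case CPU_UP_CANCELED:
	/* CPU_UP_PREPARE may have failed before the kthread was created */
	if (!per_cpu(ksoftirqd, hotcpu))
		break;
	/* safe now: the pointer is known to be valid */
	kthread_bind(per_cpu(ksoftirqd, hotcpu),
		     any_online_cpu(cpu_online_map));
	/* fall through to the usual CPU_DEAD-style cleanup */
	break;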
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.
This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.
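In outline (the real definitions live in asm-generic/percpu.h and the
per-arch percpu headers):

/* generic per-cpu implementation */
#define __raw_get_cpu_var(var)	per_cpu(var, raw_smp_processor_id())

/* architectures with an optimized per-cpu implementation
 * (ia64, powerpc, s390, sparc64, x86_64) */
#define __raw_get_cpu_var(var)	__get_cpu_var(var)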
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the __might_sleep() check back to cond_resched().
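The result looks roughly like this (surrounding details elided):

int __sched cond_resched(void)
{
	if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE)) {
		__might_sleep(__FILE__, __LINE__);	/* the restored check */
		__cond_resched();
		return 1;
	}
	return 0;
}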
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds LSM hooks to the setaffinity and getaffinity functions so that
security modules can control these operations between tasks, using the
task_setscheduler and task_getscheduler LSM hooks.
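For example, the affinity syscalls gain checks along these lines (a sketch of
the call sites):

	/* sched_setaffinity(): ask the security module before changing the mask */
	retval = security_task_setscheduler(p, 0, NULL);
	if (retval)
		goto out_unlock;

	/* sched_getaffinity(): ask before reporting the mask */
	retval = security_task_getscheduler(p);
	if (retval)
		goto out_unlock;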
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 5ce74abe78 (and its
dependent commit 8a5bc075b8), because of
audio underruns.
Reported by Rene Herman <rene.herman@keyaccess.nl>, who also pinpointed
the exact cause of the underruns:
"Audio underruns galore, with only ogg123 and firefox (browsing the
GIT tree online is also a nice trigger by the way).
If I back it out, everything is fine for me again."
Cc: Rene Herman <rene.herman@keyaccess.nl>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A few of the notifier_chain_register() callers use __devinitdata in the
definition of the notifier_block data structure. This is incorrect, as the
data structure should remain available after initialization (the callers do
not unregister it after init).
This was leading to an oops when a notifier call was invoked on those callback
chains after initialization.
This patch fixes all such usages to _not_ place the notifier_block data
structure in the init data section.
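In other words (illustrative names):

/* wrong: the structure can be discarded after init, yet it stays
 * linked into the notifier chain */
static struct notifier_block foo_nb __devinitdata = {
	.notifier_call = foo_callback,
};

/* right: keep the notifier_block in ordinary data */
static struct notifier_block foo_nb = {
	.notifier_call = foo_callback,
};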
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
RT tasks are being awakened on the expired array when expired_starving() is
true, whereas they really should be excluded. Fix.
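The fix boils down to something like this in the activation path (sketch; the
surrounding condition is simplified):

	/* never park an RT wakeup on the expired array, even when
	 * expired_starving(rq) says ordinary wakeups should go there */
	if (!rt_task(p) && expired_starving(rq))
		target = rq->expired;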
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a starvation problem that occurs when a stream of highly interactive tasks
delay an array switch for extended periods despite EXPIRED_STARVING(rq) being
true. AFAICT, the only choice is to enqueue awakening tasks on the expired
array in this case.
Without this patch, it can be nearly impossible to remotely login to a busy
server, and interactive shell commands can starve for minutes.
Also, convert the EXPIRED_STARVING macro into an inline function which humans
can understand.
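The converted helper reads roughly like this (sketch; constants as in
kernel/sched.c of that era):

static inline int expired_starving(struct runqueue *rq)
{
	if (rq->curr->static_prio > rq->best_expired_prio)
		return 1;
	if (!STARVATION_LIMIT || !rq->expired_timestamp)
		return 0;
	if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
		return 1;
	return 0;
}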
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To increase the strength of SCHED_BATCH as a scheduling hint we can
activate batch tasks on the expired array since by definition they are
latency insensitive tasks.
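Concretely, activation just picks the expired array for batch tasks (sketch):

static void __activate_task(struct task_struct *p, struct runqueue *rq)
{
	struct prio_array *target = rq->active;

	/* SCHED_BATCH tasks are latency-insensitive by definition */
	if (batch_task(p))
		target = rq->expired;
	enqueue_task(p, target);
}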
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On-runqueue time is used to elevate priority in schedule().
The code currently requeues tasks even if their priority is not elevated, which
ends up placing them at the end of their runqueue array, effectively delaying
them instead of improving their priority.
Bug spotted by Mike Galbraith <efault@gmx.de>.
This patch removes that requeueing.
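In effect, the priority-recalculation path in schedule() keeps the
dequeue/enqueue for a genuine priority change and simply drops the
unconditional requeue (sketch):

		new_prio = recalc_task_prio(next, next->timestamp + delta);
		if (unlikely(next->prio != new_prio)) {
			dequeue_task(next, array);
			next->prio = new_prio;
			enqueue_task(next, array);
		}
		/* the old "else requeue_task(next, array);" is gone: requeueing
		 * without a priority change only moved the task to the back of
		 * its queue, delaying it */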
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority, so
they need to be included in the idle detection code.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We watch for tasks that sleep for extended periods and do not allow a single
prolonged sleep period to elevate priority to the maximum bonus, to prevent
cpu-bound tasks from getting high priority with single long sleeps. There is a
bug in the current code that also penalises tasks that already have high
priority. Correct that bug.
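The fix is essentially to lift sleep_avg up to the ceiling only when it is
below it, never to pull an already higher value back down (sketch, names as in
recalc_task_prio()):

		if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
			unsigned long ceiling = INTERACTIVE_SLEEP(p);

			/* only promote; don't penalise tasks that already
			 * sleep enough to sit above the ceiling */
			if (p->sleep_avg < ceiling)
				p->sleep_avg = ceiling;
		}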
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alterations to the pipe code in the kernel made it possible for relative
starvation to occur, with tasks that slept waiting on a pipe getting unfair
priority bonuses even if they were otherwise fully cpu bound, so the
TASK_NONINTERACTIVE flag was introduced, which prevented any change to
sleep_avg while sleeping waiting on a pipe. However, this change also has the
converse effect of preventing any priority boost in truly interactive tasks
that wait on pipes.
Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE
which will allow a linear bonus to priority based on sleep time thus allowing
interactive tasks to get high priority if they sleep enough.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The activated flag in task_struct is used to track different sleep types and
its usage is somewhat obfuscated. Convert the variable to an enum with more
descriptive names without altering the function.
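The replacement enum reads roughly as follows:

enum sleep_type {
	SLEEP_NORMAL,
	SLEEP_NONINTERACTIVE,
	SLEEP_INTERACTIVE,
	SLEEP_INTERRUPTED,
};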
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, count_active_tasks() calls both nr_running() &
nr_uninterruptible(). Each of these functions does a "for_each_cpu" & reads
values from the runqueue of each cpu. Although this is not a lot of
instructions, each runqueue may be located on a different node. Depending on
the architecture, a unique TLB entry may be required to access each
runqueue.
Since there may be more runqueues than cpu TLB entries, a scan of all
runqueues can thrash the TLB. Each memory reference incurs a TLB miss &
refill.
In addition, the runqueue cacheline that contains nr_running &
nr_uninterruptible may be evicted from the cache between the two passes.
This causes unnecessary cache misses.
Combining nr_running() & nr_uninterruptible() into a single function
substantially reduces the TLB & cache misses on large systems. This should
have no measurable effect on smaller systems.
On a 128p IA64 system running a memory stress workload, the new function
reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.
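The combined helper is essentially a single pass over the online cpus (sketch):

unsigned long nr_active(void)
{
	unsigned long i, running = 0, uninterruptible = 0;

	for_each_online_cpu(i) {
		running += cpu_rq(i)->nr_running;
		uninterruptible += cpu_rq(i)->nr_uninterruptible;
	}

	/* nr_uninterruptible can transiently appear negative when summed
	 * across cpus; clamp it like the individual accessors do */
	if (unlikely((long)uninterruptible < 0))
		uninterruptible = 0;

	return running + uninterruptible;
}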
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>