remove the sleep-bonus interactivity code from the core scheduler.
scheduling policy is implemented in the policy modules, and CFS does
not need such type of heuristics.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the expired_starving() heuristics from the core scheduler.
CFS does not need it, and this did not really work well in practice
anyway, due to the rq->nr_running multiplier to STARVATION_LIMIT.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the sleep_type heuristics from the core scheduler - scheduling
policy is implemented in the scheduling-policy modules. (and CFS does
not use this type of sleep-type heuristics)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cleanup: move dequeue/enqueue_task() to a more logical place, to
not split up __normal_prio()/normal_prio().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move resched_task()/resched_cpu() into the 'public interfaces'
section of sched.c, for use by kernel/sched_fair/rt/idletask.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add rq_clock()/__rq_clock(), a robust wrapper around sched_clock(),
used by CFS. It protects against common type of sched_clock() problems
(caused by hardware): time warps forwards and backwards.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add the CFS rq data types to sched.c.
(the old scheduler fields are still intact, they are removed
by a later patch)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
create sched_stats.h and move sched.c schedstats code into it.
This cleans up sched.c a bit.
no code changes are caused by this patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add the init_idle_bootup_task() callback to the bootup thread,
unused at the moment. (CFS will use it to switch the scheduling
class of the boot thread to the idle class)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove sched_exit(): the elaborate dance of us trying to recover
timeslices given to child tasks never really worked.
CFS does not need it either.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the SMP load-balancer uses the boot-time migration-cost estimation
code to attempt to improve the quality of balancing. The reason for
this code is that the discrete priority queues do not preserve
the order of scheduling accurately, so the load-balancer skips
tasks that were running on a CPU 'recently'.
this code is fundamental fragile: the boot-time migration cost detector
doesnt really work on systems that had large L3 caches, it caused boot
delays on large systems and the whole cache-hot concept made the
balancing code pretty undeterministic as well.
(and hey, i wrote most of it, so i can say it out loud that it sucks ;-)
under CFS the same purpose of cache affinity can be achieved without
any special cache-hot special-case: tasks are sorted in the 'timeline'
tree and the SMP balancer picks tasks from the left side of the
tree, thus the most cache-cold task is balanced automatically.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
enum idle_type (used by the load-balancer) clashes with the
SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead
of 'SCHED_IDLE' is more descriptive as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The intervals of domains that do not have SD_BALANCE_NEWIDLE must be
considered for the calculation of the time of the next balance. Otherwise
we may defer rebalancing forever.
Siddha also spotted that the conversion of the balance interval
to jiffies is missing. Fix that to.
From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
also continue the loop if !(sd->flags & SD_LOAD_BALANCE).
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It did in fact trigger under all three of mainline, CFS, and -rt including CFS
-- see below for a couple of emails from last Friday giving results for these
three on the AMD box (where it happened) and on a single-quad NUMA-Q system
(where it did not, at least not with such severity).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.
He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it wont do while spinning in an
interrupt handler).
This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section. If there were races in the optimistic wait, of a preemption
event scheduled the process away, we simply re-synchronize, and start
over.
So the code now looks like this:
repeat:
/* Unlocked, optimistic looping! */
rq = task_rq(p);
while (task_running(rq, p))
cpu_relax();
/* Get the *real* values */
rq = task_rq_lock(p, &flags);
running = task_running(rq, p);
array = p->array;
task_rq_unlock(rq, &flags);
/* Check them.. */
if (unlikely(running)) {
cpu_relax();
goto repeat;
}
/* Preempted away? Yield if so.. */
if (unlikely(array)) {
yield();
goto repeat;
}
Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care. Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.
So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_. And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.
(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).
Thanks to Miklos for all the testing and tracking it down.
Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gene Heskett reported the following problem while testing CFS: SysRq-N
is not always effective in normalizing tasks back to SCHED_OTHER.
The reason for that turns out to be the following bug:
- normalize_rt_tasks() uses for_each_process() to iterate through all
tasks in the system. The problem is, this method does not iterate
through all tasks, it iterates through all thread groups.
The proper mechanism to enumerate over all threads is to use a
do_each_thread() + while_each_thread() loop.
Reported-by: Gene Heskett <gene.heskett@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NOHZ patch contains a check for softirqs pending when a CPU goes idle.
The BUG is unrelated to NOHZ, it just was made visible by the NOHZ patch.
The BUG showed up mainly on P4 / hyperthreading enabled machines which lead
the investigations into the wrong direction in the first place. The real
cause is in cond_resched_softirq():
cond_resched_softirq() is enabling softirqs without invoking the softirq
daemon when softirqs are pending. This leads to the warning message in the
NOHZ idle code:
t1 runs softirq disabled code on CPU#0
interrupt happens, softirq is raised, but deferred (softirqs disabled)
t1 calls cond_resched_softirq()
enables softirqs via _local_bh_enable()
calls schedule()
t2 runs
t1 is migrated to CPU#1
t2 is done and invokes idle()
NOHZ detects the pending softirq
Fix: change _local_bh_enable() to local_bh_enable() so the softirq
daemon is invoked.
Thanks to Anant Nitya for debugging this with great patience !
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eliminate lock_cpu_hotplug from kernel/sched.c and use sched_hotcpu_mutex
instead to postpone a hotplug event.
In the migration_call hotcpu callback function, take sched_hotcpu_mutex
while handling the event CPU_LOCK_ACQUIRE and release it while handling
CPU_LOCK_RELEASE event.
[akpm@linux-foundation.org: fix deadlock]
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit bd53f96ca5.
Con says:
This is no good, sorry. The one I saw originally was with the staircase
deadline cpu scheduler in situ and was different.
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
(((p)->prio < (rq)->curr->prio) && ((p)->array == (rq)->active))
This will fail to wake up a runqueue for a task that has been migrated to the
expired array of a runqueue which is otherwise idle which can happen with smp
balancing,
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>