Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: (61 commits)
  sched: refine negative nice level granularity
  sched: fix update_stats_enqueue() reniced codepath
  sched: round a bit better
  sched: make the multiplication table more accurate
  sched: optimize update_rq_clock() calls in the load-balancer
  sched: optimize activate_task()
  sched: clean up set_curr_task_fair()
  sched: remove __update_rq_clock() call from entity_tick()
  sched: move the __update_rq_clock() call to scheduler_tick()
  sched debug: remove the 'u64 now' parameter from print_task()/_rq()
  sched: remove the 'u64 now' local variables
  sched: remove the 'u64 now' parameter from deactivate_task()
  sched: remove the 'u64 now' parameter from dequeue_task()
  sched: remove the 'u64 now' parameter from enqueue_task()
  sched: remove the 'u64 now' parameter from dec_nr_running()
  sched: remove the 'u64 now' parameter from inc_nr_running()
  sched: remove the 'u64 now' parameter from dec_load()
  sched: remove the 'u64 now' parameter from inc_load()
  sched: remove the 'u64 now' parameter from update_curr_load()
  sched: remove the 'u64 now' parameter from ->task_new()
  ...
@@ -83,7 +83,7 @@ Some implementation details:
 CFS uses nanosecond granularity accounting and does not rely on any
 jiffies or other HZ detail. Thus the CFS scheduler has no notion of
 'timeslices' and has no heuristics whatsoever. There is only one
-central tunable:
+central tunable (you have to switch on CONFIG_SCHED_DEBUG):
 
    /proc/sys/kernel/sched_granularity_ns
 
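For illustration only, a user-space program can read this tunable directly. This is a sketch, not part of the patch, and it assumes a kernel built with CONFIG_SCHED_DEBUG (otherwise the file does not exist):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long gran_ns;
		FILE *f = fopen("/proc/sys/kernel/sched_granularity_ns", "r");

		if (!f) {
			perror("sched_granularity_ns");
			return 1;
		}
		if (fscanf(f, "%llu", &gran_ns) != 1) {
			fclose(f);
			return 1;
		}
		fclose(f);

		/* Print the CFS granularity in nanoseconds. */
		printf("CFS granularity: %llu ns\n", gran_ns);
		return 0;
	}
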
@@ -0,0 +1,108 @@

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):

                    A
              \     | [timeslice length]
               \    |
                \   |
                 \  |
                  \ |
                   \|___100msecs
                    |^ . _
                    |      ^ . _
                    |            ^ . _
  -*----------------------------------*-----> [nice level]
  -20               |                +19

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage,
which we felt to be a bit excessive. Excessive _not_ because it's too
small of a CPU utilization, but because it causes too frequent (once
per millisec) rescheduling (and would thus thrash the cache, etc.
Remember, this was long ago when hardware was weaker and caches were
smaller, and people were running number-crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative":

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.
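
To make the relative API concrete, here is a minimal user-space sketch
(an illustrative example using the standard nice() and getpriority()
calls, not part of the patch):

	#include <stdio.h>
	#include <unistd.h>
	#include <errno.h>
	#include <sys/resource.h>

	int main(void)
	{
		errno = 0;
		/* Relative: add 5 to whatever nice level we inherited. */
		int new_nice = nice(5);

		if (new_nice == -1 && errno != 0) {
			perror("nice");
			return 1;
		}

		/* Absolute: query the resulting nice level of this process. */
		int abs_nice = getpriority(PRIO_PROCESS, 0);

		printf("nice(5) -> %d (getpriority reports %d)\n",
		       new_nice, abs_nice);
		return 0;
	}

Run from shells sitting at different nice levels, the increment is
always applied relative to the inherited value, which is exactly the
property discussed next.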

With the old scheduler, if you, for example, started a niced task with
+1 and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not
starvation-proof, and a buggy SCHED_FIFO app can also lock up the system
for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels) and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.
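
A back-of-the-envelope model shows where the roughly 55%/45% split comes
from. The sketch below simply assumes that each nice step scales a
task's load weight by a constant factor of about 1.25; the kernel uses a
precomputed weight table rather than pow(), but the principle is the
same:

	/* build: cc -o nice_split nice_split.c -lm */
	#include <stdio.h>
	#include <math.h>

	/* Toy model: weight shrinks by ~1.25x per positive nice step. */
	static double nice_to_weight(int nice)
	{
		return pow(1.25, -nice);
	}

	static void print_split(int nice_a, int nice_b)
	{
		double wa = nice_to_weight(nice_a);
		double wb = nice_to_weight(nice_b);

		printf("nice %+3d vs nice %+3d -> %4.1f%% / %4.1f%%\n",
		       nice_a, nice_b,
		       100.0 * wa / (wa + wb), 100.0 * wb / (wa + wb));
	}

	int main(void)
	{
		/* The absolute level does not matter, only the distance: */
		print_split(+10, +11);
		print_split(-5, -4);
		print_split(0, +1);
		return 0;
	}

All three pairs print the same split (about 55.6%/44.4%), which is the
'relative result' property described above.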

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.
+9 -11
@@ -139,7 +139,7 @@ struct cfs_rq;
 extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
 extern void proc_sched_set_task(struct task_struct *p);
 extern void
-print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now);
+print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
 #else
 static inline void
 proc_sched_show_task(struct task_struct *p, struct seq_file *m)
@@ -149,7 +149,7 @@ static inline void proc_sched_set_task(struct task_struct *p)
 {
 }
 static inline void
-print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now)
+print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
 }
 #endif
@@ -855,26 +855,24 @@ struct sched_domain;
 struct sched_class {
 	struct sched_class *next;
 
-	void (*enqueue_task) (struct rq *rq, struct task_struct *p,
-			      int wakeup, u64 now);
-	void (*dequeue_task) (struct rq *rq, struct task_struct *p,
-			      int sleep, u64 now);
+	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
+	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq, struct task_struct *p);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
-	struct task_struct * (*pick_next_task) (struct rq *rq, u64 now);
-	void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now);
+	struct task_struct * (*pick_next_task) (struct rq *rq);
+	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
-	int (*load_balance) (struct rq *this_rq, int this_cpu,
+	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
 			struct rq *busiest,
 			unsigned long max_nr_move, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
-			int *all_pinned, unsigned long *total_load_moved);
+			int *all_pinned, int *this_best_prio);
 
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
-	void (*task_new) (struct rq *rq, struct task_struct *p, u64 now);
+	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
 
 struct load_weight {
+178 -161 (file diff suppressed because it is too large)
@@ -29,7 +29,7 @@
 } while (0)
 
 static void
-print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now)
+print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 {
 	if (rq->curr == p)
 		SEQ_printf(m, "R");
@@ -56,7 +56,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now)
 #endif
 }
 
-static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu, u64 now)
+static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 {
 	struct task_struct *g, *p;
 
@@ -77,7 +77,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu, u64 now)
 		if (!p->se.on_rq || task_cpu(p) != rq_cpu)
 			continue;
 
-		print_task(m, rq, p, now);
+		print_task(m, rq, p);
 	} while_each_thread(g, p);
 
 	read_unlock_irq(&tasklist_lock);
@@ -106,7 +106,7 @@ print_cfs_rq_runtime_sum(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		(long long)wait_runtime_rq_sum);
 }
 
-void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now)
+void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
 	SEQ_printf(m, "\ncfs_rq %p\n", cfs_rq);
 
@@ -124,7 +124,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now)
 	print_cfs_rq_runtime_sum(m, cpu, cfs_rq);
 }
 
-static void print_cpu(struct seq_file *m, int cpu, u64 now)
+static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = &per_cpu(runqueues, cpu);
 
@@ -166,9 +166,9 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 	P(cpu_load[4]);
 #undef P
 
-	print_cfs_stats(m, cpu, now);
+	print_cfs_stats(m, cpu);
 
-	print_rq(m, rq, cpu, now);
+	print_rq(m, rq, cpu);
 }
 
 static int sched_debug_show(struct seq_file *m, void *v)
@@ -184,7 +184,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
 	SEQ_printf(m, "now at %Lu nsecs\n", (unsigned long long)now);
 
 	for_each_online_cpu(cpu)
-		print_cpu(m, cpu, now);
+		print_cpu(m, cpu);
 
 	SEQ_printf(m, "\n");
 
+97 -117 (file diff suppressed because it is too large)
@@ -13,7 +13,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p)
 	resched_task(rq->idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq, u64 now)
+static struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	schedstat_inc(rq, sched_goidle);
 
@@ -25,7 +25,7 @@ static struct task_struct *pick_next_task_idle(struct rq *rq, u64 now)
  * message if some code attempts to do it:
  */
 static void
-dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep, u64 now)
+dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep)
 {
 	spin_unlock_irq(&rq->lock);
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
@@ -33,15 +33,15 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep, u64 now)
 	spin_lock_irq(&rq->lock);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, u64 now)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
 }
 
-static int
+static unsigned long
 load_balance_idle(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_nr_move, unsigned long max_load_move,
 		  struct sched_domain *sd, enum cpu_idle_type idle,
-		  int *all_pinned, unsigned long *total_load_moved)
+		  int *all_pinned, int *this_best_prio)
 {
 	return 0;
 }
+16 -32
@@ -7,7 +7,7 @@
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
  */
-static inline void update_curr_rt(struct rq *rq, u64 now)
+static inline void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
@@ -15,18 +15,17 @@ static inline void update_curr_rt(struct rq *rq, u64 now)
 	if (!task_has_rt_policy(curr))
 		return;
 
-	delta_exec = now - curr->se.exec_start;
+	delta_exec = rq->clock - curr->se.exec_start;
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
 	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
-	curr->se.exec_start = now;
+	curr->se.exec_start = rq->clock;
 }
 
-static void
-enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
+static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 
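The hunk above shows the pattern that recurs throughout this series:
instead of threading a 'u64 now' timestamp through every scheduling-class
callback, the runqueue carries a clock that is updated once at the
scheduler entry points and read as rq->clock inside the callbacks. A
stand-alone sketch of the idea, using simplified stand-in types rather
than the kernel's actual structures:

	#include <stdio.h>

	typedef unsigned long long u64;

	/* Simplified stand-ins for the kernel's scheduler types. */
	struct sched_entity {
		u64 exec_start;
		u64 sum_exec_runtime;
	};

	struct rq {
		u64 clock;		/* updated once per scheduler entry */
		struct sched_entity curr_se;
	};

	/* No 'u64 now' argument: the helper reads rq->clock itself. */
	static void update_curr(struct rq *rq)
	{
		u64 delta_exec = rq->clock - rq->curr_se.exec_start;

		rq->curr_se.sum_exec_runtime += delta_exec;
		rq->curr_se.exec_start = rq->clock;
	}

	int main(void)
	{
		struct rq rq = { .clock = 1000,
				 .curr_se = { .exec_start = 400 } };

		rq.clock = 2500;	/* the role update_rq_clock() plays in the kernel */
		update_curr(&rq);

		printf("ran for %llu ns\n", rq.curr_se.sum_exec_runtime);
		return 0;
	}
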
@@ -37,12 +36,11 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
 /*
  * Adding/removing a task to/from a priority array:
  */
-static void
-dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep, u64 now)
+static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 
-	update_curr_rt(rq, now);
+	update_curr_rt(rq);
 
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
@@ -75,7 +73,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p)
 		resched_task(rq->curr);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq, u64 now)
+static struct task_struct *pick_next_task_rt(struct rq *rq)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 	struct task_struct *next;
@@ -89,14 +87,14 @@ static struct task_struct *pick_next_task_rt(struct rq *rq, u64 now)
 	queue = array->queue + idx;
 	next = list_entry(queue->next, struct task_struct, run_list);
 
-	next->se.exec_start = now;
+	next->se.exec_start = rq->clock;
 
 	return next;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p, u64 now)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
-	update_curr_rt(rq, now);
+	update_curr_rt(rq);
 	p->se.exec_start = 0;
 }
 
@@ -172,28 +170,15 @@ static struct task_struct *load_balance_next_rt(void *arg)
 	return p;
 }
 
-static int
+static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		unsigned long max_nr_move, unsigned long max_load_move,
 		struct sched_domain *sd, enum cpu_idle_type idle,
-		int *all_pinned, unsigned long *load_moved)
+		int *all_pinned, int *this_best_prio)
 {
-	int this_best_prio, best_prio, best_prio_seen = 0;
 	int nr_moved;
 	struct rq_iterator rt_rq_iterator;
-
-	best_prio = sched_find_first_bit(busiest->rt.active.bitmap);
-	this_best_prio = sched_find_first_bit(this_rq->rt.active.bitmap);
-
-	/*
-	 * Enable handling of the case where there is more than one task
-	 * with the best priority. If the current running task is one
-	 * of those with prio==best_prio we know it won't be moved
-	 * and therefore it's safe to override the skip (based on load)
-	 * of any task we find with that prio.
-	 */
-	if (busiest->curr->prio == best_prio)
-		best_prio_seen = 1;
+	unsigned long load_moved;
 
 	rt_rq_iterator.start = load_balance_start_rt;
 	rt_rq_iterator.next = load_balance_next_rt;
@@ -203,11 +188,10 @@ load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	rt_rq_iterator.arg = busiest;
 
 	nr_moved = balance_tasks(this_rq, this_cpu, busiest, max_nr_move,
-			max_load_move, sd, idle, all_pinned, load_moved,
-			this_best_prio, best_prio, best_prio_seen,
-			&rt_rq_iterator);
+			max_load_move, sd, idle, all_pinned, &load_moved,
+			this_best_prio, &rt_rq_iterator);
 
-	return nr_moved;
+	return load_moved;
 }
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)