We always use resched_task() with rq->curr argument.
It's not possible to reschedule any task but rq's current.
The patch introduces resched_curr(struct rq *) to
replace all of the repeating patterns. The main aim
is cleanup, but there is a little size profit too:
(before)
$ size kernel/sched/built-in.o
text data bss dec hex filename
155274 16445 7042 178761 2ba49 kernel/sched/built-in.o
$ size vmlinux
text data bss dec hex filename
7411490 1178376 991232 9581098 92322a vmlinux
(after)
$ size kernel/sched/built-in.o
text data bss dec hex filename
155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o
$ size vmlinux
text data bss dec hex filename
7411362 1178376 991232 9580970 9231aa vmlinux
I was choosing between resched_curr() and resched_rq(),
and the first name looks better for me.
A little lie in Documentation/trace/ftrace.txt. I have not
actually collected the tracing again. With a hope the patch
won't make execution times much worse :)
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We kill rq->rd on the CPU_DOWN_PREPARE stage:
cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
-> cpu_attach_domain -> rq_attach_root -> set_rq_offline
This unthrottles all throttled cfs_rqs.
But the cpu is still able to call schedule() till
take_cpu_down->__cpu_disable()
is called from stop_machine.
This case the tasks from just unthrottled cfs_rqs are pickable
in a standard scheduler way, and they are picked by dying cpu.
The cfs_rqs becomes throttled again, and migrate_tasks()
in migration_call skips their tasks (one more unthrottle
in migrate_tasks()->CPU_DYING does not happen, because rq->rd
is already NULL).
Patch sets runtime_enabled to zero. This guarantees, the runtime
is not accounted, and the cfs_rqs won't exceed given
cfs_rq->runtime_remaining = 1, and tasks will be pickable
in migrate_tasks(). runtime_enabled is recalculated again
when rq becomes online again.
Ben Segall also noticed, we always enable runtime in
tg_set_cfs_bandwidth(). Actually, we should do that for online
cpus only. To prevent races with unthrottle_offline_cfs_rqs()
we take get_online_cpus() lock.
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Konstantin Khorenko <khorenko@parallels.com>
CC: Paul Turner <pjt@google.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reading through the scan period code and comment, it appears the
intent was to slow down NUMA scanning when a majority of accesses
are on the local node, specifically a local:remote ratio of 3:1.
However, the code actually tests local / (local + remote), and
the actual cut-off point was around 30% local accesses, well before
a task has actually converged on a node.
Changing the threshold to 7 means scanning slows down when a task
has around 70% of its accesses local, which appears to match the
intent of the code more closely.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4
node system results in 160 runnable threads on a system with 80
CPU threads.
Once a process has nearly converged, with 39 threads on one node
and 1 thread on another node, the remaining thread will be unable
to migrate to its preferred node through a task swap.
However, a simple task move would make the workload converge,
witout causing an imbalance.
Test for this unlikely occurrence, and attempt a task move to
the preferred nid when it happens.
# Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"
###
# 160 tasks will execute (on 4 nodes, 80 CPUs):
# -1x 0MB global shared mem operations
# -1x 1000MB process shared mem operations
# -1x 0MB thread local mem operations
###
###
#
# 0.0% [0.2 mins] 0/0 1/1 36/2 0/0 [36/3 ] l: 0-0 ( 0) {0-2}
# 0.0% [0.3 mins] 43/3 37/2 39/2 41/3 [ 6/10] l: 0-1 ( 1) {1-2}
# 0.0% [0.4 mins] 42/3 38/2 40/2 40/2 [ 4/9 ] l: 1-2 ( 1) [50.0%] {1-2}
# 0.0% [0.6 mins] 41/3 39/2 40/2 40/2 [ 2/9 ] l: 2-4 ( 2) [50.0%] {1-2}
# 0.0% [0.7 mins] 40/2 40/2 40/2 40/2 [ 0/8 ] l: 3-5 ( 2) [40.0%] ( 41.8s converged)
Without this patch, this same perf bench numa mem run had to
rely on the scheduler load balancer to first balance out the
load (moving a random task), before a task swap could complete
the NUMA convergence.
The load balancer does not normally take action unless the load
difference exceeds 25%. Convergence times of over half an hour
have been observed without this patch.
With this patch, the NUMA balancing code will simply migrate the
task, if that does not cause an imbalance.
Also skip examining a CPU in detail if the improvement on that CPU
is no more than the best we already have.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. The active groups
on the source and destination CPU can be different, resulting in a
different load contribution by the same task at its source and at its
destination. As a result, the load needs to be calculated separately
for each CPU, instead of estimated once with task_h_load().
Getting this calculation right allows some workloads to converge,
where previously the last thread could get stuck on another node,
without being able to migrate to its final destination.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.
On systems with SMT/HT, this results in a task being weighed
much more heavily than a CPU core, and a task move that would
even out the load between nodes being disallowed.
The correct thing is to apply the power correction to the
numbers after we have first applied the move of the tasks'
loads to them.
This also allows us to do the power correction with a multiplication,
rather than a division.
Also drop two function arguments for load_too_unbalanced, since it
takes various factors from env already.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempt to pull job to a cpu before putting it to idle is unnecessary and
can be skipped. This patch adds an indicator so the scheduler can know
when there's no more than 1 active job is on any CPU in the system to
skip needless job pulls.
On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec delay when we go through a full load
balance to try pull job from all the other cpus. While 0.1 msec was
spent on processing the request and generating a response, the 0.13 msec
load balance overhead was actually more than the actual work being done.
This overhead can be skipped much of the time for lightly loaded systems.
With this patch, we tested with a netperf request/response workload that
has the server busy with half the cpus in a 4 socket system. We found
the patch eliminated 75% of the load balance attempts before idling a cpu.
The overhead of setting/clearing the indicator is low as we already gather
the necessary info while we call add_nr_running() and update_sd_lb_stats.()
We switch to full load balance load immediately if any cpu got more than
one job on its run queue in add_nr_running. We'll clear the indicator
to avoid load balance when we detect no cpu's have more than one job
when we scan the work queues in update_sg_lb_stats(). We are aggressive
in turning on the load balance and opportunistic in skipping the load
balance.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
distribute_cfs_runtime() intentionally only hands out enough runtime to
bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
the runtime they need only once they actually get to run. However, if
they get to run sufficiently quickly, the period timer is still in
distribute_cfs_runtime() and no runtime is available, causing them to
throttle. Then distribute has to handle them again, and this can go on
until distribute has handed out all of the runtime 1ns at a time, which
takes far too long.
Instead allow access to the same runtime that distribute is handing out,
accepting that corner cases with very low quota may be able to spend the
entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
runtime directly handed out by distribute_cfs_runtime was over quota. In
addition, if a cfs_rq does manage to throttle like this, make sure the
existing distribute_cfs_runtime no longer loops over it again.
Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The first thing task_numa_migrate() does is check to see if there is
CPU capacity available on the preferred node, in order to move the
task there.
However, if the preferred node is all busy, we would skip considering
that node for tasks swaps in the subsequent loop. This prevents NUMA
convergence of tasks on busy systems.
However, swapping locations with a task on our preferred nid, when
the preferred nid is busy, is perfectly fine.
The fix is to also look for a CPU on our preferred nid when it is
totally busy.
This changes "perf bench numa mem -p 4 -t 20 -m -0 -P 1000" from
not converging in 15 minutes on my 4 node system, to converging in
10-20 seconds.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604160942.6969b101@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more scheduler updates from Ingo Molnar:
"Second round of scheduler changes:
- try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
- continued power scheduling cleanups and refactorings, from Nicolas
Pitre
- misc fixes and enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Delete extraneous extern for to_ratio()
sched/idle: Optimize try-to-wake-up IPI
sched/idle: Simplify wake_up_idle_cpu()
sched/idle: Clear polling before descheduling the idle thread
sched, trace: Add a tracepoint for IPI-less remote wakeups
cpuidle: Set polling in poll_idle
sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
sched: Rename capacity related flags
sched: Final power vs. capacity cleanups
sched: Remove remaining dubious usage of "power"
sched: Let 'struct sched_group_power' care about CPU capacity
sched/fair: Disambiguate existing/remaining "capacity" usage
sched/fair: Change "has_capacity" to "has_free_capacity"
sched/fair: Remove "power" from 'struct numa_stats'
sched: Fix signedness bug in yield_to()
sched/fair: Use time_after() in record_wakee()
sched/balancing: Reduce the rate of needless idle load balancing
sched/fair: Fix unlocked reads of some cfs_b->quota/period
This function is supposed to return true if the new load imbalance is
worse than the old one. It didn't. I can only hope brown paper bags
are in style.
Now things converge much better on both the 4 node and 8 node systems.
I am not sure why this did not seem to impact specjbb performance on the
4 node system, which is the system I have full-time access to.
This bug was introduced recently, with commit e63da03639 ("sched/numa:
Allow task switch if load imbalance improves")
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...