tickless idle has a negative side effect on update_cpu_load(), which
in turn can affect load balancing behavior.
update_cpu_load() is supposed to be called every tick, to keep track
of various load indicies. With tickless idle, there are no scheduler
ticks called on the idle CPUs. Idle CPUs may still do load balancing
(with idle_load_balance CPU) using the stale cpu_load. It will also
cause problems when all CPUs go idle for a while and become active
again. In this case loads would not degrade as expected.
This is how rq->nr_load_updates change looks like under different
conditions:
<cpu_num> <nr_load_updates change>
All CPUS idle for 10 seconds (HZ=1000)
0 1621
10 496
11 139
12 875
13 1672
14 12
15 21
1 1472
2 2426
3 1161
4 2108
5 1525
6 701
7 249
8 766
9 1967
One CPU busy rest idle for 10 seconds
0 10003
10 601
11 95
12 966
13 1597
14 114
15 98
1 3457
2 93
3 6679
4 1425
5 1479
6 595
7 193
8 633
9 1687
All CPUs busy for 10 seconds
0 10026
10 10026
11 10026
12 10026
13 10025
14 10025
15 10025
1 10026
2 10026
3 10026
4 10026
5 10026
6 10026
7 10026
8 10026
9 10026
That is update_cpu_load works properly only when all CPUs are busy.
If all are idle, all the CPUs get way lower updates. And when few
CPUs are busy and rest are idle, only busy and ilb CPU does proper
updates and rest of the idle CPUs will do lower updates.
The patch keeps track of when a last update was done and fixes up
the load avg based on current time.
On one of my test system SPECjbb with warehouse 1..numcpus, patch
improves throughput numbers by ~1% (average of 6 runs). On another
test system (with different domain hierarchy) there is no noticable
change in perf.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- Contrary to what 6d558c3a says, there is no need to reload
prev = rq->curr after the context switch. You always schedule
back to where you came from, prev must be equal to current
even if cpu/rq was changed.
- This also means reacquire_kernel_lock() can use prev instead
of current.
- No need to reassign switch_count if reacquire_kernel_lock()
reports need_resched(), we can just move the initial assignment
down, under the "need_resched_nonpreemptible:" label.
- Try to update the comment after context_switch().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100519125711.GA30199@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For people who otherwise get to write: cpu_clock(smp_processor_id()),
there is now: local_clock().
Also, as per suggestion from Andrew, provide some documentation on
the various clock interfaces, and minimize the unsigned long long vs
u64 mess.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
LKML-Reference: <1275052414.1645.52.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Concurrency managed workqueue needs to know when workers are going to
sleep and waking up. Using these two hooks, cmwq keeps track of the
current concurrency level and throttles execution of new works if it's
too high and wakes up another worker from the sleep hook if it becomes
too low.
This patch introduces PF_WQ_WORKER to identify workqueue workers and
adds the following two hooks.
* wq_worker_waking_up(): called when a worker is woken up.
* wq_worker_sleeping(): called when a worker is going to sleep and may
return a pointer to a local task which should be woken up. The
returned task is woken up using try_to_wake_up_local() which is
simplified ttwu which is called under rq lock and can only wake up
local tasks.
Both hooks are currently defined as noop in kernel/workqueue_sched.h.
Later cmwq implementation will replace them with proper
implementation.
These hooks are hard coded as they'll always be enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise. Depending on configuration options, it ends up
generating the same object code as before or slightly different one
due to different register assignment.
This is to help future implementation of try_to_wake_up_local().
Mike Galbraith suggested rename to ttwu_post_activation() from
ttwu_woken_up() and comment update in try_to_wake_up().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Currently, when a cpu goes down, cpu_active is cleared before
CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
default priority cpu notifier. When a cpu is coming up, it's set
before CPU_ONLINE but cpuset configuration again is updated from the
same cpu notifier.
For cpu notifiers, this presents an inconsistent state. Threads which
a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
migrated to other cpus because the cpu is no more inactive.
Fix it by updating cpu_active in the highest priority cpu notifier and
cpuset configuration in the second highest when a cpu is coming up.
Down path is updated similarly. This guarantees that all other cpu
notifiers see consistent cpu_active and cpuset configuration.
cpuset_track_online_cpus() notifier is converted to
cpuset_update_active_cpus() which just updates the configuration and
now called from cpuset_cpu_[in]active() notifiers registered from
sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus()
degenerates into partition_sched_domains() making separate notifier
for !CONFIG_CPUSETS unnecessary.
This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug
callback creates a kthread and kthread_bind()s it to the target cpu,
and the thread is expected to run on that cpu.
* Ingo's test discovered __cpuinit/exit markups were incorrect.
Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Menage <menage@google.com>
PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
typically holds rq->lock around the css rcu derefs but the generic
cgroup code doesn't (and can't) know about that lock.
Provide means to add extra checks to the css dereference and use that
in the scheduler to annotate its users.
The addition of rq->lock to these checks is correct because the
cgroup_subsys::attach() method takes the rq->lock for each task it
moves, therefore by holding that lock, we ensure the task is pinned to
the current cgroup and the RCU derefence is valid.
That leaves one genuine race in __sched_setscheduler() where we used
task_group() without holding any of the required locks and thus raced
with the cgroup code. Solve this by moving the check under the
appropriate lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
module: fix bne2 "gave up waiting for init of module libcrc32c"
module: verify_export_symbols under the lock
module: move find_module check to end
module: make locking more fine-grained.
module: Make module sysfs functions private.
module: move sysfs exposure to end of load_module
module: fix kdb's illicit use of struct module_use.
module: Make the 'usage' lists be two-way
Problem: it's hard to avoid an init routine stumbling over a
request_module these days. And it's not clear it's always a bad idea:
for example, a module like kvm with dynamic dependencies on kvm-intel
or kvm-amd would be neater if it could simply request_module the right
one.
In this particular case, it's libcrc32c:
libcrc32c_mod_init
crypto_alloc_shash
crypto_alloc_tfm
crypto_find_alg
crypto_alg_mod_lookup
crypto_larval_lookup
request_module
If another module is waiting inside resolve_symbol() for libcrc32c to
finish initializing (ie. bne2 depends on libcrc32c) then it does so
holding the module lock, and our request_module() can't make progress
until that is released.
Waiting inside resolve_symbol() without the lock isn't all that hard:
we just need to pass the -EBUSY up the call chain so we can sleep
where we don't hold the lock. Error reporting is a bit trickier: we
need to copy the name of the unfinished module before releasing the
lock.
Other notes:
1) This also fixes a theoretical issue where a weak dependency would allow
symbol version mismatches to be ignored.
2) We rename use_module to ref_module to make life easier for the only
external user (the out-of-tree ksplice patches).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Abbot <tabbott@ksplice.com>
Tested-by: Brandon Philips <bphilips@suse.de>
It disabled preempt so it was "safe", but nothing stops another module
slipping in before this module is added to the global list now we don't
hold the lock the whole time.
So we check this just after we check for duplicate modules, and just
before we put the module in the global list.
(find_symbol finds symbols in coming and going modules, too).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I think Rusty may have made the lock a bit _too_ finegrained there, and
didn't add it to some places that needed it. It looks, for example, like
PATCH 1/2 actually drops the lock in places where it's needed
("find_module()" is documented to need it, but now load_module() didn't
hold it at all when it did the find_module()).
Rather than adding a new "module_loading" list, I think we should be able
to just use the existing "modules" list, and just fix up the locking a
bit.
In fact, maybe we could just move the "look up existing module" a bit
later - optimistically assuming that the module doesn't exist, and then
just undoing the work if it turns out that we were wrong, just before
adding ourselves to the list.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Kay Sievers <kay.sievers@vrfy.org> reports that we still have some
contention over module loading which is slowing boot.
Linus also disliked a previous "drop lock and regrab" patch to fix the
bne2 "gave up waiting for init of module libcrc32c" message.
This is more ambitious: we only grab the lock where we need it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Brandon Philips <brandon@ifup.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
These were placed in the header in ef665c1a06 to get the various
SYSFS/MODULE config combintations to compile.
That may have been necessary then, but it's not now. These functions
are all local to module.c.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
This means a little extra work, but is more logical: we don't put
anything in sysfs until we're about to put the module into the
global list an parse its parameters.
This also gives us a logical place to put duplicate module detection
in the next patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When adding a module that depends on another one, we used to create a
one-way list of "modules_which_use_me", so that module unloading could
see who needs a module.
It's actually quite simple to make that list go both ways: so that we
not only can see "who uses me", but also see a list of modules that are
"used by me".
In fact, we always wanted that list in "module_unload_free()": when we
unload a module, we want to also release all the other modules that are
used by that module. But because we didn't have that list, we used to
first iterate over all modules, and then iterate over each "used by me"
list of that module.
By making the list two-way, we simplify module_unload_free(), and it
allows for some trivial fixes later too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned & rebased)
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: make blk_init_free_list and elevator_init idempotent
block: avoid unconditionally freeing previously allocated request_queue
pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
pipe: change the privilege required for growing a pipe beyond system max
pipe: adjust minimum pipe size to 1 page
block: disable preemption before using sched_clock()
cciss: call BUG() earlier
Preparing 8.3.8rc2
drbd: Reduce verbosity
drbd: use drbd specific ratelimit instead of global printk_ratelimit
drbd: fix hang on local read errors while disconnected
drbd: Removed the now empty w_io_error() function
drbd: removed duplicated #includes
drbd: improve usage of MSG_MORE
drbd: need to set socket bufsize early to take effect
drbd: improve network latency, TCP_QUICKACK
drbd: Revert "drbd: Create new current UUID as late as possible"
brd: support discard
Revert "writeback: fix WB_SYNC_NONE writeback from umount"
Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"
...
The commit 80b5184cc5 ("kernel/: convert cpu
notifier to return encapsulate errno value") changed the return value of
cpu notifier callbacks.
Those callbacks don't return NOTIFY_BAD on failures anymore. But there
are a few callbacks which are called directly at init time and checking
the return value.
I forgot to change BUG_ON checking by the direct callers in the commit.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Child groups should have a greater depth than their parents. Prior to
this change, the parent would incorrectly report zero memory usage for
child cgroups when use_hierarchy is enabled.
test script:
mount -t cgroup none /cgroups -o memory
cd /cgroups
mkdir cg1
echo 1 > cg1/memory.use_hierarchy
mkdir cg1/cg11
echo $$ > cg1/cg11/tasks
dd if=/dev/zero of=/tmp/foo bs=1M count=1
echo
echo CHILD
grep cache cg1/cg11/memory.stat
echo
echo PARENT
grep cache cg1/memory.stat
echo $$ > tasks
rmdir cg1/cg11 cg1
cd /
umount /cgroups
Using fae9c79, a recent patch that changed alloc_css_id() depth computation,
the parent incorrectly reports zero usage:
root@ubuntu:~# ./test
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s
CHILD
cache 1048576
total_cache 1048576
PARENT
cache 0
total_cache 0
With this patch, the parent correctly includes child usage:
root@ubuntu:~# ./test
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s
CHILD
cache 1052672
total_cache 1052672
PARENT
cache 0
total_cache 1052672
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: <stable@kernel.org> [2.6.34.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_struct->pesonality is "unsigned int", but sys_personality() paths use
"unsigned long pesonality". This means that every assignment or
comparison is not right. In particular, if this argument does not fit
into "unsigned int" __set_personality() changes the caller's personality
and then sys_personality() returns -EINVAL.
Turn this argument into "unsigned int" and avoid overflows. Obviously,
this is the user-visible change, we just ignore the upper bits. But this
can't break the sane application.
There is another thing which can confuse the poorly written applications.
User-space thinks that this syscall returns int, not long. This means
that the returned value can be negative and look like the error code. But
note that libc won't be confused and thus errno won't be set, and with
this patch the user-space can never get -1 unless sys_personality() really
fails. And, most importantly, the negative RET != -1 is only possible if
that app previously called personality(RET).
Pointed-out-by: Wenming Zhang <wezhang@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched, trace: Fix sched_switch() prev_state argument
sched: Fix wake_affine() vs RT tasks
sched: Make sure timers have migrated before killing the migration_thread
* 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Fix crash in swevents
perf buildid-list: Fix --with-hits event processing
perf scripts python: Give field dict to unhandled callback
perf hist: fix objdump output parsing
perf-record: Check correct pid when forking
perf: Do the comm inheritance per thread in event__process_task
perf: Use event__process_task from perf sched
perf: Process comm events by tid
blktrace: Fix new kernel-doc warnings
perf_events: Fix unincremented buffer base on partial copy
perf_events: Fix event scheduling issues introduced by transactional API
perf_events, trace: Fix perf_trace_destroy(), mutex went missing
perf_events, trace: Fix probe unregister race
perf_events: Fix races in group composition
perf_events: Fix races and clean up perf_event and perf_mmap_data interaction
Frederic reported that because swevents handling doesn't disable IRQs
anymore, we can get a recursion of perf_adjust_period(), once from
overflow handling and once from the tick.
If both call ->disable, we get a double hlist_del_rcu() and trigger
a LIST_POISON2 dereference.
Since we don't actually need to stop/start a swevent to re-programm
the hardware (lack of hardware to program), simply nop out these
callbacks for the swevent pmu.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1275557609.27810.35218.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>