* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: fix init_idle()'s use of sched_clock()
sched: fix stale value in average load per task
We only need the cacheline padding on SMP kernels. Saves 6k:
text data bss dec hex filename
5713 388 8840 14941 3a5d kernel/kprobes.o
5713 388 2632 8733 221d kernel/kprobes.o
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__register_kprobe() can be preempted after checking probing address but
before module_text_address() or try_module_get(), and in this interval
the module can be unloaded. In that case, try_module_get(probed_mod)
will access to invalid address, or kprobe will probe invalid address.
This patch uses preempt_disable() to protect it and uses
__module_text_address() and __kernel_text_address().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this change, control file 'freezer.state' doesn't exist in root
cgroup, making root cgroup unfreezable.
I think it's reasonable to disallow freeze tasks in the root cgroup. And
then we can avoid fork overhead when freezer subsystem is compiled but not
used.
Also make writing invalid value to freezer.state returns EINVAL rather
than EIO. This is more consistent with other cgroup subsystem.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In theory the task can be moved to another cgroup and the freezer will be
freed right after task_lock is dropped, so the lock results in zero
protection.
But in the case of freezer_fork() no lock is needed, since the task is not
in tasklist yet so it won't be moved to another cgroup, so task->cgroups
won't be changed or invalidated.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Maciej Rutecki reported:
> I have this bug during suspend to disk:
>
> [ 188.592151] Enabling non-boot CPUs ...
> [ 188.592151] SMP alternatives: switching to SMP code
> [ 188.666058] BUG: using smp_processor_id() in preemptible
> [00000000]
> code: suspend_to_disk/2934
> [ 188.666064] caller is native_sched_clock+0x2b/0x80
Which, as noted by Linus, was caused by me, via:
7cbaef9c "sched: optimize sched_clock() a bit"
Move the rq locking a bit earlier in the initialization sequence,
that will make the sched_clock() call in init_idle() non-preemptible.
Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix load balancer load average calculation accuracy
cpu_avg_load_per_task() returns a stale value when nr_running is 0.
It returns an older stale (caculated when nr_running was non zero) value.
This patch returns and sets rq->avg_load_per_task to zero when nr_running
is 0.
Compile and boot tested on a x86_64 box.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context
nohz: disable tick_nohz_kick_tick() for now
irq: call __irq_enter() before calling the tick_idle_check
x86: HPET: enter hpet_interrupt_handler with interrupts disabled
x86: HPET: read from HPET_Tn_CMP() not HPET_T0_CMP
x86: HPET: convert WARN_ON to WARN_ON_ONCE
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: release buddies on yield
fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock
sched: clean up debug info
Clear buddies on yield, so that the buddy rules don't schedule them
despite them being placed right-most.
This fixed a performance regression with yield-happy binary JVMs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Lin Ming <ming.m.lin@intel.com>
Impact: fix incorrect locking triggered during hotplug-intense stress-tests
While migrating the the CB_IRQSAFE_UNLOCKED timers during a cpu-offline,
we queue them on the cb_pending list, so that they won't go
stale.
Thus, when the callbacks of the timers run from the softirq context,
they could run into potential deadlocks, since these callbacks
assume that they're running with irq's disabled, thereby annoying
lockdep!
Fix this by emulating hardirq context while running these callbacks from
the hrtimer softirq.
=================================
[ INFO: inconsistent lock state ]
2.6.27 #2
--------------------------------
inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
ksoftirqd/0/4 [HC0[0]:SC1[1]:HE1:SE0] takes:
(&rq->lock){++..}, at: [<c011db84>] sched_rt_period_timer+0x9e/0x1fc
{in-hardirq-W} state was registered at:
[<c014103c>] __lock_acquire+0x549/0x121e
[<c0107890>] native_sched_clock+0x88/0x99
[<c013aa12>] clocksource_get_next+0x39/0x3f
[<c0139abc>] update_wall_time+0x616/0x7df
[<c0141d6b>] lock_acquire+0x5a/0x74
[<c0121724>] scheduler_tick+0x3a/0x18d
[<c047ed45>] _spin_lock+0x1c/0x45
[<c0121724>] scheduler_tick+0x3a/0x18d
[<c0121724>] scheduler_tick+0x3a/0x18d
[<c012c436>] update_process_times+0x3a/0x44
[<c013c044>] tick_periodic+0x63/0x6d
[<c013c062>] tick_handle_periodic+0x14/0x5e
[<c010568c>] timer_interrupt+0x44/0x4a
[<c0150c9f>] handle_IRQ_event+0x13/0x3d
[<c0151c14>] handle_level_irq+0x79/0xbd
[<c0105634>] do_IRQ+0x69/0x7d
[<c01041e4>] common_interrupt+0x28/0x30
[<c047007b>] aac_probe_one+0x1a3/0x3f3
[<c047ec2d>] _spin_unlock_irqrestore+0x36/0x39
[<c01512b4>] setup_irq+0x1be/0x1f9
[<c065d70b>] start_kernel+0x259/0x2c5
[<ffffffff>] 0xffffffff
irq event stamp: 50102
hardirqs last enabled at (50102): [<c047ebf4>] _spin_unlock_irq+0x20/0x23
hardirqs last disabled at (50101): [<c047edc2>] _spin_lock_irq+0xa/0x4b
softirqs last enabled at (50088): [<c0128ba6>] do_softirq+0x37/0x4d
softirqs last disabled at (50099): [<c0128ba6>] do_softirq+0x37/0x4d
other info that might help us debug this:
no locks held by ksoftirqd/0/4.
stack backtrace:
Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.27 #2
[<c013f6cb>] print_usage_bug+0x13e/0x147
[<c013fef5>] mark_lock+0x493/0x797
[<c01410b1>] __lock_acquire+0x5be/0x121e
[<c0141d6b>] lock_acquire+0x5a/0x74
[<c011db84>] sched_rt_period_timer+0x9e/0x1fc
[<c047ed45>] _spin_lock+0x1c/0x45
[<c011db84>] sched_rt_period_timer+0x9e/0x1fc
[<c011db84>] sched_rt_period_timer+0x9e/0x1fc
[<c01210fd>] finish_task_switch+0x41/0xbd
[<c0107890>] native_sched_clock+0x88/0x99
[<c011dae6>] sched_rt_period_timer+0x0/0x1fc
[<c0136dda>] run_hrtimer_pending+0x54/0xe5
[<c011dae6>] sched_rt_period_timer+0x0/0x1fc
[<c0128afb>] __do_softirq+0x7b/0xef
[<c0128ba6>] do_softirq+0x37/0x4d
[<c0128c12>] ksoftirqd+0x56/0xc5
[<c0128bbc>] ksoftirqd+0x0/0xc5
[<c0134649>] kthread+0x38/0x5d
[<c0134611>] kthread+0x0/0x5d
[<c0104477>] kernel_thread_helper+0x7/0x10
=======================
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix hang/crash on ia64 under high load
This is ugly, but the simplest patch by far.
Unlike other similar routines, account_group_exec_runtime() could be
called "implicitly" from within scheduler after exit_notify(). This
means we can race with the parent doing release_task(), we can't just
check ->signal != NULL.
Change __exit_signal() to do spin_unlock_wait(&task_rq(tsk)->lock)
before __cleanup_signal() to make sure ->signal can't be freed under
task_rq(tsk)->lock. Note that task_rq_unlock_wait() doesn't care
about the case when tsk changes cpu/rq under us, this should be OK.
Thanks to Ingo who nacked my previous buggy patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Doug Chapman <doug.chapman@hp.com>
Impact: removal of unnecessary looping
The lockless part of the ring buffer allows for reentry into the code
from interrupts. A timestamp is taken, a test is preformed and if it
detects that an interrupt occurred that did tracing, it tries again.
The problem arises if the timestamp code itself causes a trace.
The detection will detect this and loop again. The difference between
this and an interrupt doing tracing, is that this will fail every time,
and cause an infinite loop.
Currently, we test if the loop happens 1000 times, and if so, it will
produce a warning and disable the ring buffer.
The problem with this approach is that it makes it difficult to perform
some types of tracing (tracing the timestamp code itself).
Each trace entry has a delta timestamp from the previous entry.
If a trace entry is reserved but and interrupt occurs and traces before
the previous entry is commited, the delta timestamp for that entry will
be zero. This actually makes sense in terms of tracing, because the
interrupt entry happened before the preempted entry was commited, so
one may consider the two happening at the same time. The order is
still preserved in the buffer.
With this idea, instead of trying to get a new timestamp if an interrupt
made it in between the timestamp and the test, the entry could simply
make the delta zero and continue. This will prevent interrupts or
tracers in the timer code from causing the above loop.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix for bug on resize
This patch addresses the bug found here:
http://bugzilla.kernel.org/show_bug.cgi?id=11996
When ftrace converted to the new unified trace buffer, the resizing of
the buffer was not protected as much as it was originally. If tracing
is performed while the resize occurs, then the buffer can be corrupted.
This patch disables all ftrace buffer modifications before a resize
takes place.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: nohz powersavings and wakeup regression
commit fb02fbc14d (NOHZ: restart tick
device from irq_enter()) causes a serious wakeup regression.
While the patch is correct it does not take into account that spurious
wakeups happen on x86. A fix for this issue is available, but we just
revert to the .27 behaviour and let long running softirqs screw
themself.
Disable it for now.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: avoid spurious ksoftirqd wakeups
The tick idle check which is called from irq_enter() was run before
the call to __irq_enter() which did not set the in_interrupt() bits in
preempt_count. That way the raise of a softirq woke up softirqd for
nothing as the softirq was handled on return from interrupt.
Call __irq_enter() before calling into the tick idle check code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up and fix debug info printout
While looking over the sched_debug code I noticed that we printed the rq
schedstats for every cfs_rq, ammend this.
Also change nr_spead_over into an int, and fix a little buglet in
min_vruntime printing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
cpumask: introduce new API, without changing anything, v3
cpumask: new API, v2
cpumask: introduce new API, without changing anything
Impact: fix rare memory leak in the sched-domains manual reconfiguration code
In the failure path, rd is not attached to a sched domain,
so it causes a leak.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
Block: use round_jiffies_up()
Add round_jiffies_up and related routines
block: fix __blkdev_get() for removable devices
generic-ipi: fix the smp_mb() placement
blk: move blk_delete_timer call in end_that_request_last
block: add timer on blkdev_dequeue_request() not elv_next_request()
bio: define __BIOVEC_PHYS_MERGEABLE
block: remove unused ll_new_mergeable()
This fixes an oops when reading /proc/sched_debug.
A cgroup won't be removed completely until finishing cgroup_diput(), so we
shouldn't invalidate cgrp->dentry in cgroup_rmdir(). Otherwise, when a
group is being removed while cgroup_path() gets called, we may trigger
NULL dereference BUG.
The bug can be reproduced:
# cat test.sh
#!/bin/sh
mount -t cgroup -o cpu xxx /mnt
for (( ; ; ))
{
mkdir /mnt/sub
rmdir /mnt/sub
}
# ./test.sh &
# cat /proc/sched_debug
BUG: unable to handle kernel NULL pointer dereference at 00000038
IP: [<c045a47f>] cgroup_path+0x39/0x90
...
Call Trace:
[<c0420344>] ? print_cfs_rq+0x6e/0x75d
[<c0421160>] ? sched_debug_show+0x72d/0xc1e
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org> [2.6.26.x, 2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>