Use a private lock instead. It protects all per-cpu data structures in
workqueue.c, including the workqueues list.
Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the
per-cpu resources.
Unfixed long-standing bug: if someone unplugs the CPU identified by
`singlethread_cpu' the kernel will get very sick.
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We found this issue last week w/ the -RT kernel, but it seems the same
issue is in mainline as well.
Basically it is possible for futex_unlock_pi to return without actually
freeing the lock. This is due to buggy logic in the use of
futex_handle_fault() and its attempt argument in a failure case.
Looking at futex.c the logic is as follows:
1) In futex_unlock_pi() we start w/ ret=0 and we go down to the first
futex_atomic_cmpxchg_inatomic(), where we find uval==-EFAULT. We then
jump to the pi_faulted label.
2) From pi_faulted: We increment attempt, unlock the sem and hit the
retry label.
3) From the retry label, with ret still zero, we again hit EFAULT on the
first futex_atomic_cmpxchg_inatomic(), and again goto the pi_faulted
label.
4) Again from pi_faulted: we increment attempt and enter the
conditional, where we call futex_handle_fault.
5) futex_handle_fault fails, and we goto the out_unlock_release_sem
label.
6) From out_unlock_release_sem we return, and since ret is still zero,
we return without error, while never actually unlocking the lock.
Issue #1: at the first futex_atomic_cmpxchg_inatomic() we should probably
be setting ret=-EFAULT before jumping to pi_faulted: However in our case
this doesn't really affect anything, as the glibc we're using ignores the
error value from futex_unlock_pi().
Issue #2: Look at futex_handle_fault(), its first conditional will return
-EFAULT if attempt is >= 2. However, from the "if(attempt++)
futex_handle_fault(attempt)" logic above, we'll *never* call
futex_handle_fault when attempt is less then two. So we never get a chance
to even try to fault the page in.
The following patch addresses these two issues by 1) Always setting ret to
-EFAULT if futex_handle_fault fails, and 2) Removing the = in
futex_handle_fault's (attempt >= 2) check.
I'm really not sure this is the right fix, but wanted to bring it up so
folks knew the issue is alive and well in the current -git tree. From
looking at the git logs the logic was first introduced (then later copied
to other places) in the following commit almost a year ago:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4732efbeb997189d9f9b04708dc26bf8613ed721;hp=5b039e681b8c5f30aac9cc04385cc94be45d0823
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sys_getppid() optimization can access a freed memory. On kernels with
DEBUG_SLAB turned ON, this results in Oops. As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.
So this patch always takes the lock to be safe.
[oleg@tv-sign.ru: simplifications]
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: <stable@kernel.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kernel/panic.c: In function 'add_taint':
kernel/panic.c:176: warning: implicit declaration of function 'debug_locks_off'
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The percpu variable is used incorrectly in switch_hrtimer_base().
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The recent fixups in futex.c need to be applied to futex_compat.c too. Fixes
a hang reported by Olaf.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
find_next_system_ram() is used to find available memory resource at onlining
newly added memory. This patch fixes following problem.
find_next_system_ram() cannot catch this case.
Resource: (start)-------------(end)
Section : (start)-------------(end)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
find_next_system_ram() returns valid memory range which meets requested area,
only used by memory-hot-add.
This function always rewrite requested resource even if returned area is not
fully fit in requested one. And sometimes the returnd resource is larger than
requested area. This annoyes the caller. This patch changes the returned
value to fit in requested area.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reported by: Dave Jones
Whilst printk'ing to both console and serial console, I got this...
(2.6.18rc1)
BUG: sleeping function called from invalid context at kernel/sched.c:4438
in_atomic():0, irqs_disabled():1
Call Trace:
[<ffffffff80271db8>] show_trace+0xaa/0x23d
[<ffffffff80271f60>] dump_stack+0x15/0x17
[<ffffffff8020b9f8>] __might_sleep+0xb2/0xb4
[<ffffffff8029232e>] __cond_resched+0x15/0x55
[<ffffffff80267eb8>] cond_resched+0x3b/0x42
[<ffffffff80268c64>] console_conditional_schedule+0x12/0x14
[<ffffffff80368159>] fbcon_redraw+0xf6/0x160
[<ffffffff80369c58>] fbcon_scroll+0x5d9/0xb52
[<ffffffff803a43c4>] scrup+0x6b/0xd6
[<ffffffff803a4453>] lf+0x24/0x44
[<ffffffff803a7ff8>] vt_console_print+0x166/0x23d
[<ffffffff80295528>] __call_console_drivers+0x65/0x76
[<ffffffff80295597>] _call_console_drivers+0x5e/0x62
[<ffffffff80217e3f>] release_console_sem+0x14b/0x232
[<ffffffff8036acd6>] fb_flashcursor+0x279/0x2a6
[<ffffffff80251e3f>] run_workqueue+0xa8/0xfb
[<ffffffff8024e5e0>] worker_thread+0xef/0x122
[<ffffffff8023660f>] kthread+0x100/0x136
[<ffffffff8026419e>] child_rip+0x8/0x12
This can occur when release_console_sem() is called but the log
buffer still has contents that need to be flushed. The console drivers
are called while the console_may_schedule flag is still true. The
might_sleep() is triggered when fbcon calls console_conditional_schedule().
Fix by setting console_may_schedule to zero earlier, before the call to the
console drivers.
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When delivering PTRACE_EVENT_VFORK_DONE, provide pid of the child process
when tracer calls ptrace(PTRACE_GETEVENTMSG). This is already
(accidentally) available when the tracer is tracing VFORK in addition to
VFORK_DONE.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Daniel Jacobowitz <dan@debian.org>
Cc: Albert Cahalan <acahalan@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.
On my s390x system I saw the following oops:
Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU: 0 Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([<0000000000144f90>] unqueue_me+0x40/0xe4)
[<0000000000145a0c>] do_futex+0x33c/0xc40
[<000000000014643e>] sys_futex+0x12e/0x144
[<000000000010bb00>] sysc_noemu+0x10/0x16
[<000002000003741c>] 0x2000003741c
The code in question is:
static int unqueue_me(struct futex_q *q)
{
int ret = 0;
spinlock_t *lock_ptr;
/* In the common case we don't take the spinlock, which is nice. */
retry:
lock_ptr = q->lock_ptr;
if (lock_ptr != 0) {
spin_lock(lock_ptr);
/*
* q->lock_ptr can change between reading it and
* spin_lock(), causing us to take the wrong lock. This
* corrects the race condition.
[...]
and my compiler (gcc 4.1.0) makes the following out of it:
00000000000003c8 <unqueue_me>:
3c8: eb bf f0 70 00 24 stmg %r11,%r15,112(%r15)
3ce: c0 d0 00 00 00 00 larl %r13,3ce <unqueue_me+0x6>
3d0: R_390_PC32DBL .rodata+0x2a
3d4: a7 f1 1e 00 tml %r15,7680
3d8: a7 84 00 01 je 3da <unqueue_me+0x12>
3dc: b9 04 00 ef lgr %r14,%r15
3e0: a7 fb ff d0 aghi %r15,-48
3e4: b9 04 00 b2 lgr %r11,%r2
3e8: e3 e0 f0 98 00 24 stg %r14,152(%r15)
3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11)
/* write q->lock_ptr in r12 */
3f4: b9 02 00 cc ltgr %r12,%r12
3f8: a7 84 00 4b je 48e <unqueue_me+0xc6>
/* if r12 is zero then jump over the code.... */
3fc: e3 20 b0 28 00 04 lg %r2,40(%r11)
/* write q->lock_ptr in r2 */
402: c0 e5 00 00 00 00 brasl %r14,402 <unqueue_me+0x3a>
404: R_390_PC32DBL _spin_lock+0x2
/* use r2 as parameter for spin_lock */
So the code becomes more or less:
if (q->lock_ptr != 0) spin_lock(q->lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)
Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11)
3f4: b9 02 00 cc ltgr %r12,%r12
3f8: b9 04 00 2c lgr %r2,%r12
3fc: a7 84 00 48 je 48c <unqueue_me+0xc4>
400: c0 e5 00 00 00 00 brasl %r14,400 <unqueue_me+0x38>
402: R_390_PC32DBL _spin_lock+0x2
As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q->lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q->lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@timesys.com>
Signed-off-by: Christian Borntraeger <borntrae@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It should be possible to suspend, either to RAM or to disk, if there's a
traced process that has just reached a breakpoint. However, this is a
special case, because its parent process might have been frozen already and
then we are unable to deliver the "freeze" signal to the traced process.
If this happens, it's better to cancel the freezing of the traced process.
Ref. http://bugzilla.kernel.org/show_bug.cgi?id=6787
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Michael C Thompson wrote: [Tue Aug 01 2006, 02:36:36PM EDT]
> The trigger for this oops is:
> # auditctl -a exit,always -S pread64 -F 'inode<1'
Setting the err value will fix it.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Always initialize the audit_inode_hash[] so we don't oops on list rules.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When an object is created via a symlink into an audited directory, audit misses
the event due to not having collected the inode data for the directory. Modify
__audit_inode_child() to copy the parent inode data if a parent wasn't found in
audit_names[].
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the specified path is an existing file or when it is a symlink, audit
collects the wrong inode number, which causes it to miss the open() event.
Adding a second hook to the open() path fixes this.
Also add audit_copy_inode() to consolidate some code.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Suresh points out that commit b0423a0d9c
broke the semantics of a synchronous signal like SIGSEGV occurring
recursively inside its own handler handler (or, indeed, any other
context when the signal was blocked).
That was unintentional, and this fixes things up by reinstating the old
semantics, but without reverting the cleanups.
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kernel/timer.c defines a (per-cpu) pointer to tvec_base_t, but initializes
it using { &a_tvec_base_t }, which sparse warns about; change this to just
&a_tvec_base_t.
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In order to prevent Doc Rot, this patch adds a reference to the design
document for rtmutex.c in rtmutex.c. So when someone needs to update or
change the design of that file they will know that a document actually
exists that explains the design (helping them change it), and hopefully
that they will update the document if they too change the design.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The recent changes from irqtrace feature has added overheads to
local_bh_disable and local_bh_enable that reduces UDP performance across
x86_64 and IA64, even though IA64 does not support the irqtrace feature.
Patch in question is
[PATCH]lockdep: irqtrace subsystem, core
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c
ommit;h=de30a2b355ea85350ca2f58f3b9bf4e5bc007986
Prior to this patch, local_bh_disable was a short macro. Now it is a
function which calls __local_bh_disable with added irq flags save and
restore. The irq flags save and restore were also added to
local_bh_enable, probably for injecting the trace irqs code.
This overhead is on the generic code path across all architectures. On a
IA_64 test machine (Itanium-2 1.6 GHz) running a benchmark like netperf's
UDP streaming test, the added overhead results in a drop of 3% in
throughput, as udp_sendmsg calls the local_bh_enable/disable several times.
Other workloads that have heavy usages of local_bh_enable/disable could
also be affected. The patch ideally should not have affected IA-64
performance as it does not have IRQ tracing support. A significant portion
of the overhead is in the added irq flags save and restore, which I think
is not needed if IRQ tracing is unused. A suggested patch is attached
below that recovers the lost performance. However, the "ifdef"s in the
patch are a bit ugly.
Signed-off-by: Tim Chen <tim.c.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initialize init task's pi_waiters plist. Otherwise cpu hotplug of cpu 0
might crash, since rt_mutex_getprio() accesses an uninitialized list head.
call chain which led to crash:
take_cpu_down
sched_idle_next
__setscheduler
rt_mutex_getprio
Using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work unfortunately,
since the pi_waiters member is only conditionally present.
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>