cpuset_excl_nodes_overlap always returns 0 if current is exiting. This caused
customer's systems to panic in the OOM killer when processes were having
trouble getting memory for the final put_user in mm_release. Even though
there were lots of processes to kill.
Change to returning 1 in this case. This achieves parity with !CONFIG_CPUSETS
case, and was observed to fix the problem.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the list of cpus allowed to tasks in the top (root) cpuset to
dynamically track what cpus are online, using a CPU hotplug notifier. Make
this top cpus file read-only.
On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.
If that system does support CPU hotplug, then these tasks cannot make use
of CPUs that are added after system boot, because the CPUs are not allowed
in the top cpuset. This is a surprising regression over earlier kernels
that didn't have cpusets enabled.
In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'cpus' file in the top (root) cpuset, making it read
only, and making it automatically track the value of cpu_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
by their cpuset.
Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
driving the fix, and earlier versions of this patch.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
futex_find_get_task:
if (p->state == EXIT_ZOMBIE || p->exit_state == EXIT_ZOMBIE)
return NULL;
I can't understand this. First, p->state can't be EXIT_ZOMBIE. The
->exit_state check looks strange too. Sub-threads or tasks whose ->parent
ignores SIGCHLD go directly to EXIT_DEAD state (I am ignoring a ptrace
case). Why EXIT_DEAD tasks should be ok? Yes, EXIT_ZOMBIE is more
important (a task may stay zombie for a long time), but this doesn't mean
we should explicitely ignore other EXIT_XXX states.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sched_setscheduler() looks at ->signal->rlim[]. It is unsafe do
dereference ->signal unless tasklist_lock or ->siglock is held (or p ==
current). We pin the task structure, but this can't prevent from
release_task()->__exit_signal() which sets ->signal = NULL.
Restore tasklist_lock across the setscheduler call.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use a private lock instead. It protects all per-cpu data structures in
workqueue.c, including the workqueues list.
Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the
per-cpu resources.
Unfixed long-standing bug: if someone unplugs the CPU identified by
`singlethread_cpu' the kernel will get very sick.
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We found this issue last week w/ the -RT kernel, but it seems the same
issue is in mainline as well.
Basically it is possible for futex_unlock_pi to return without actually
freeing the lock. This is due to buggy logic in the use of
futex_handle_fault() and its attempt argument in a failure case.
Looking at futex.c the logic is as follows:
1) In futex_unlock_pi() we start w/ ret=0 and we go down to the first
futex_atomic_cmpxchg_inatomic(), where we find uval==-EFAULT. We then
jump to the pi_faulted label.
2) From pi_faulted: We increment attempt, unlock the sem and hit the
retry label.
3) From the retry label, with ret still zero, we again hit EFAULT on the
first futex_atomic_cmpxchg_inatomic(), and again goto the pi_faulted
label.
4) Again from pi_faulted: we increment attempt and enter the
conditional, where we call futex_handle_fault.
5) futex_handle_fault fails, and we goto the out_unlock_release_sem
label.
6) From out_unlock_release_sem we return, and since ret is still zero,
we return without error, while never actually unlocking the lock.
Issue #1: at the first futex_atomic_cmpxchg_inatomic() we should probably
be setting ret=-EFAULT before jumping to pi_faulted: However in our case
this doesn't really affect anything, as the glibc we're using ignores the
error value from futex_unlock_pi().
Issue #2: Look at futex_handle_fault(), its first conditional will return
-EFAULT if attempt is >= 2. However, from the "if(attempt++)
futex_handle_fault(attempt)" logic above, we'll *never* call
futex_handle_fault when attempt is less then two. So we never get a chance
to even try to fault the page in.
The following patch addresses these two issues by 1) Always setting ret to
-EFAULT if futex_handle_fault fails, and 2) Removing the = in
futex_handle_fault's (attempt >= 2) check.
I'm really not sure this is the right fix, but wanted to bring it up so
folks knew the issue is alive and well in the current -git tree. From
looking at the git logs the logic was first introduced (then later copied
to other places) in the following commit almost a year ago:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4732efbeb997189d9f9b04708dc26bf8613ed721;hp=5b039e681b8c5f30aac9cc04385cc94be45d0823
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sys_getppid() optimization can access a freed memory. On kernels with
DEBUG_SLAB turned ON, this results in Oops. As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.
So this patch always takes the lock to be safe.
[oleg@tv-sign.ru: simplifications]
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: <stable@kernel.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kernel/panic.c: In function 'add_taint':
kernel/panic.c:176: warning: implicit declaration of function 'debug_locks_off'
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The percpu variable is used incorrectly in switch_hrtimer_base().
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The recent fixups in futex.c need to be applied to futex_compat.c too. Fixes
a hang reported by Olaf.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
find_next_system_ram() is used to find available memory resource at onlining
newly added memory. This patch fixes following problem.
find_next_system_ram() cannot catch this case.
Resource: (start)-------------(end)
Section : (start)-------------(end)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
find_next_system_ram() returns valid memory range which meets requested area,
only used by memory-hot-add.
This function always rewrite requested resource even if returned area is not
fully fit in requested one. And sometimes the returnd resource is larger than
requested area. This annoyes the caller. This patch changes the returned
value to fit in requested area.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reported by: Dave Jones
Whilst printk'ing to both console and serial console, I got this...
(2.6.18rc1)
BUG: sleeping function called from invalid context at kernel/sched.c:4438
in_atomic():0, irqs_disabled():1
Call Trace:
[<ffffffff80271db8>] show_trace+0xaa/0x23d
[<ffffffff80271f60>] dump_stack+0x15/0x17
[<ffffffff8020b9f8>] __might_sleep+0xb2/0xb4
[<ffffffff8029232e>] __cond_resched+0x15/0x55
[<ffffffff80267eb8>] cond_resched+0x3b/0x42
[<ffffffff80268c64>] console_conditional_schedule+0x12/0x14
[<ffffffff80368159>] fbcon_redraw+0xf6/0x160
[<ffffffff80369c58>] fbcon_scroll+0x5d9/0xb52
[<ffffffff803a43c4>] scrup+0x6b/0xd6
[<ffffffff803a4453>] lf+0x24/0x44
[<ffffffff803a7ff8>] vt_console_print+0x166/0x23d
[<ffffffff80295528>] __call_console_drivers+0x65/0x76
[<ffffffff80295597>] _call_console_drivers+0x5e/0x62
[<ffffffff80217e3f>] release_console_sem+0x14b/0x232
[<ffffffff8036acd6>] fb_flashcursor+0x279/0x2a6
[<ffffffff80251e3f>] run_workqueue+0xa8/0xfb
[<ffffffff8024e5e0>] worker_thread+0xef/0x122
[<ffffffff8023660f>] kthread+0x100/0x136
[<ffffffff8026419e>] child_rip+0x8/0x12
This can occur when release_console_sem() is called but the log
buffer still has contents that need to be flushed. The console drivers
are called while the console_may_schedule flag is still true. The
might_sleep() is triggered when fbcon calls console_conditional_schedule().
Fix by setting console_may_schedule to zero earlier, before the call to the
console drivers.
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When delivering PTRACE_EVENT_VFORK_DONE, provide pid of the child process
when tracer calls ptrace(PTRACE_GETEVENTMSG). This is already
(accidentally) available when the tracer is tracing VFORK in addition to
VFORK_DONE.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Daniel Jacobowitz <dan@debian.org>
Cc: Albert Cahalan <acahalan@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.
On my s390x system I saw the following oops:
Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU: 0 Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([<0000000000144f90>] unqueue_me+0x40/0xe4)
[<0000000000145a0c>] do_futex+0x33c/0xc40
[<000000000014643e>] sys_futex+0x12e/0x144
[<000000000010bb00>] sysc_noemu+0x10/0x16
[<000002000003741c>] 0x2000003741c
The code in question is:
static int unqueue_me(struct futex_q *q)
{
int ret = 0;
spinlock_t *lock_ptr;
/* In the common case we don't take the spinlock, which is nice. */
retry:
lock_ptr = q->lock_ptr;
if (lock_ptr != 0) {
spin_lock(lock_ptr);
/*
* q->lock_ptr can change between reading it and
* spin_lock(), causing us to take the wrong lock. This
* corrects the race condition.
[...]
and my compiler (gcc 4.1.0) makes the following out of it:
00000000000003c8 <unqueue_me>:
3c8: eb bf f0 70 00 24 stmg %r11,%r15,112(%r15)
3ce: c0 d0 00 00 00 00 larl %r13,3ce <unqueue_me+0x6>
3d0: R_390_PC32DBL .rodata+0x2a
3d4: a7 f1 1e 00 tml %r15,7680
3d8: a7 84 00 01 je 3da <unqueue_me+0x12>
3dc: b9 04 00 ef lgr %r14,%r15
3e0: a7 fb ff d0 aghi %r15,-48
3e4: b9 04 00 b2 lgr %r11,%r2
3e8: e3 e0 f0 98 00 24 stg %r14,152(%r15)
3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11)
/* write q->lock_ptr in r12 */
3f4: b9 02 00 cc ltgr %r12,%r12
3f8: a7 84 00 4b je 48e <unqueue_me+0xc6>
/* if r12 is zero then jump over the code.... */
3fc: e3 20 b0 28 00 04 lg %r2,40(%r11)
/* write q->lock_ptr in r2 */
402: c0 e5 00 00 00 00 brasl %r14,402 <unqueue_me+0x3a>
404: R_390_PC32DBL _spin_lock+0x2
/* use r2 as parameter for spin_lock */
So the code becomes more or less:
if (q->lock_ptr != 0) spin_lock(q->lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)
Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11)
3f4: b9 02 00 cc ltgr %r12,%r12
3f8: b9 04 00 2c lgr %r2,%r12
3fc: a7 84 00 48 je 48c <unqueue_me+0xc4>
400: c0 e5 00 00 00 00 brasl %r14,400 <unqueue_me+0x38>
402: R_390_PC32DBL _spin_lock+0x2
As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q->lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q->lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@timesys.com>
Signed-off-by: Christian Borntraeger <borntrae@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It should be possible to suspend, either to RAM or to disk, if there's a
traced process that has just reached a breakpoint. However, this is a
special case, because its parent process might have been frozen already and
then we are unable to deliver the "freeze" signal to the traced process.
If this happens, it's better to cancel the freezing of the traced process.
Ref. http://bugzilla.kernel.org/show_bug.cgi?id=6787
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Michael C Thompson wrote: [Tue Aug 01 2006, 02:36:36PM EDT]
> The trigger for this oops is:
> # auditctl -a exit,always -S pread64 -F 'inode<1'
Setting the err value will fix it.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Always initialize the audit_inode_hash[] so we don't oops on list rules.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When an object is created via a symlink into an audited directory, audit misses
the event due to not having collected the inode data for the directory. Modify
__audit_inode_child() to copy the parent inode data if a parent wasn't found in
audit_names[].
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the specified path is an existing file or when it is a symlink, audit
collects the wrong inode number, which causes it to miss the open() event.
Adding a second hook to the open() path fixes this.
Also add audit_copy_inode() to consolidate some code.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>