Commit Graph

3901 Commits

Author SHA1 Message Date
Linus Torvalds
a7f75d3bed Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: re-tune NUMA topologies
  sched: stop wake_affine from causing serious imbalance
  sched: fix sched_clock_cpu()
  revert ("sched: fair-group: SMP-nice for group scheduling")
  sched: cleanup
  show_schedstat(): fix memleak
  sched: unite unlikely pairs in rt_policy() and schedule_debug()
  revert ("sched: fair: weight calculations")
2008-05-29 09:26:17 -07:00
Ingo Molnar
6715930654 Merge commit 'linus/master' into sched-fixes-for-linus 2008-05-29 16:05:05 +02:00
Mike Galbraith
b3137bc8e7 sched: stop wake_affine from causing serious imbalance
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:29:20 +02:00
Peter Zijlstra
a381759d6a sched: fix sched_clock_cpu()
Make sched_clock_cpu() return 0 before it has been initialized and avoid
corrupting its state due to doing so.

This fixes the weird printk timestamp jump reported.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-05-29 11:29:19 +02:00
Ingo Molnar
6363ca57c7 revert ("sched: fair-group: SMP-nice for group scheduling")
Yanmin Zhang reported:

Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.

With bisect, I located the following patch:

| 18d95a2832 is first bad commit
| commit 18d95a2832
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair-group: SMP-nice for group scheduling

Revert it so that we get v2.6.25 behavior.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:28:57 +02:00
Ingo Molnar
4285f594f8 sched: cleanup
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:15 +02:00
Adrian Bunk
c6fba5451a show_schedstat(): fix memleak
The Coverity checker spotted a memleak introduced by commit
39106dcf85 (cpumask: use new cpus_scnprintf
function).

It seems the kfree() got lost between v2 and v3 of this patch...

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:15 +02:00
Roel Kluin
3f33a7ce95 sched: unite unlikely pairs in rt_policy() and schedule_debug()
Removes obfuscation and may improve assembly.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:14 +02:00
Ingo Molnar
f9305d4a09 revert ("sched: fair: weight calculations")
Yanmin Zhang reported:

Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many
regressions with 2.6.26-rc1:

 1) 8-core stoakley: 28%;
 2) 16-core tigerton: 20%;
 3) Itanium Montvale: 50%.

Bisect located this patch:

| 8f1bc385cf is first bad commit
| commit 8f1bc385cf
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair: weight calculations

Revert it to the 2.6.25 state.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:24:01 +02:00
Tom Zanussi
a82c53a0e3 splice: fix sendfile() issue with relay
Splice isn't always incrementing the ppos correctly, which broke
relay splice.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Oleg Nesterov
cbaffba12c posix timers: discard SI_TIMER signals on exec
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.

As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:07 -07:00
Oleg Nesterov
c8e85b4f4b posix timers: sigqueue_free: don't free sigqueue if it is queued
Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the
pending signal. This is not consistent, the task should either receive the
"full" signal along with siginfo_t, or it shouldn't receive the signal at all.

Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list
if it is queued.

This is a user-visible change. If the signal is blocked, it stays queued
after sys_timer_delete() until unblocked with the "stale" si_code/si_value,
and of course it is still counted wrt RLIMIT_SIGPENDING which also limits
the number of posix timers.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:06 -07:00
Cedric Le Goater
5c02b57578 cgroups: remove node_ prefix_from ns subsystem
This is a slight change in the namespace cgroup subsystem api.

The change is that previously when cgroup_clone() was called (currently
only from the unshare path in ns_proxy cgroup, you'd get a new group named
"node_$pid" whereas now you'll get a group named after just your pid.)

The only users who would notice it are those who are using the ns_proxy
cgroup subsystem to auto-create cgroups when namespaces are unshared -
something of an experimental feature, which I think really needs more
complete container/namespace support in order to be useful.  I suspect the
only users are Cedric and Serge, or maybe a few others on
containers@lists.linux-foundation.org.  And in fact it would only be
noticed by the users who make the assumption about how the name is
generated, rather than getting it from the /proc/<pid>/cgroups file for
the process in question.

Whether the change is actually needed or not I'm fairly agnostic on, but I
guess it is more elegant to just use the pid as the new group name rather
than adding a fairly arbitrary "node_" prefix on the front.

[menage@google.com: provided changelog]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Paul Menage" <menage@google.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:14 -07:00
Shi Weihua
7b26655f62 sys_prctl(): fix return of uninitialized value
If none of the switch cases match, the PR_SET_PDEATHSIG and
PR_SET_DUMPABLE cases of the switch statement will never write to local
variable `error'.

Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Cc: Andrew G. Morgan <morgan@kernel.org>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:13 -07:00
Oleg Nesterov
da7978b034 signals: fix sigqueue_free() vs __exit_signal() race
__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock.
This can race with another thread doing sigqueue_free(), we can free the
same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list.

Note that even sys_exit_group() can trigger this race, not only
sys_timer_delete().

Move the callsite of flush_sigqueue(tsk->pending) under ->siglock.

This patch doesn't touch flush_sigqueue(->shared_pending) below, it is
called when there are no other threads which can play with signals, and
sigqueue_free() can't be used outside of our thread group.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Christian Borntraeger
3401a61e16 stop_machine: make stop_machine_run more virtualization friendly
On kvm I have seen some rare hangs in stop_machine when I used more guest
cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the
hang quite often. I could also reproduce the problem on a 4 way z/VM host with
a 64 way guest.

It turned out that the guest was consuming all available cpus mostly for
spinning on scheduler locks like rq->lock. This is expected as the threads are
calling yield all the time.
The problem is now, that the host scheduling decisings together with the guest
scheduling decisions and spinlocks not being fair managed to create an
interesting scenario similar to a live lock. (Sometimes the hang resolved
itself after some minutes)

Changing stop_machine to yield the cpu to the hypervisor when yielding inside
the guest fixed the problem for me. While I am not completely happy with this
patch, I think it causes no harm and it really improves the situation for me.

I used cpu_relax for yielding to the hypervisor, does that work on all
architectures?

p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use
stop_machine_run and both triggered the problem after some retries.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:34 +10:00
Denis V. Lunev
34e4e2fef4 modules: proper cleanup of kobject without CONFIG_SYSFS
kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called.
------------[ cut here ]------------
WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55()
Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen]
comm: rmmod Tainted: G        W 2.6.26-rc3 #585
Call Trace:
  [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a
  [<ffffffff80236aca>] ? printk+0x67/0x69
  [<ffffffff80236aca>] ? printk+0x67/0x69
  [<ffffffff80324289>] kobject_put+0x53/0x55
  [<ffffffff8025e2ee>] free_module+0x87/0xfa
  [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1
  [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67
  [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a
  [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80
---[ end trace 8f5aafa7f6406cf8 ]---

mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call
kobject_put in this case.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:33 +10:00
Cyrill Gorcunov
c4ea6fcf5a module loading ELF handling: use SELFMAG instead of numeric constant
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:32 +10:00
Linus Torvalds
16ae527bfa Merge branch 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] list_for_each_rcu must die: audit
  [patch 1/1] audit_send_reply(): fix error-path memory leak
  [PATCH] open sessionid permissions
2008-05-19 16:38:10 -07:00
Paul E. McKenney
6793a051fb [PATCH] list_for_each_rcu must die: audit
All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu().  This patch makes this change
for the Audit system, in preparation for removing the list_for_each_rcu()
API entirely.  This time with well-formed SOB.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:30:23 -04:00
Andrew Morton
fcaf1eb868 [patch 1/1] audit_send_reply(): fix error-path memory leak
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663

Reporter: Daniel Marjamki <danielm77@spray.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:30:22 -04:00
Al Viro
eceea0b3df [PATCH] avoid multiplication overflows and signedness issues for max_fds
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:52 -04:00
Al Viro
02afc6267f [PATCH] dup_fd() fixes, part 1
Move the sucker to fs/file.c in preparation to the rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:26 -04:00
Harvey Harrison
3fc957721d lib: create common ascii hex array
Add a common hex array in hexdump.c so everyone can use it.

Add a common hi/lo helper to avoid the shifting masking that is
done to get the upper and lower nibbles of a byte value.

Pull the pack_hex_byte helper from kgdb as it is opencoded many
places in the tree that will be consolidated.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Mirco Tischler
0c70814c31 cgroups: fix compile warning
Return type of cpu_rt_runtime_write() should be int instead of ssize_t.

Signed-off-by: Mirco Tischler <mt-ml@gmx.de>
Acked-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00