Commit Graph

3892 Commits

Author SHA1 Message Date
Tom Zanussi
a82c53a0e3 splice: fix sendfile() issue with relay
Splice isn't always incrementing the ppos correctly, which broke
relay splice.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Oleg Nesterov
cbaffba12c posix timers: discard SI_TIMER signals on exec
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.

As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:07 -07:00
Oleg Nesterov
c8e85b4f4b posix timers: sigqueue_free: don't free sigqueue if it is queued
Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the
pending signal. This is not consistent, the task should either receive the
"full" signal along with siginfo_t, or it shouldn't receive the signal at all.

Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list
if it is queued.

This is a user-visible change. If the signal is blocked, it stays queued
after sys_timer_delete() until unblocked with the "stale" si_code/si_value,
and of course it is still counted wrt RLIMIT_SIGPENDING which also limits
the number of posix timers.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:06 -07:00
Cedric Le Goater
5c02b57578 cgroups: remove node_ prefix_from ns subsystem
This is a slight change in the namespace cgroup subsystem api.

The change is that previously when cgroup_clone() was called (currently
only from the unshare path in ns_proxy cgroup, you'd get a new group named
"node_$pid" whereas now you'll get a group named after just your pid.)

The only users who would notice it are those who are using the ns_proxy
cgroup subsystem to auto-create cgroups when namespaces are unshared -
something of an experimental feature, which I think really needs more
complete container/namespace support in order to be useful.  I suspect the
only users are Cedric and Serge, or maybe a few others on
containers@lists.linux-foundation.org.  And in fact it would only be
noticed by the users who make the assumption about how the name is
generated, rather than getting it from the /proc/<pid>/cgroups file for
the process in question.

Whether the change is actually needed or not I'm fairly agnostic on, but I
guess it is more elegant to just use the pid as the new group name rather
than adding a fairly arbitrary "node_" prefix on the front.

[menage@google.com: provided changelog]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Paul Menage" <menage@google.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:14 -07:00
Shi Weihua
7b26655f62 sys_prctl(): fix return of uninitialized value
If none of the switch cases match, the PR_SET_PDEATHSIG and
PR_SET_DUMPABLE cases of the switch statement will never write to local
variable `error'.

Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Cc: Andrew G. Morgan <morgan@kernel.org>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:13 -07:00
Oleg Nesterov
da7978b034 signals: fix sigqueue_free() vs __exit_signal() race
__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock.
This can race with another thread doing sigqueue_free(), we can free the
same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list.

Note that even sys_exit_group() can trigger this race, not only
sys_timer_delete().

Move the callsite of flush_sigqueue(tsk->pending) under ->siglock.

This patch doesn't touch flush_sigqueue(->shared_pending) below, it is
called when there are no other threads which can play with signals, and
sigqueue_free() can't be used outside of our thread group.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Christian Borntraeger
3401a61e16 stop_machine: make stop_machine_run more virtualization friendly
On kvm I have seen some rare hangs in stop_machine when I used more guest
cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the
hang quite often. I could also reproduce the problem on a 4 way z/VM host with
a 64 way guest.

It turned out that the guest was consuming all available cpus mostly for
spinning on scheduler locks like rq->lock. This is expected as the threads are
calling yield all the time.
The problem is now, that the host scheduling decisings together with the guest
scheduling decisions and spinlocks not being fair managed to create an
interesting scenario similar to a live lock. (Sometimes the hang resolved
itself after some minutes)

Changing stop_machine to yield the cpu to the hypervisor when yielding inside
the guest fixed the problem for me. While I am not completely happy with this
patch, I think it causes no harm and it really improves the situation for me.

I used cpu_relax for yielding to the hypervisor, does that work on all
architectures?

p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use
stop_machine_run and both triggered the problem after some retries.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:34 +10:00
Denis V. Lunev
34e4e2fef4 modules: proper cleanup of kobject without CONFIG_SYSFS
kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called.
------------[ cut here ]------------
WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55()
Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen]
comm: rmmod Tainted: G        W 2.6.26-rc3 #585
Call Trace:
  [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a
  [<ffffffff80236aca>] ? printk+0x67/0x69
  [<ffffffff80236aca>] ? printk+0x67/0x69
  [<ffffffff80324289>] kobject_put+0x53/0x55
  [<ffffffff8025e2ee>] free_module+0x87/0xfa
  [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1
  [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67
  [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a
  [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80
---[ end trace 8f5aafa7f6406cf8 ]---

mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call
kobject_put in this case.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:33 +10:00
Cyrill Gorcunov
c4ea6fcf5a module loading ELF handling: use SELFMAG instead of numeric constant
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:32 +10:00
Linus Torvalds
16ae527bfa Merge branch 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] list_for_each_rcu must die: audit
  [patch 1/1] audit_send_reply(): fix error-path memory leak
  [PATCH] open sessionid permissions
2008-05-19 16:38:10 -07:00
Paul E. McKenney
6793a051fb [PATCH] list_for_each_rcu must die: audit
All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu().  This patch makes this change
for the Audit system, in preparation for removing the list_for_each_rcu()
API entirely.  This time with well-formed SOB.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:30:23 -04:00
Andrew Morton
fcaf1eb868 [patch 1/1] audit_send_reply(): fix error-path memory leak
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663

Reporter: Daniel Marjamki <danielm77@spray.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:30:22 -04:00
Al Viro
eceea0b3df [PATCH] avoid multiplication overflows and signedness issues for max_fds
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:52 -04:00
Al Viro
02afc6267f [PATCH] dup_fd() fixes, part 1
Move the sucker to fs/file.c in preparation to the rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:26 -04:00
Harvey Harrison
3fc957721d lib: create common ascii hex array
Add a common hex array in hexdump.c so everyone can use it.

Add a common hi/lo helper to avoid the shifting masking that is
done to get the upper and lower nibbles of a byte value.

Pull the pack_hex_byte helper from kgdb as it is opencoded many
places in the tree that will be consolidated.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Mirco Tischler
0c70814c31 cgroups: fix compile warning
Return type of cpu_rt_runtime_write() should be int instead of ssize_t.

Signed-off-by: Mirco Tischler <mt-ml@gmx.de>
Acked-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Linus Torvalds
c3921ab715 Add new 'cond_resched_bkl()' helper function
It acts exactly like a regular 'cond_resched()', but will not get
optimized away when CONFIG_PREEMPT is set.

Normal kernel code is already preemptable in the presense of
CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
02b67cc3ba "sched: do not do
cond_resched() when CONFIG_PREEMPT").

But when wanting to conditionally reschedule while holding a lock, you
need to use "cond_sched_lock(lock)", and the new function is the BKL
equivalent of that.

Also make fs/locks.c use it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-11 16:04:48 -07:00
Linus Torvalds
8e3e076c5a BKL: revert back to the old spinlock implementation
The generic semaphore rewrite had a huge performance regression on AIM7
(and potentially other BKL-heavy benchmarks) because the generic
semaphores had been rewritten to be simple to understand and fair.  The
latter, in particular, turns a semaphore-based BKL implementation into a
mess of scheduling.

The attempt to fix the performance regression failed miserably (see the
previous commit 00b41ec261 'Revert
"semaphore: fix"'), and so for now the simple and sane approach is to
instead just go back to the old spinlock-based BKL implementation that
never had any issues like this.

This patch also has the advantage of being reported to fix the
regression completely according to Yanmin Zhang, unlike the semaphore
hack which still left a couple percentage point regression.

As a spinlock, the BKL obviously has the potential to be a latency
issue, but it's not really any different from any other spinlock in that
respect.  We do want to get rid of the BKL asap, but that has been the
plan for several years.

These days, the biggest users are in the tty layer (open/release in
particular) and Alan holds out some hope:

  "tty release is probably a few months away from getting cured - I'm
   afraid it will almost certainly be the very last user of the BKL in
   tty to get fixed as it depends on everything else being sanely locked."

so while we're not there yet, we do have a plan of action.

Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10 20:58:02 -07:00
Linus Torvalds
00b41ec261 Revert "semaphore: fix"
This reverts commit bf726eab37, as it has
been reported to cause a regression with processes stuck in __down(),
apparently because some missing wakeup.

Quoth Sven Wegener:
 "I'm currently investigating a regression that has showed up with my
  last git pull yesterday.  Bisecting the commits showed bf726e
  "semaphore: fix" to be the culprit, reverting it fixed the issue.

  Symptoms: During heavy filesystem usage (e.g.  a kernel compile) I get
  several compiler processes in uninterruptible sleep, blocking all i/o
  on the filesystem.  System is an Intel Core 2 Quad running a 64bit
  kernel and userspace.  Filesystem is xfs on top of lvm.  See below for
  the output of sysrq-w."

See

	http://lkml.org/lkml/2008/5/10/45

for full report.

In the meantime, we can just fix the BKL performance regression by
reverting back to the good old BKL spinlock implementation instead,
since any sleeping lock will generally perform badly, especially if it
tries to be fair.

Reported-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10 20:43:22 -07:00
Rusty Russell
91e37a793b module: don't ignore vermagic string if module doesn't have modversions
Linus found a logic bug: we ignore the version number in a module's
vermagic string if we have CONFIG_MODVERSIONS set, but modversions
also lets through a module with no __versions section for modprobe
--force (with tainting, but still).

We should only ignore the start of the vermagic string if the module
actually *has* crcs to check.  Rather than (say) having an
entertaining hissy fit and creating a config option to work around the
buggy code.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09 07:45:18 -07:00
Rusty Russell
a5dd697074 module: be more picky about allowing missing module versions
We allow missing __versions sections, because modprobe --force strips
it.  It makes less sense to allow sections where there's no version
for a specific symbol the module uses, so disallow that.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09 07:45:18 -07:00
Linus Torvalds
c4f51b4662 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes:
  sched: fix weight calculations
  semaphore: fix
2008-05-08 11:31:07 -07:00
Linus Torvalds
7a34912d90 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Revert "relay: fix splice problem"
  docbook: fix bio missing parameter
  block: use unitialized_var() in bio_alloc_bioset()
  block: avoid duplicate calls to get_part() in disk stat code
  cfq-iosched: make io priorities inherit CPU scheduling class as well as nice
  block: optimize generic_unplug_device()
  block: get rid of likely/unlikely predictions in merge logic
  vfs: splice remove_suid() cleanup
  cfq-iosched: fix RCU race in the cfq io_context destructor handling
  block: adjust tagging function queue bit locking
  block: sysfs store function needs to grab queue_lock and use queue_flag_*()
2008-05-08 10:48:36 -07:00
Paul Menage
5be7a4792a Fix cpuset sched_relax_domain_level control file
Due to a merge conflict, the sched_relax_domain_level control file was marked
as being handled by cpuset_read/write_u64, but the code to handle it was
actually in cpuset_common_file_read/write.

Since the value being written/read is in fact a signed integer, it should be
treated as such; this patch adds cpuset_read/write_s64 functions, and uses
them to handle the sched_relax_domain_level file.

With this patch, the sched_relax_domain_level can be read and written, and the
correct contents seen/updated.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08 10:46:56 -07:00
Mike Galbraith
46151122e0 sched: fix weight calculations
The conversion between virtual and real time is as follows:

  dvt = rw/w * dt <=> dt = w/rw * dvt

Since we want the fair sleeper granularity to be in real time, we actually
need to do:

  dvt = - rw/w * l

This bug could be related to the regression reported by Yanmin Zhang:

| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.

Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08 17:00:42 +02:00