Commit Graph

5257 Commits

Author SHA1 Message Date
Serge E. Hallyn
94d6a5f734 user namespaces: document CFS behavior
Documented the currently bogus state of support for CFS user groups with
user namespaces.  In particular, all users in a user namespace should be
children of the user which created the user namespace.  This is yet to
be implemented.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-12-09 09:25:53 +11:00
Serge E. Hallyn
7657d90497 user namespaces: require cap_set{ug}id for CLONE_NEWUSER
While ideally CLONE_NEWUSER will eventually require no
privilege, the required permission checks are currently
not there.  As a result, CLONE_NEWUSER has the same effect
as a setuid(0)+setgroups(1,"0").  While we already require
CAP_SYS_ADMIN, requiring CAP_SETUID and CAP_SETGID seems
appropriate.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-12-08 09:16:27 +11:00
Serge E. Hallyn
c37bbb0fdc user namespaces: let user_ns be cloned with fairsched
(These two patches are in the next-unacked branch of
git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/userns-2.6.
If they get some ACKs, then I hope to feed this into security-next.
After these two, I think we're ready to tackle userns+capabilities)

Fairsched creates a per-uid directory under /sys/kernel/uids/.
So when you clone(CLONE_NEWUSER), it tries to create
/sys/kernel/uids/0, which already exists, and you get back
-ENOMEM.

This was supposed to be fixed by sysfs tagging, but that
was postponed (ok, rejected until sysfs locking is fixed).
So, just as with network namespaces, we just don't create
those directories for user namespaces other than the init.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-12-08 09:16:22 +11:00
James Morris
ec98ce480a Merge branch 'master' into next
Conflicts:
	fs/nfsd/nfs4recover.c

Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().

Signed-off-by: James Morris <jmorris@namei.org>
2008-12-04 17:16:36 +11:00
Arjan van de Ven
a800599283 taint: add missing comment
The description for 'D' was missing in the comment...  (causing me a
minute of WTF followed by looking at more of the code)

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 19:55:24 -08:00
Davide Libenzi
7ef9964e6d epoll: introduce resource usage limits
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface.  Vegard Nossum reported a very simple program (a modified
version attached) that can make a normal user to request a pretty large
amount of kernel memory, well within the its maximum number of fds.  To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced.  A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:

  max_user_instances = Maximum number of devices - per user

  max_user_watches   = Maximum number of "watched" fds - per user

The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM.  As example, a
256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users.  The
default value for "max_user_instances" is set to 128, that should be
enough too.

This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from epoll_create() was already
listed, so that should be ok.

[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 19:55:24 -08:00
Linus Torvalds
9bd062d9ea Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: prevent divide by zero error in cpu_avg_load_per_task, update
  sched, cpusets: fix warning in kernel/cpuset.c
  sched: prevent divide by zero error in cpu_avg_load_per_task
2008-11-30 13:06:47 -08:00
Linus Torvalds
72244c0e68 Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  irq.h: fix missing/extra kernel-doc
  genirq: __irq_set_trigger: change pr_warning to pr_debug
  irq: fix typo
  x86: apic honour irq affinity which was set in early boot
  genirq: fix the affinity setting in setup_irq
  genirq: keep affinities set from userspace across free/request_irq()
2008-11-30 13:06:20 -08:00
Linus Torvalds
93b10052f9 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: consistent alignement for lockdep info
2008-11-30 13:05:46 -08:00
Linus Torvalds
7bbc67fbf6 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ftrace: prevent recursion
  tracing, doc: update mmiotrace documentation
  x86, mmiotrace: fix buffer overrun detection
  function tracing: fix wrong position computing of stack_trace
2008-11-30 13:05:31 -08:00
Christoph Hellwig
96b8936a9e remove __ARCH_WANT_COMPAT_SYS_PTRACE
All architectures now use the generic compat_sys_ptrace, as should every
new architecture that needs 32bit compat (if we'll ever get another).

Remove the now superflous __ARCH_WANT_COMPAT_SYS_PTRACE define, and also
kill a comment about __ARCH_SYS_PTRACE that was added after
__ARCH_SYS_PTRACE was already gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30 11:00:15 -08:00
Al Viro
8419641450 cpuinit fixes in kernel/*
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30 10:03:37 -08:00
Ingo Molnar
af6d596fd6 sched: prevent divide by zero error in cpu_avg_load_per_task, update
Regarding the bug addressed in:

  4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

Linus points out that the fix is not complete:

> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.

So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-29 20:45:15 +01:00
Ingo Molnar
1583715ddb sched, cpusets: fix warning in kernel/cpuset.c
this warning:

  kernel/cpuset.c: In function ‘generate_sched_domains’:
  kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function

triggers because GCC does not recognize that ndoms stays uninitialized
only if doms is NULL - but that flow is covered at the end of
generate_sched_domains().

Help out GCC by initializing this variable to 0. (that's prudent anyway)

Also, this function needs a splitup and code flow simplification:
with 160 lines length it's clearly too long.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-29 20:39:29 +01:00
Steven Rostedt
4cd4262034 sched: prevent divide by zero error in cpu_avg_load_per_task
Impact: fix divide by zero crash in scheduler rebalance irq

While testing the branch profiler, I hit this crash:

divide error: 0000 [#1] PREEMPT SMP
[...]
RIP: 0010:[<ffffffff8024a008>]  [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
[...]
Call Trace:
 <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
 [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
 [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
 [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
 [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
 [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
 [<ffffffff8021047a>] do_softirq+0x94/0x1cd
 [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
 [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
 [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20

The code for cpu_avg_load_per_task has:

	if (rq->nr_running)
		rq->avg_load_per_task = rq->load.weight / rq->nr_running;

The runqueue lock is not held here, and there is nothing that prevents
the rq->nr_running from going to zero after it passes the if condition.

The branch profiler simply made the race window bigger.

This patch saves off the rq->nr_running to a local variable and uses that
for both the condition and the division.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-27 10:29:52 +01:00
Lai Jiangshan
4f5a7f40dd ftrace: prevent recursion
Impact: prevent unnecessary stack recursion

if the resched flag was set before we entered, then don't reschedule.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-27 10:11:53 +01:00
Serge Hallyn
6ded6ab9be User namespaces: use the current_user_ns() macro
Fix up the last current_user()->user_ns instance to use
current_user_ns().

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
2008-11-24 18:57:46 -05:00
Serge Hallyn
18b6e0414e User namespaces: set of cleanups (v2)
The user_ns is moved from nsproxy to user_struct, so that a struct
cred by itself is sufficient to determine access (which it otherwise
would not be).  Corresponding ecryptfs fixes (by David Howells) are
here as well.

Fix refcounting.  The following rules now apply:
        1. The task pins the user struct.
        2. The user struct pins its user namespace.
        3. The user namespace pins the struct user which created it.

User namespaces are cloned during copy_creds().  Unsharing a new user_ns
is no longer possible.  (We could re-add that, but it'll cause code
duplication and doesn't seem useful if PAM doesn't need to clone user
namespaces).

When a user namespace is created, its first user (uid 0) gets empty
keyrings and a clean group_info.

This incorporates a previous patch by David Howells.  Here
is his original patch description:

>I suggest adding the attached incremental patch.  It makes the following
>changes:
>
> (1) Provides a current_user_ns() macro to wrap accesses to current's user
>     namespace.
>
> (2) Fixes eCryptFS.
>
> (3) Renames create_new_userns() to create_user_ns() to be more consistent
>     with the other associated functions and because the 'new' in the name is
>     superfluous.
>
> (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
>     beginning of do_fork() so that they're done prior to making any attempts
>     at allocation.
>
> (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
>     to fill in rather than have it return the new root user.  I don't imagine
>     the new root user being used for anything other than filling in a cred
>     struct.
>
>     This also permits me to get rid of a get_uid() and a free_uid(), as the
>     reference the creds were holding on the old user_struct can just be
>     transferred to the new namespace's creator pointer.
>
> (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
>     preparation rather than doing it in copy_creds().
>
>David

>Signed-off-by: David Howells <dhowells@redhat.com>

Changelog:
	Oct 20: integrate dhowells comments
		1. leave thread_keyring alone
		2. use current_user_ns() in set_user()

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
2008-11-24 18:57:41 -05:00
Pekka Paalanen
7ee1768ddb x86, mmiotrace: fix buffer overrun detection
Impact: fix mmiotrace overrun tracing

When ftrace framework moved to use the ring buffer facility, the buffer
overrun detection was broken after 2.6.27 by commit

| commit 3928a8a2d9
| Author: Steven Rostedt <rostedt@goodmis.org>
| Date:   Mon Sep 29 23:02:41 2008 -0400
|
|     ftrace: make work with new ring buffer
|
|     This patch ports ftrace over to the new ring buffer.

The detection is now fixed by using the ring buffer API.

When mmiotrace detects a buffer overrun, it will report the number of
lost events. People reading an mmiotrace log must know if something was
missed, otherwise the data may not make sense.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-23 20:33:23 +01:00
Ingo Molnar
9f14416442 Merge commit 'v2.6.28-rc6' into irq/urgent 2008-11-23 10:52:33 +01:00
Li Zefan
b0788caf7a lockdep: consistent alignement for lockdep info
Impact: prettify /proc/lockdep_info

Just feel odd that not all lines of lockdep info are aligned.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-21 08:59:40 +01:00
Liming Wang
522a110b42 function tracing: fix wrong position computing of stack_trace
Impact: make output of stack_trace complete if buffer overruns

When read buffer overruns, the output of stack_trace isn't complete.

When printing records with seq_printf in t_show, if the read buffer
has overruned by the current record, then this record won't be
printed to user space through read buffer, it will just be dropped in
this printing.

When next printing, t_start should return the "*pos"th record, which
is the one dropped by previous printing, but it just returns
(m->private + *pos)th record.

Here we use a more sane method to implement seq_operations which can
be found in kernel code. Thus we needn't initialize m->private.

About testing, it's not easy to overrun read buffer, but we can use
seq_printf to print more padding bytes in t_show, then it's easy to
check whether or not records are lost.

This commit has been tested on both condition of overrun and non
overrun.

Signed-off-by: Liming Wang <liming.wang@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-21 08:49:52 +01:00
Linus Torvalds
95763dd52b Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ftrace: fix dyn ftrace filter selection
  ftrace: make filtered functions effective on setting
  ftrace: fix set_ftrace_filter
  trace: introduce missing mutex_unlock()
  tracing: kernel/trace/trace.c: introduce missing kfree()
2008-11-20 13:11:21 -08:00
Li Zefan
33d283bef2 cgroups: fix a serious bug in cgroupstats
Try this, and you'll get oops immediately:
 # cd Documentation/accounting/
 # gcc -o getdelays getdelays.c
 # mount -t cgroup -o debug xxx /mnt
 # ./getdelays -C /mnt/tasks

Because a normal file's dentry->d_fsdata is a pointer to struct cftype,
not struct cgroup.

After the patch, it returns EINVAL if we try to get cgroupstats
from a normal file.

Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19 18:50:00 -08:00
Hugh Dickins
966c8c12dc sprint_symbol(): use less stack
sprint_symbol(), itself used when dumping stacks, has been wasting 128
bytes of stack: lookup the symbol directly into the buffer supplied by the
caller, instead of using a locally declared namebuf.

I believe the name != buffer strcpy() is obsolete: the design here dates
from when module symbol lookup pointed into a supposedly const but sadly
volatile table; nowadays it copies, but an uncalled strcpy() looks better
here than the risk of a recursive BUG_ON().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19 18:49:58 -08:00