Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.
The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.
Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users
The check if the queue is full and adding current to the wait queue of
pending msgsnd() operations (ss_add()) must be atomic.
Otherwise:
- the thread that performs msgsnd() finds a full queue and decides to
sleep.
- the thread that performs msgrcv() first reads all messages from the
queue and then sleeps, because the queue is empty.
- the msgrcv() calls do not perform any wakeups, because the msgsnd()
task has not yet called ss_add().
- then the msgsnd()-thread first calls ss_add() and then sleeps.
Net result: msgsnd() and msgrcv() both sleep forever.
Observed with msgctl08 from ltp with a preemptible kernel.
Fix: Call ipc_lock_object() before performing the check.
The patch also moves security_msg_queue_msgsnd() under ipc_lock_object:
- msgctl(IPC_SET) explicitely mentions that it tries to expunge any
pending operations that are not allowed anymore with the new
permissions. If security_msg_queue_msgsnd() is called without locks,
then there might be races.
- it makes the patch much simpler.
Reported-and-tested-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: stable@vger.kernel.org # for 3.11
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
According to 'man msgrcv': "If msgtyp is less than 0, the first message of
the lowest type that is less than or equal to the absolute value of msgtyp
shall be received."
Bug: The kernel only returns a message if its type is 1; other messages
with type < abs(msgtype) will never get returned.
Fix: After having traversed the list to find the first message with the
lowest type, we need to actually return that message.
This regression was introduced by commit daaf74cf08 ("ipc: refactor
msg list search into separate function")
Signed-off-by: Svenning Soerensen <sss@secomea.dk>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup: Some minor points that I noticed while writing the previous
patches
1) The name try_atomic_semop() is misleading: The function performs the
operation (if it is possible).
2) Some documentation updates.
No real code change, a rename and documentation changes.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sem_otime contains the time of the last semaphore operation that
completed successfully. Every operation updates this value, thus access
from multiple cpus can cause thrashing.
Therefore the patch replaces the variable with a per-semaphore variable.
The per-array sem_otime is only calculated when required.
No performance improvement on a single-socket i3 - only important for
larger systems.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.
Since one of the queues must be processed first, this causes an odd
priorization of the wakeups: complex operations have priority over
simple ops.
The patch restores the behavior of linux <=3.0.9: The longest waiting
operation has the highest priority.
This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.
As a side effect, do_smart_update_queue() becomes much simpler: no more
goto logic.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce separate queues for operations that do not modify the
semaphore values. Advantages:
- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations are
always tested, even if the semaphore value is not 0.
- wait-for-zero gets again priority, as in linux <=3.0.9
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As now each semaphore has its own spinlock and parallel operations are
possible, give each semaphore its own cacheline.
On a i3 laptop, this gives up to 28% better performance:
#semscale 10 | grep "interleave 2"
- before:
Cpus 1, interleave 2 delay 0: 36109234 in 10 secs
Cpus 2, interleave 2 delay 0: 55276317 in 10 secs
Cpus 3, interleave 2 delay 0: 62411025 in 10 secs
Cpus 4, interleave 2 delay 0: 81963928 in 10 secs
-after:
Cpus 1, interleave 2 delay 0: 35527306 in 10 secs
Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28%
Cpus 3, interleave 2 delay 0: 80518538 in 10 secs
Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7%
i3, with 2 cores and with hyperthreading enabled. Interleave 2 in order
use first the full cores. HT partially hides the delay from cacheline
trashing, thus the improvement is "only" 8.7% if 4 threads are running.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enforce that ipc_rcu_alloc returns a cacheline aligned pointer on SMP.
Rationale:
The SysV sem code tries to move the main spinlock into a seperate
cacheline (____cacheline_aligned_in_smp). This works only if
ipc_rcu_alloc returns cacheline aligned pointers. vmalloc and kmalloc
return cacheline algined pointers, the implementation of ipc_rcu_alloc
breaks that.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While the INFO cmd doesn't take the ipc lock, the STAT commands do
acquire it unnecessarily. We can do the permissions and security checks
only holding the rcu lock.
This function now mimics semctl_nolock().
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add msq_obtain_object() and msq_obtain_object_check(), which will allow
us to get the ipc object without acquiring the lock. Just as with
semaphores, these functions are basically wrappers around
ipc_obtain_object*().
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to semctl, when calling msgctl, the *_INFO and *_STAT commands
can be performed without acquiring the ipc object.
Add a msgctl_nolock() function and move the logic of *_INFO and *_STAT
out of msgctl(). This change still takes the lock and it will be
properly lockless in the next patch
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function currently acquires both the rw_mutex and the rcu lock on
successful lookups, leaving the callers to explicitly unlock them,
creating another two level locking situation.
Make the callers (including those that still use ipcctl_pre_down())
explicitly lock and unlock the rwsem and rcu lock.
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset continues the work that began in the sysv ipc semaphore
scaling series, see
https://lkml.org/lkml/2013/3/20/546
Just like semaphores used to be, sysv shared memory and msg queues also
abuse the ipc lock, unnecessarily holding it for operations such as
permission and security checks.
This patchset mostly deals with mqueues, and while shared mem can be
done in a very similar way, I want to get these patches out in the open
first. It also does some pending cleanups, mostly focused on the two
level locking we have in ipc code, taking care of ipc_addid() and
ipcctl_pre_down_nolock() - yes there are still functions that need to be
updated as well.
This patch:
Make all callers explicitly take and release the RCU read lock.
This addresses the two level locking seen in newary(), newseg() and
newqueue(). For the last two, explicitly unlock the ipc object and the
rcu lock, instead of calling the custom shm_unlock and msg_unlock
functions. The next patch will deal with the open coded locking for
->perm.lock
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The old audit PATH records for mq_open looked like this:
type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777
dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
obj=system_u:object_r:tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732
dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
...with the audit related changes that went into 3.7, they now look like this:
type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655
dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926
dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
obj=system_u:object_r:tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq"
Both of these look wrong to me. As Steve Grubb pointed out:
"What we need is 1 PATH record that identifies the MQ. The other PATH
records probably should not be there."
Fix it to record the mq root as a parent, and flag it such that it
should be hidden from view when the names are logged, since the root of
the mq filesystem isn't terribly interesting. With this change, we get
a single PATH record that looks more like this:
type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914
dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmpfs_t:s0
In order to do this, a new audit_inode_parent_hidden() function is
added. If we do it this way, then we avoid having the existing callers
of audit_inode needing to do any sort of flag conversion if auditing is
inactive.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_smart_update_queue() is called when an operation (semop,
semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check
which of the sleeping tasks can proceed.
do_smart_update_queue() missed a few wakeups:
- if a sleeping complex op was completed, then all per-semaphore queues
must be scanned - not only those that were modified by *sops
- if a sleeping simple op proceeded, then the global queue must be
scanned again
And:
- the test for "|sops == NULL) before scanning the global queue is not
required: If the global queue is empty, then it doesn't need to be
scanned - regardless of the reason for calling do_smart_update_queue()
The patch is not optimized, i.e. even completing a wait-for-zero
operation causes a rescan. This is done to keep the patch as simple as
possible.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>