The current logic first clones the extent array and sorts both copies, then
maps the lower IDs of the forward mapping into the lower namespace, but
doesn't map the lower IDs of the reverse mapping.
This means that code in a nested user namespace with >5 extents will see
incorrect IDs. It also breaks some access checks, like
inode_owner_or_capable() and privileged_wrt_inode_uidgid(), so a process
can incorrectly appear to be capable relative to an inode.
To fix it, we have to make sure that the "lower_first" members of extents
in both arrays are translated; and we have to make sure that the reverse
map is sorted *after* the translation (since otherwise the translation can
break the sorting).
This is CVE-2018-18955.
Fixes: 6397fac491 ("userns: bump idmap limits to 340")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Tested-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The old code would hold the userns_state_mutex indefinitely if
memdup_user_nul stalled due to e.g. a userfault region. Prevent that by
moving the memdup_user_nul in front of the mutex_lock().
Note: This changes the error precedence of invalid buf/count/*ppos vs
map already written / capabilities missing.
Fixes: 22d917d80e ("userns: Rework the user_namespace adding uid/gid...")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed for a mount
done with user namespace root permissions. In such cases allow_other should
not allow users outside the userns to access the mount as doing so would
give the unprivileged user the ability to manipulate processes it would
otherwise be unable to manipulate. Restrict allow_other to apply to users
in the same userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a module.
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull user namespace update from Eric Biederman:
"The only change that is production ready this round is the work to
increase the number of uid and gid mappings a user namespace can
support from 5 to 340.
This code was carefully benchmarked and it was confirmed that in the
existing cases the performance remains the same. In the worst case
with 340 mappings an cache cold stat times go from 158ns to 248ns.
That is noticable but still quite small, and only the people who are
doing crazy things pay the cost.
This work uncovered some documentation and cleanup opportunities in
the mapping code, and patches to make those cleanups and improve the
documentation will be coming in the next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Simplify insert_extent
userns: Make map_id_down a wrapper for map_id_range_down
userns: Don't read extents twice in m_start
userns: Simplify the user and group mapping functions
userns: Don't special case a count of 0
userns: bump idmap limits to 340
userns: use union in {g,u}idmap struct
Consolidate the code to write to the new mapping at the end of the
function to remove the duplication. Move the increase in the number
of mappings into insert_extent, keeping the logic together.
Just a small increase in readability and maintainability.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There is no good reason for this code duplication, the number of cache
line accesses not the number of instructions are the bottleneck in
this code.
Therefore simplify maintenance by removing unnecessary code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
the map is being written does not do strange things.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Consolidate reading the number of extents and computing the return
value in the map_id_down, map_id_range_down and map_id_range.
This removal of one read of extents makes one smp_rmb unnecessary
and makes the code safe it is executed during the map write. Reading
the number of extents twice and depending on the result being the same
is not safe, as it could be 0 the first time and > 5 the second time,
which would lead to misinterpreting the union fields.
The consolidation of the return value just removes a duplicate
caluculation which should make it easier to understand and maintain
the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are quite some use cases where users run into the current limit for
{g,u}id mappings. Consider a user requesting us to map everything but 999, and
1001 for a given range of 1000000000 with a sub{g,u}id layout of:
some-user:100000:1000000000
some-user:999:1
some-user:1000:1
some-user:1001:1
some-user:1002:1
This translates to:
MAPPING-TYPE | CONTAINER | HOST | RANGE |
-------------|-----------|---------|-----------|
uid | 999 | 999 | 1 |
uid | 1001 | 1001 | 1 |
uid | 0 | 1000000 | 999 |
uid | 1000 | 1001000 | 1 |
uid | 1002 | 1001002 | 999998998 |
------------------------------------------------
gid | 999 | 999 | 1 |
gid | 1001 | 1001 | 1 |
gid | 0 | 1000000 | 999 |
gid | 1000 | 1001000 | 1 |
gid | 1002 | 1001002 | 999998998 |
which is already the current limit.
As discussed at LPC simply bumping the number of limits is not going to work
since this would mean that struct uid_gid_map won't fit into a single cache-line
anymore thereby regressing performance for the base-cases. The same problem
seems to arise when using a single pointer. So the idea is to use
struct uid_gid_extent {
u32 first;
u32 lower_first;
u32 count;
};
struct uid_gid_map { /* 64 bytes -- 1 cache line */
u32 nr_extents;
union {
struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
struct {
struct uid_gid_extent *forward;
struct uid_gid_extent *reverse;
};
};
};
For the base cases we will only use the struct uid_gid_extent extent member. If
we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
kmalloc() which means we can have a maximum of 340 mappings
(340 * size(struct uid_gid_extent) = 4080). For the latter case we use two
pointers "forward" and "reverse". The forward pointer points to an array sorted
by "first" and the reverse pointer points to an array sorted by "lower_first".
We can then perform binary search on those arrays.
Performance Testing:
When Eric introduced the extent-based struct uid_gid_map approach he measured
the performanc impact of his idmap changes:
> My benchmark consisted of going to single user mode where nothing else was
> running. On an ext4 filesystem opening 1,000,000 files and looping through all
> of the files 1000 times and calling fstat on the individuals files. This was
> to ensure I was benchmarking stat times where the inodes were in the kernels
> cache, but the inode values were not in the processors cache. My results:
> v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
I used an identical approach on my laptop. Here's a thorough description of what
I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
booted into single user mode and used an ext4 filesystem to open/create
1,000,000 files. Then I looped through all of the files calling fstat() on each
of them 1000 times and calculated the mean fstat() time for a single file. (The
test program can be found below.)
Here are the results. For fun, I compared the first version of my patch which
scaled linearly with the new version of the patch:
| # MAPPINGS | PATCH-V1 | PATCH-NEW |
|--------------|------------|-----------|
| 0 mappings | 158 ns | 158 ns |
| 1 mappings | 164 ns | 157 ns |
| 2 mappings | 170 ns | 158 ns |
| 3 mappings | 175 ns | 161 ns |
| 5 mappings | 187 ns | 165 ns |
| 10 mappings | 218 ns | 199 ns |
| 50 mappings | 528 ns | 218 ns |
| 100 mappings | 980 ns | 229 ns |
| 200 mappings | 1880 ns | 239 ns |
| 300 mappings | 2760 ns | 240 ns |
| 340 mappings | not tested | 248 ns |
Here's the test program I used. I asked Eric what he did and this is a more
"advanced" implementation of the idea. It's pretty straight-forward:
#define __GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include <errno.h>
#include <dirent.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
int main(int argc, char *argv[])
{
int ret;
size_t i, k;
int fd[1000000];
int times[1000];
char pathname[4096];
struct stat st;
struct timeval t1, t2;
uint64_t time_in_mcs;
uint64_t sum = 0;
if (argc != 2) {
fprintf(stderr, "Please specify a directory where to create "
"the test files\n");
exit(EXIT_FAILURE);
}
for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
fd[i]= open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
if (fd[i] < 0) {
ssize_t j;
for (j = i; j >= 0; j--)
close(fd[j]);
exit(EXIT_FAILURE);
}
}
for (k = 0; k < 1000; k++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto close_all;
for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
ret = fstat(fd[i], &st);
if (ret < 0)
goto close_all;
}
ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto close_all;
time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
(1000000 * t1.tv_sec + t1.tv_usec);
printf("Total time in micro seconds: %" PRIu64 "\n",
time_in_mcs);
printf("Total time in nanoseconds: %" PRIu64 "\n",
time_in_mcs * 1000);
printf("Time per file in nanoseconds: %" PRIu64 "\n",
(time_in_mcs * 1000) / 1000000);
times[k] = (time_in_mcs * 1000) / 1000000;
}
close_all:
for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
close(fd[i]);
if (ret < 0)
exit(EXIT_FAILURE);
for (k = 0; k < 1000; k++) {
sum += times[k];
}
printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);
exit(EXIT_SUCCESS);;
}
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
CC: Serge Hallyn <serge@hallyn.com>
CC: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is pointless and confusing to allow a pid namespace hierarchy and
the user namespace hierarchy to get out of sync. The owner of a child
pid namespace should be the owner of the parent pid namespace or
a descendant of the owner of the parent pid namespace.
Otherwise it is possible to construct scenarios where a process has a
capability over a parent pid namespace but does not have the
capability over a child pid namespace. Which confusingly makes
permission checks non-transitive.
It requires use of setns into a pid namespace (but not into a user
namespace) to create such a scenario.
Add the function in_userns to help in making this determination.
v2: Optimized in_userns by using level as suggested
by: Kirill Tkhai <ktkhai@virtuozzo.com>
Ref: 49f4d8b93c ("pidns: Capture the user namespace and filter ns_last_pid")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
From: Andrey Vagin <avagin@openvz.org>
Each namespace has an owning user namespace and now there is not way
to discover these relationships.
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.
Why we may want to know relationships between namespaces?
One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?
One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.
There [1] was a discussion about which interface to choose to determing
relationships between namespaces.
Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.
Here is an implementaions of these ioctl-s.
$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:
fd = ioctl(ns_fd, ioctl_type);
where ioctl_type is one of the following:
NS_GET_USERNS
Returns a file descriptor that refers to an owning user namesā
pace.
NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.
In addition to generic ioctl(2) errors, the following specific ones
can occur:
EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
EPERM The requested namespace is outside of the current namespace
scope.
[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101
Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric
Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships.
In a future we will use this interface to dump and restore nested
namespaces.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Return -EPERM if an owning user namespace is outside of a process
current user namespace.
v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.
The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The same kind of recursive sane default limit and policy
countrol that has been implemented for the user namespace
is desirable for the other namespaces, so generalize
the user namespace refernce count into a ucount.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a structure that is per user and per user ns and use it to hold
the count of user namespaces. This makes prevents one user from
creating denying service to another user by creating the maximum
number of user namespaces.
Rename the sysctl export of the maximum count from
/proc/sys/userns/max_user_namespaces to /proc/sys/user/max_user_namespaces
to reflect that the count is now per user.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Export the export the maximum number of user namespaces as
/proc/sys/userns/max_user_namespaces.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Limit per userns sysctls to only be opened for write by a holder
of CAP_SYS_RESOURCE.
Add all of the necessary boilerplate for having per user namespace
sysctls.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add the necessary boiler plate to move freeing of user namespaces into
work queue and thus into process context where things can sleep.
This is a necessary precursor to per user namespace sysctls.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, current_in_user_ns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
--EWB Replaced in_userns with the simpler current_in_userns.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>