Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular. So using module_init as an
alias for __initcall can be somewhat misleading.
Fix these up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.
The audit targets the following module_init users for change:
kernel/user.c obj-y
kernel/kexec.c bool KEXEC (one instance per arch)
kernel/profile.c bool PROFILING
kernel/hung_task.c bool DETECT_HUNG_TASK
kernel/sched/stats.c bool SCHEDSTATS
kernel/user_namespace.c bool USER_NS
Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e. slightly earlier). However no observable impact of that
difference has been observed during testing.
Also, two instances of missing ";" at EOL are fixed in kexec.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for per-user_namespace registers of persistent per-UID kerberos
caches held within the kernel.
This allows the kerberos cache to be retained beyond the life of all a user's
processes so that the user's cron jobs can work.
The kerberos cache is envisioned as a keyring/key tree looking something like:
struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 big_key - A ccache blob
\___ tkt12345 big_key - Another ccache blob
Or possibly:
struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 keyring - A ccache
\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
\___ http/REDHAT.COM@REDHAT.COM user
\___ afs/REDHAT.COM@REDHAT.COM user
\___ nfs/REDHAT.COM@REDHAT.COM user
\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
\___ http/KERNEL.ORG@KERNEL.ORG big_key
What goes into a particular Kerberos cache is entirely up to userspace. Kernel
support is limited to giving you the Kerberos cache keyring that you want.
The user asks for their Kerberos cache by:
krb_cache = keyctl_get_krbcache(uid, dest_keyring);
The uid is -1 or the user's own UID for the user's own cache or the uid of some
other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
mess with the cache.
The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
search, clear, invalidate, unlink from and add links to. Active LSMs get a
chance to rule on whether the caller is permitted to make a link.
Each uid's cache keyring is created when it first accessed and is given a
timeout that is extended each time this function is called so that the keyring
goes away after a while. The timeout is configurable by sysctl but defaults to
three days.
Each user_namespace struct gets a lazily-created keyring that serves as the
register. The cache keyrings are added to it. This means that standard key
search and garbage collection facilities are available.
The user_namespace struct's register goes away when it does and anything left
in it is then automatically gc'd.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Simo Sorce <simo@redhat.com>
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.
Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.
Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.
proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.
Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.
In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on it's parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.
Remove struct kref and use a plain atomic_t. Making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explict to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.
I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nesteuser namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Pointed-out-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.
A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.
We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.
I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Implement kprojid_t a cousin of the kuid_t and kgid_t.
The per user namespace mapping of project id values can be set with
/proc/<pid>/projid_map.
A full compliment of helpers is provided: make_kprojid, from_kprojid,
from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
projid_eq, projid_lt.
Project identifiers are part of the generic disk quota interface,
although it appears only xfs implements project identifiers currently.
The xfs code allows anyone who has permission to set the project
identifier on a file to use any project identifier so when
setting up the user namespace project identifier mappings I do
not require a capability.
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
On 32bit builds gcc says:
kernel/user.c:30:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
kernel/user.c:38:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
Silence gcc by changing the constant 4294967295 to 4294967295U.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Convert the old uid mapping functions into compatibility wrappers
- Add a uid/gid mapping layer from user space uid and gids to kernel
internal uids and gids that is extent based for simplicty and speed.
* Working with number space after mapping uids/gids into their kernel
internal version adds only mapping complexity over what we have today,
leaving the kernel code easy to understand and test.
- Add proc files /proc/self/uid_map /proc/self/gid_map
These files display the mapping and allow a mapping to be added
if a mapping does not exist.
- Allow entering the user namespace without a uid or gid mapping.
Since we are starting with an existing user our uids and gids
still have global mappings so are still valid and useful they just don't
have local mappings. The requirement for things to work are global uid
and gid so it is odd but perfectly fine not to have a local uid
and gid mapping.
Not requiring global uid and gid mappings greatly simplifies
the logic of setting up the uid and gid mappings by allowing
the mappings to be set after the namespace is created which makes the
slight weirdness worth it.
- Make the mappings in the initial user namespace to the global
uid/gid space explicit. Today it is an identity mapping
but in the future we may want to twist this for debugging, similar
to what we do with jiffies.
- Document the memory ordering requirements of setting the uid and
gid mappings. We only allow the mappings to be set once
and there are no pointers involved so the requirments are
trivial but a little atypical.
Performance:
In this scheme for the permission checks the performance is expected to
stay the same as the actuall machine instructions should remain the same.
The worst case I could think of is ls -l on a large directory where
all of the stat results need to be translated with from kuids and
kgids to uids and gids. So I benchmarked that case on my laptop
with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.
My benchmark consisted of going to single user mode where nothing else
was running. On an ext4 filesystem opening 1,000,000 files and looping
through all of the files 1000 times and calling fstat on the
individuals files. This was to ensure I was benchmarking stat times
where the inodes were in the kernels cache, but the inode values were
not in the processors cache. My results:
v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
All of the configurations ran in roughly 120ns when I performed tests
that ran in the cpu cache.
So in summary the performance impact is:
1ns improvement in the worst case with user namespace support compiled out.
8ns aka 5% slowdown in the worst case with user namespace support compiled in.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Transform userns->creator from a user_struct reference to a simple
kuid_t, kgid_t pair.
In cap_capable this allows the check to see if we are the creator of
a namespace to become the classic suser style euid permission check.
This allows us to remove the need for a struct cred in the mapping
functions and still be able to dispaly the user namespace creators
uid and gid as 0.
- Remove the now unnecessary delayed_work in free_user_ns.
All that is left for free_user_ns to do is to call kmem_cache_free
and put_user_ns. Those functions can be called in any context
so call them directly from free_user_ns removing the need for delayed work.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Modify alloc_uid to take a kuid and make the user hash table global.
Stop holding a reference to the user namespace in struct user_struct.
This simplifies the code and makes the per user accounting not
care about which user namespace a uid happens to appear in.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
With a user_ns reference in struct cred the only user of the user namespace
reference in struct user_struct is to keep the uid hash table alive.
The user_namespace reference in struct user_struct will be going away soon, and
I have removed all of the references. Rename the field from user_ns to _user_ns
so that the compiler can verify nothing follows the user struct to the user
namespace anymore.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.
Nothing to see here but a whole lot of instances of:
-#include <linux/module.h>
+#include <linux/export.h>
This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The expected course of development for user namespaces targeted
capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.
Goals:
- Make it safe for an unprivileged user to unshare namespaces. They
will be privileged with respect to the new namespace, but this should
only include resources which the unprivileged user already owns.
- Provide separate limits and accounting for userids in different
namespaces.
Status:
Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
CAP_SETGID capabilities. What this gets you is a whole new set of
userids, meaning that user 500 will have a different 'struct user' in
your namespace than in other namespaces. So any accounting information
stored in struct user will be unique to your namespace.
However, throughout the kernel there are checks which
- simply check for a capability. Since root in a child namespace
has all capabilities, this means that a child namespace is not
constrained.
- simply compare uid1 == uid2. Since these are the integer uids,
uid 500 in namespace 1 will be said to be equal to uid 500 in
namespace 2.
As a result, the lxc implementation at lxc.sf.net does not use user
namespaces. This is actually helpful because it leaves us free to
develop user namespaces in such a way that, for some time, user
namespaces may be unuseful.
Bugs aside, this patchset is supposed to not at all affect systems which
are not actively using user namespaces, and only restrict what tasks in
child user namespace can do. They begin to limit privilege to a user
namespace, so that root in a container cannot kill or ptrace tasks in the
parent user namespace, and can only get world access rights to files.
Since all files currently belong to the initila user namespace, that means
that child user namespaces can only get world access rights to *all*
files. While this temporarily makes user namespaces bad for system
containers, it starts to get useful for some sandboxing.
I've run the 'runltplite.sh' with and without this patchset and found no
difference.
This patch:
copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
will have the new user namespace as its owner. That is what we want,
since we want root in that new userns to be able to have privilege over
it.
Changelog:
Feb 15: don't set uts_ns->user_ns if we didn't create
a new uts_ns.
Feb 23: Move extern init_user_ns declaration from
init/version.c to utsname.h.
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When racing on adding into user cache, the new allocated from mm slab
is freed without putting user namespace.
Since the user namespace is already operated by getting, putting has
to be issued.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo triggered the following warning:
WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
Hardware name: System Product Name
ODEBUG: init active object type: timer_list
Modules linked in:
Pid: 2619, comm: dmesg Tainted: G W 2.6.32-rc5-tip+ #5298
Call Trace:
[<81035443>] warn_slowpath_common+0x6a/0x81
[<8120e483>] ? debug_print_object+0x42/0x50
[<81035498>] warn_slowpath_fmt+0x29/0x2c
[<8120e483>] debug_print_object+0x42/0x50
[<8120ec2a>] __debug_object_init+0x279/0x2d7
[<8120ecb3>] debug_object_init+0x13/0x18
[<810409d2>] init_timer_key+0x17/0x6f
[<81041526>] free_uid+0x50/0x6c
[<8104ed2d>] put_cred_rcu+0x61/0x72
[<81067fac>] rcu_do_batch+0x70/0x121
debugobjects warns about an enqueued timer being initialized. If
CONFIG_USER_SCHED=y the user management code uses delayed work to
remove the user from the hash table and tear down the sysfs objects.
free_uid is called from RCU and initializes/schedules delayed work if
the usage count of the user_struct is 0. The init/schedule happens
outside of the uidhash_lock protected region which allows a concurrent
caller of find_user() to reference the about to be destroyed
user_struct w/o preventing the work from being scheduled. If the next
free_uid call happens before the work timer expired then the active
timer is initialized and the work scheduled again.
The race was introduced in commit 5cb350ba (sched: group scheduling,
sysfs tunables) and made more prominent by commit 3959214f (sched:
delayed cleanup of user_struct)
Move the init/schedule_delayed_work inside of the uidhash_lock
protected region to prevent the race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: stable@kernel.org