Commit Graph

194 Commits

Author SHA1 Message Date
Linus Torvalds
f5b7769eb0 Revert "debugfs: inode: debugfs_create_dir uses mode permission from parent"
This reverts commit 95cde3c599.

The commit had good intentions, but it breaks kvm-tool and qemu-kvm.

With it in place, "lkvm run" just fails with

  Error: KVM_CREATE_VM ioctl
  Warning: Failed init: kvm__init

which isn't a wonderful error message, but bisection pinpointed the
problematic commit.

The problem is almost certainly due to the special kvm debugfs entries
created dynamically by kvm under /sys/kernel/debug/kvm/.  See
kvm_create_vm_debugfs()

Bisected-and-reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-12 20:52:16 -07:00
Thomas Richter
95cde3c599 debugfs: inode: debugfs_create_dir uses mode permission from parent
Currently function debugfs_create_dir() creates a new
directory in the debugfs (usually mounted /sys/kernel/debug)
with permission rwxr-xr-x. This is hard coded.

Change this to use the parent directory permission.

Output before the patch:
root@s8360047 ~]# tree -dp -L 1 /sys/kernel/debug/
/sys/kernel/debug/
├── [drwxr-xr-x]  bdi
├── [drwxr-xr-x]  block
├── [drwxr-xr-x]  dasd
├── [drwxr-xr-x]  device_component
├── [drwxr-xr-x]  extfrag
├── [drwxr-xr-x]  hid
├── [drwxr-xr-x]  kprobes
├── [drwxr-xr-x]  kvm
├── [drwxr-xr-x]  memblock
├── [drwxr-xr-x]  pm_qos
├── [drwxr-xr-x]  qdio
├── [drwxr-xr-x]  s390
├── [drwxr-xr-x]  s390dbf
└── [drwx------]  tracing

14 directories
[root@s8360047 linux]#

Output after the patch:
[root@s8360047 ~]# tree -dp -L 1 /sys/kernel/debug/
sys/kernel/debug/
├── [drwx------]  bdi
├── [drwx------]  block
├── [drwx------]  dasd
├── [drwx------]  device_component
├── [drwx------]  extfrag
├── [drwx------]  hid
├── [drwx------]  kprobes
├── [drwx------]  kvm
├── [drwx------]  memblock
├── [drwx------]  pm_qos
├── [drwx------]  qdio
├── [drwx------]  s390
├── [drwx------]  s390dbf
└── [drwx------]  tracing

14 directories
[root@s8360047 linux]#

Here is the full diff output done with:
[root@s8360047 ~]# diff -u treefull.before treefull.after |
	sed 's-^- # -' > treefull.diff
 # --- treefull.before	2018-04-27 13:22:04.532824564 +0200
 # +++ treefull.after	2018-04-27 13:24:12.106182062 +0200
 # @@ -1,55 +1,55 @@
 #  /sys/kernel/debug/
 # -├── [drwxr-xr-x]  bdi
 # -│   ├── [drwxr-xr-x]  1:0
 # -│   ├── [drwxr-xr-x]  1:1
 # -│   ├── [drwxr-xr-x]  1:10
 # -│   ├── [drwxr-xr-x]  1:11
 # -│   ├── [drwxr-xr-x]  1:12
 # -│   ├── [drwxr-xr-x]  1:13
 # -│   ├── [drwxr-xr-x]  1:14
 # -│   ├── [drwxr-xr-x]  1:15
 # -│   ├── [drwxr-xr-x]  1:2
 # -│   ├── [drwxr-xr-x]  1:3
 # -│   ├── [drwxr-xr-x]  1:4
 # -│   ├── [drwxr-xr-x]  1:5
 # -│   ├── [drwxr-xr-x]  1:6
 # -│   ├── [drwxr-xr-x]  1:7
 # -│   ├── [drwxr-xr-x]  1:8
 # -│   ├── [drwxr-xr-x]  1:9
 # -│   └── [drwxr-xr-x]  94:0
 # -├── [drwxr-xr-x]  block
 # -├── [drwxr-xr-x]  dasd
 # -│   ├── [drwxr-xr-x]  0.0.e18a
 # -│   ├── [drwxr-xr-x]  dasda
 # -│   └── [drwxr-xr-x]  global
 # -├── [drwxr-xr-x]  device_component
 # -├── [drwxr-xr-x]  extfrag
 # -├── [drwxr-xr-x]  hid
 # -├── [drwxr-xr-x]  kprobes
 # -├── [drwxr-xr-x]  kvm
 # -├── [drwxr-xr-x]  memblock
 # -├── [drwxr-xr-x]  pm_qos
 # -├── [drwxr-xr-x]  qdio
 # -│   └── [drwxr-xr-x]  0.0.f5f2
 # -├── [drwxr-xr-x]  s390
 # -│   └── [drwxr-xr-x]  stsi
 # -├── [drwxr-xr-x]  s390dbf
 # -│   ├── [drwxr-xr-x]  0.0.e18a
 # -│   ├── [drwxr-xr-x]  cio_crw
 # -│   ├── [drwxr-xr-x]  cio_msg
 # -│   ├── [drwxr-xr-x]  cio_trace
 # -│   ├── [drwxr-xr-x]  dasd
 # -│   ├── [drwxr-xr-x]  kvm-trace
 # -│   ├── [drwxr-xr-x]  lgr
 # -│   ├── [drwxr-xr-x]  qdio_0.0.f5f2
 # -│   ├── [drwxr-xr-x]  qdio_error
 # -│   ├── [drwxr-xr-x]  qdio_setup
 # -│   ├── [drwxr-xr-x]  qeth_card_0.0.f5f0
 # -│   ├── [drwxr-xr-x]  qeth_control
 # -│   ├── [drwxr-xr-x]  qeth_msg
 # -│   ├── [drwxr-xr-x]  qeth_setup
 # -│   ├── [drwxr-xr-x]  vmcp
 # -│   └── [drwxr-xr-x]  vmur
 # +├── [drwx------]  bdi
 # +│   ├── [drwx------]  1:0
 # +│   ├── [drwx------]  1:1
 # +│   ├── [drwx------]  1:10
 # +│   ├── [drwx------]  1:11
 # +│   ├── [drwx------]  1:12
 # +│   ├── [drwx------]  1:13
 # +│   ├── [drwx------]  1:14
 # +│   ├── [drwx------]  1:15
 # +│   ├── [drwx------]  1:2
 # +│   ├── [drwx------]  1:3
 # +│   ├── [drwx------]  1:4
 # +│   ├── [drwx------]  1:5
 # +│   ├── [drwx------]  1:6
 # +│   ├── [drwx------]  1:7
 # +│   ├── [drwx------]  1:8
 # +│   ├── [drwx------]  1:9
 # +│   └── [drwx------]  94:0
 # +├── [drwx------]  block
 # +├── [drwx------]  dasd
 # +│   ├── [drwx------]  0.0.e18a
 # +│   ├── [drwx------]  dasda
 # +│   └── [drwx------]  global
 # +├── [drwx------]  device_component
 # +├── [drwx------]  extfrag
 # +├── [drwx------]  hid
 # +├── [drwx------]  kprobes
 # +├── [drwx------]  kvm
 # +├── [drwx------]  memblock
 # +├── [drwx------]  pm_qos
 # +├── [drwx------]  qdio
 # +│   └── [drwx------]  0.0.f5f2
 # +├── [drwx------]  s390
 # +│   └── [drwx------]  stsi
 # +├── [drwx------]  s390dbf
 # +│   ├── [drwx------]  0.0.e18a
 # +│   ├── [drwx------]  cio_crw
 # +│   ├── [drwx------]  cio_msg
 # +│   ├── [drwx------]  cio_trace
 # +│   ├── [drwx------]  dasd
 # +│   ├── [drwx------]  kvm-trace
 # +│   ├── [drwx------]  lgr
 # +│   ├── [drwx------]  qdio_0.0.f5f2
 # +│   ├── [drwx------]  qdio_error
 # +│   ├── [drwx------]  qdio_setup
 # +│   ├── [drwx------]  qeth_card_0.0.f5f0
 # +│   ├── [drwx------]  qeth_control
 # +│   ├── [drwx------]  qeth_msg
 # +│   ├── [drwx------]  qeth_setup
 # +│   ├── [drwx------]  vmcp
 # +│   └── [drwx------]  vmur
 #  └── [drwx------]  tracing
 #      ├── [drwxr-xr-x]  events
 #      │   ├── [drwxr-xr-x]  alarmtimer

Fixes: edac65eaf8 ("debugfs: take mode-dependent parts of debugfs_get_inode() into callers")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-14 16:48:18 +02:00
Andy Shevchenko
964f8363a1 debugfs: Re-use kstrtobool_from_user()
Re-use kstrtobool_from_user() instead of open coded variant.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-14 16:48:18 +02:00
Al Viro
cd1c0c9321 debugfs_lookup(): switch to lookup_one_len_unlocked()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:47 -04:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Al Viro
cfe39442ab use linux/poll.h instead of asm/poll.h
The only place that has any business including asm/poll.h
is linux/poll.h.  Fortunately, asm/poll.h had only been
included in 3 places beyond that one, and all of them
are trivial to switch to using linux/poll.h.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-01 16:23:11 -05:00
Al Viro
076ccb76e1 fs: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:05 -05:00
Al Viro
e6c8adca20 anntotate the places where ->poll() return values go
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:19:53 -05:00
Greg Kroah-Hartman
2b2d8788dd debugfs: Remove redundant license text
Now that the SPDX tag is in all debugfs files, that identifies the
license in a specific and legally-defined manner.  So the extra GPL text
wording can be removed as it is no longer needed at all.

This is done on a quest to remove the 700+ different ways that files in
the kernel describe the GPL license text.  And there's unneeded stuff
like the address (sometimes incorrect) for the FSF which is never
needed.

No copyright headers or other non-license-description text was removed.

Cc: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Greg Kroah-Hartman
3bce94fd5f debugfs: add SPDX identifiers to all debugfs files
It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.

Update the debugfs files files with the correct SPDX license identifier
based on the license text in the file itself.  The SPDX identifier is a
legally binding shorthand, which can be used instead of the full boiler
plate text.

This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Nicolai Stange
7d39bc50c4 debugfs: defer debugfs_fsdata allocation to first usage
Currently, __debugfs_create_file allocates one struct debugfs_fsdata
instance for every file created. However, there are potentially many
debugfs file around, most of which are never touched by userspace.

Thus, defer the allocations to the first usage, i.e. to the first
debugfs_file_get().

A dentry's ->d_fsdata starts out to point to the "real", user provided
fops. After a debugfs_fsdata instance has been allocated (and the real
fops pointer has been moved over into its ->real_fops member),
->d_fsdata is changed to point to it from then on. The two cases are
distinguished by setting BIT(0) for the real fops case.

struct debugfs_fsdata's foremost purpose is to track active users and to
make debugfs_remove() block until they are done. Since no debugfs_fsdata
instance means no active users, make debugfs_remove() return immediately
in this case.

Take care of possible races between debugfs_file_get() and
debugfs_remove(): either debugfs_remove() must see a debugfs_fsdata
instance and thus wait for possible active users or debugfs_file_get() must
see a dead dentry and return immediately.

Make a dentry's ->d_release(), i.e. debugfs_release_dentry(), check whether
->d_fsdata is actually a debugfs_fsdata instance before kfree()ing it.

Similarly, make debugfs_real_fops() check whether ->d_fsdata is actually
a debugfs_fsdata instance before returning it, otherwise emit a warning.

The set of possible error codes returned from debugfs_file_get() has grown
from -EIO to -EIO and -ENOMEM. Make open_proxy_open() and full_proxy_open()
pass the -ENOMEM onwards to their callers.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Nicolai Stange
154b9d7512 debugfs: call debugfs_real_fops() only after debugfs_file_get()
The current implementation of debugfs_real_fops() relies on a
debugfs_fsdata instance to be installed at ->d_fsdata.

With future patches introducing lazy allocation of these, this requirement
will be guaranteed to be fullfilled only inbetween a
debugfs_file_get()/debugfs_file_put() pair.

The full proxies' fops implemented by debugfs happen to be the only
offenders. Fix them up by moving their debugfs_real_fops() calls past those
to debugfs_file_get().

full_proxy_release() is special as it doesn't invoke debugfs_file_get() at
all. Leave it alone for now.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Nicolai Stange
c9afbec270 debugfs: purge obsolete SRCU based removal protection
Purge the SRCU based file removal race protection in favour of the new,
refcount based debugfs_file_get()/debugfs_file_put() API.

Fixes: 49d200deaa ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
69d29f9e6a debugfs: convert to debugfs_file_get() and -put()
Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() from the debugfs core itself to the new
debugfs_file_get() and debugfs_file_put() API.

Fixes: 49d200deaa ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
055ab8e3e3 debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
Currently, debugfs_real_fops() is annotated with a
__must_hold(&debugfs_srcu) sparse annotation.

With the conversion of the SRCU based protection of users against
concurrent file removals to a per-file refcount based scheme, this becomes
wrong.

Drop this annotation.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
e9117a5a4b debugfs: implement per-file removal protection
Since commit 49d200deaa ("debugfs: prevent access to removed files'
private data"), accesses to a file's private data are protected from
concurrent removal by covering all file_operations with a SRCU read section
and sychronizing with those before returning from debugfs_remove() by means
of synchronize_srcu().

As pointed out by Johannes Berg, there are debugfs files with forever
blocking file_operations. Their corresponding SRCU read side sections would
block any debugfs_remove() forever as well, even unrelated ones. This
results in a livelock. Because a remover can't cancel any indefinite
blocking within foreign files, this is a problem.

Resolve this by introducing support for more granular protection on a
per-file basis.

This is implemented by introducing an  'active_users' refcount_t to the
per-file struct debugfs_fsdata state. At file creation time, it is set to
one and a debugfs_remove() will drop that initial reference. The new
debugfs_file_get() and debugfs_file_put(), intended to be used in place of
former debugfs_use_file_start() and debugfs_use_file_finish(), increment
and decrement it respectively. Once the count drops to zero,
debugfs_file_put() will signal a completion which is possibly being waited
for from debugfs_remove().
Thus, as long as there is a debugfs_file_get() not yet matched by a
corresponding debugfs_file_put() around, debugfs_remove() will block.

Actual users of debugfs_use_file_start() and -finish() will get converted
to the new debugfs_file_get() and debugfs_file_put() by followup patches.

Fixes: 49d200deaa ("debugfs: prevent access to removed files' private data")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
7c8d469877 debugfs: add support for more elaborate ->d_fsdata
Currently, the user provided fops, "real_fops", are stored directly into
->d_fsdata.

In order to be able to store more per-file state and thus prepare for more
granular file removal protection, wrap the real_fops into a dynamically
allocated container struct, debugfs_fsdata.

A struct debugfs_fsdata gets allocated at file creation and freed from the
newly intoduced ->d_release().

Finally, move the implementation of debugfs_real_fops() out of the public
debugfs header such that struct debugfs_fsdata's declaration can be kept
private.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Linus Torvalds
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Linus Torvalds
b8d4c1f9f4 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem updates from Al Viro:
 "Assorted normal VFS / filesystems stuff..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dentry name snapshots
  Make statfs properly return read-only state after emergency remount
  fs/dcache: init in_lookup_hashtable
  minix: Deinline get_block, save 2691 bytes
  fs: Reorder inode_owner_or_capable() to avoid needless
  fs: warn in case userspace lied about modprobe return
2017-07-08 10:50:54 -07:00
Al Viro
49d31c2f38 dentry name snapshots
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified).  In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
	struct name_snapshot s;

	take_dentry_name_snapshot(&s, dentry);
	...
	access s.name
	...
	release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-07 20:09:10 -04:00
David Howells
c3d98ea082 VFS: Don't use save/replace_mount_options if not using generic_show_options
btrfs, debugfs, reiserfs and tracefs call save_mount_options() and reiserfs
calls replace_mount_options(), but they then implement their own
->show_options() methods and don't touch s_options, rendering the saved
options unnecessary.  I'm trying to eliminate s_options to make it easier
to implement a context-based mount where the mount options can be passed
individually over a file descriptor.

Remove the calls to save/replace_mount_options() call in these cases.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chris Mason <clm@fb.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: linux-btrfs@vger.kernel.org
cc: reiserfs-devel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:31:46 -04:00
Mauro Carvalho Chehab
e1511a840a fs: fix the location of the kernel-api book
The kernel-api book is now part of the core-api. Update its
location.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:23 -03:00
Mauro Carvalho Chehab
e1b4fc7add fs: update location of filesystems documentation
The filesystem documentation was moved from DocBook to
Documentation/filesystems/. Update it at the sources.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:22 -03:00
Eric Biggers
cda37124f4 fs: constify tree_descr arrays passed to simple_fill_super()
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory.  Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Linus Torvalds
f1ef09fde1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "There is a lot here. A lot of these changes result in subtle user
  visible differences in kernel behavior. I don't expect anything will
  care but I will revert/fix things immediately if any regressions show
  up.

  From Seth Forshee there is a continuation of the work to make the vfs
  ready for unpriviled mounts. We had thought the previous changes
  prevented the creation of files outside of s_user_ns of a filesystem,
  but it turns we missed the O_CREAT path. Ooops.

  Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
  standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only
  children that are forked after the prctl are considered and not
  children forked before the prctl. The only known user of this prctl
  systemd forks all children after the prctl. So no userspace
  regressions will occur. Holding earlier forked children to the same
  rules as later forked children creates a semantic that is sane enough
  to allow checkpoing of processes that use this feature.

  There is a long delayed change by Nikolay Borisov to limit inotify
  instances inside a user namespace.

  Michael Kerrisk extends the API for files used to maniuplate
  namespaces with two new trivial ioctls to allow discovery of the
  hierachy and properties of namespaces.

  Konstantin Khlebnikov with the help of Al Viro adds code that when a
  network namespace exits purges it's sysctl entries from the dcache. As
  in some circumstances this could use a lot of memory.

  Vivek Goyal fixed a bug with stacked filesystems where the permissions
  on the wrong inode were being checked.

  I continue previous work on ptracing across exec. Allowing a file to
  be setuid across exec while being ptraced if the tracer has enough
  credentials in the user namespace, and if the process has CAP_SETUID
  in it's own namespace. Proc files for setuid or otherwise undumpable
  executables are now owned by the root in the user namespace of their
  mm. Allowing debugging of setuid applications in containers to work
  better.

  A bug I introduced with permission checking and automount is now
  fixed. The big change is to mark the mounts that the kernel initiates
  as a result of an automount. This allows the permission checks in sget
  to be safely suppressed for this kind of mount. As the permission
  check happened when the original filesystem was mounted.

  Finally a special case in the mount namespace is removed preventing
  unbounded chains in the mount hash table, and making the semantics
  simpler which benefits CRIU.

  The vfs fix along with related work in ima and evm I believe makes us
  ready to finish developing and merge fully unprivileged mounts of the
  fuse filesystem. The cleanups of the mount namespace makes discussing
  how to fix the worst case complexity of umount. The stacked filesystem
  fixes pave the way for adding multiple mappings for the filesystem
  uids so that efficient and safer containers can be implemented"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc/sysctl: Don't grab i_lock under sysctl_lock.
  vfs: Use upper filesystem inode in bprm_fill_uid()
  proc/sysctl: prune stale dentries during unregistering
  mnt: Tuck mounts under others instead of creating shadow/side mounts.
  prctl: propagate has_child_subreaper flag to every descendant
  introduce the walk_process_tree() helper
  nsfs: Add an ioctl() to return owner UID of a userns
  fs: Better permission checking for submounts
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  vfs: open() with O_CREAT should not create inodes with unknown ids
  nsfs: Add an ioctl() to return the namespace type
  proc: Better ownership of files for non-dumpable tasks in user namespaces
  exec: Remove LSM_UNSAFE_PTRACE_CAP
  exec: Test the ptracer's saved cred to see if the tracee can gain caps
  exec: Don't reset euid and egid when the tracee has CAP_SETUID
  inotify: Convert to using per-namespace limits
2017-02-23 20:33:51 -08:00