Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull openat2 support from Al Viro:
 "This is the openat2() series from Aleksa Sarai.

  I'm afraid that the rest of namei stuff will have to wait - it got
  zero review the last time I'd posted #work.namei, and there had been a
  leak in the posted series I'd caught only last weekend. I was going to
  repost it on Monday, but the window opened and the odds of getting any
  review during that... Oh, well.

  Anyway, openat2 part should be ready; that _did_ get sane amount of
  review and public testing, so here it comes"

From Aleksa's description of the series:
 "For a very long time, extending openat(2) with new features has been
  incredibly frustrating. This stems from the fact that openat(2) is
  possibly the most famous counter-example to the mantra "don't silently
  accept garbage from userspace" -- it doesn't check whether unknown
  flags are present[1].

  This means that (generally) the addition of new flags to openat(2) has
  been fraught with backwards-compatibility issues (O_TMPFILE has to be
  defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
  kernels gave errors, since it's insecure to silently ignore the
  flag[2]). All new security-related flags therefore have a tough road
  to being added to openat(2).

  Furthermore, the need for some sort of control over VFS's path
  resolution (to avoid malicious paths resulting in inadvertent
  breakouts) has been a very long-standing desire of many userspace
  applications.

  This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
  (which was a variant of David Drysdale's O_BENEATH patchset[4] which
  was a spin-off of the Capsicum project[5]) with a few additions and
  changes made based on the previous discussion within [6] as well as
  others I felt were useful.

  In line with the conclusions of the original discussion of
  AT_NO_JUMPS, the flag has been split up into separate flags. However,
  instead of being an openat(2) flag it is provided through a new
  syscall openat2(2) which provides several other improvements to the
  openat(2) interface (see the patch description for more details). The
  following new LOOKUP_* flags are added:

  LOOKUP_NO_XDEV:

     Blocks all mountpoint crossings (upwards, downwards, or through
     absolute links). Absolute pathnames alone in openat(2) do not
     trigger this. Magic-link traversal which implies a vfsmount jump is
     also blocked (though magic-link jumps on the same vfsmount are
     permitted).

  LOOKUP_NO_MAGICLINKS:

     Blocks resolution through /proc/$pid/fd-style links. This is done
     by blocking the usage of nd_jump_link() during resolution in a
     filesystem. The term "magic-links" is used to match with the only
     reference to these links in Documentation/, but I'm happy to change
     the name.

     It should be noted that this is different to the scope of
     ~LOOKUP_FOLLOW in that it applies to all path components. However,
     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
     will *not* fail (assuming that no parent component was a
     magic-link), and you will have an fd for the magic-link.

     In order to correctly detect magic-links, the introduction of a new
     LOOKUP_MAGICLINK_JUMPED state flag was required.

  LOOKUP_BENEATH:

     Disallows escapes to outside the starting dirfd's
     tree, using techniques such as ".." or absolute links. Absolute
     paths in openat(2) are also disallowed.

     Conceptually this flag is to ensure you "stay below" a certain
     point in the filesystem tree -- but this requires some additional
     to protect against various races that would allow escape using
     "..".

     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
     can trivially beam you around the filesystem (breaking the
     protection). In future, there might be similar safety checks done
     as in LOOKUP_IN_ROOT, but that requires more discussion.

  In addition, two new flags are added that expand on the above ideas:

  LOOKUP_NO_SYMLINKS:

     Does what it says on the tin. No symlink resolution is allowed at
     all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
     can still be used with NOFOLLOW to open an fd for the symlink as
     long as no parent path had a symlink component.

  LOOKUP_IN_ROOT:

     This is an extension of LOOKUP_BENEATH that, rather than blocking
     attempts to move past the root, forces all such movements to be
     scoped to the starting point. This provides chroot(2)-like
     protection but without the cost of a chroot(2) for each filesystem
     operation, as well as being safe against race attacks that
     chroot(2) is not.

     If a race is detected (as with LOOKUP_BENEATH) then an error is
     generated, and similar to LOOKUP_BENEATH it is not permitted to
     cross magic-links with LOOKUP_IN_ROOT.

     The primary need for this is from container runtimes, which
     currently need to do symlink scoping in userspace[7] when opening
     paths in a potentially malicious container.

     There is a long list of CVEs that could have bene mitigated by
     having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
     CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
     few).

  In order to make all of the above more usable, I'm working on
  libpathrs[8] which is a C-friendly library for safe path resolution.
  It features a userspace-emulated backend if the kernel doesn't support
  openat2(2). Hopefully we can get userspace to switch to using it, and
  thus get openat2(2) support for free once it's ready.

  Future work would include implementing things like
  RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
  programs to be sure they don't hit DoSes though stale NFS handles)"

* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Documentation: path-lookup: include new LOOKUP flags
  selftests: add openat2(2) selftests
  open: introduce openat2(2) syscall
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: allow set_root() to produce errors
  namei: allow nd_jump_link() to produce errors
  nsfs: clean-up ns_get_path() signature to return int
  namei: only return -ECHILD from follow_dotdot_rcu()
This commit is contained in:
Linus Torvalds
2020-01-29 11:20:24 -08:00
44 changed files with 1700 additions and 120 deletions

View File

@@ -3302,7 +3302,9 @@ S: France
N: Aleksa Sarai
E: cyphar@cyphar.com
W: https://www.cyphar.com/
D: `pids` cgroup subsystem
D: /sys/fs/cgroup/pids
D: openat2(2)
S: Sydney, Australia
N: Dipankar Sarma
E: dipankar@in.ibm.com

View File

@@ -13,6 +13,7 @@ It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
- ``openat2()`` resolution restriction flags.
Introduction to pathname lookup
===============================
@@ -235,6 +236,13 @@ renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
``rename_lock`` is also used to detect and defend against potential attacks
against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
the parent directory is moved outside the root, bypassing the ``path_equal()``
check). If ``rename_lock`` is updated during the lookup and the path encounters
a "..", a potential attack occurred and ``handle_dots()`` will bail out with
``-EAGAIN``.
inode->i_rwsem
~~~~~~~~~~~~~~
@@ -348,6 +356,13 @@ any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
``mount_lock`` is also used to detect and defend against potential attacks
against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
the parent directory is moved outside the root, bypassing the ``path_equal()``
check). If ``mount_lock`` is updated during the lookup and the path encounters
a "..", a potential attack occurred and ``handle_dots()`` will bail out with
``-EAGAIN``.
RCU
~~~
@@ -405,6 +420,10 @@ is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
with a ``chroot()`` system call.
It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
The root is needed when either of two conditions holds: (1) either the
pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1168,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
the stack frame discarded.
The other case involves things in ``/proc`` that look like symlinks but
aren't really::
aren't really (and are therefore commonly referred to as "magic-links")::
$ ls -l /proc/self/fd/1
lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1286,7 +1305,9 @@ A few flags
A suitable way to wrap up this tour of pathname walking is to list
the various flags that can be stored in the ``nameidata`` to guide the
lookup process. Many of these are only meaningful on the final
component, others reflect the current state of the pathname lookup.
component, others reflect the current state of the pathname lookup, and some
apply restrictions to all path components encountered in the path lookup.
And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
the others. If this is not set, an empty pathname causes an error
very early on. If it is set, empty pathnames are not considered to be
@@ -1310,13 +1331,48 @@ longer needed.
``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
filesystem has not been asked to revalidate the name (with
``d_revalidate()``). In such cases the inode may still need to be
revalidated, so ``d_op->d_weak_revalidate()`` is called if
or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
link"). In this case the filesystem has not been asked to revalidate the
name (with ``d_revalidate()``). In such cases the inode may still need
to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``LOOKUP_JUMPED`` is set when the look completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
Resolution-restriction flags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to allow userspace to protect itself against certain race conditions
and attack scenarios involving changing path components, a series of flags are
available which apply restrictions to all path components encountered during
path lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
This is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
relates to restricting the following of trailing symlinks.
``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
ensure that they return errors from ``nd_jump_link()``, because that is how
``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
lookup is determined by the first mountpoint the path lookup reaches --
absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
``vfsmount`` of the path is unchanged.
``LOOKUP_BENEATH`` blocks any path components which resolve outside the
starting point of the resolution. This is done by blocking ``nd_jump_root()``
as well as blocking ".." if it would jump outside the starting point.
``rename_lock`` and ``mount_lock`` are used to detect attacks against the
resolution of "..". Magic-links are also blocked.
``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
were the filesystem root. ``nd_jump_root()`` brings the resolution back to to
the starting point, and ".." at the starting point will act as a no-op. As with
``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
attacks against ".." resolution. Magic-links are also blocked.
Final-component flags
~~~~~~~~~~~~~~~~~~~~~

View File

@@ -6458,6 +6458,7 @@ F: fs/*
F: include/linux/fs.h
F: include/linux/fs_types.h
F: include/uapi/linux/fs.h
F: include/uapi/linux/openat2.h
FINTEK F75375S HARDWARE MONITOR AND FAN CONTROLLER DRIVER
M: Riku Voipio <riku.voipio@iki.fi>

View File

@@ -475,3 +475,4 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
547 common openat2 sys_openat2

View File

@@ -449,3 +449,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
437 common openat2 sys_openat2

View File

@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
#define __NR_compat_syscalls 436
#define __NR_compat_syscalls 438
#endif
#define __ARCH_WANT_SYS_CLONE

View File

@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)
/*
* Please add new compat syscalls above this comment and update

View File

@@ -356,3 +356,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
437 common openat2 sys_openat2

View File

@@ -435,3 +435,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 __sys_clone3
437 common openat2 sys_openat2

View File

@@ -441,3 +441,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
437 common openat2 sys_openat2

View File

@@ -374,3 +374,4 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
435 n32 clone3 __sys_clone3
437 n32 openat2 sys_openat2

View File

@@ -350,3 +350,4 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
435 n64 clone3 __sys_clone3
437 n64 openat2 sys_openat2

View File

@@ -423,3 +423,4 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
435 o32 clone3 __sys_clone3
437 o32 openat2 sys_openat2

View File

@@ -433,3 +433,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
437 common openat2 sys_openat2

View File

@@ -517,3 +517,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 nospu clone3 ppc_clone3
437 common openat2 sys_openat2

View File

@@ -438,3 +438,4 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
437 common openat2 sys_openat2 sys_openat2

View File

@@ -438,3 +438,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
437 common openat2 sys_openat2

View File

@@ -481,3 +481,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
437 common openat2 sys_openat2

View File

@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
437 i386 openat2 sys_openat2 __ia32_sys_openat2

View File

@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
437 common openat2 __x64_sys_openat2
#
# x32-specific system call numbers start at 512 to avoid cache impact

Some files were not shown because too many files have changed in this diff Show More