Commit Graph

506 Commits

Author SHA1 Message Date
Linus Torvalds
0cbee99269 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace updates from Eric Biederman:
 "Long ago and far away when user namespaces where young it was realized
  that allowing fresh mounts of proc and sysfs with only user namespace
  permissions could violate the basic rule that only root gets to decide
  if proc or sysfs should be mounted at all.

  Some hacks were put in place to reduce the worst of the damage could
  be done, and the common sense rule was adopted that fresh mounts of
  proc and sysfs should allow no more than bind mounts of proc and
  sysfs.  Unfortunately that rule has not been fully enforced.

  There are two kinds of gaps in that enforcement.  Only filesystems
  mounted on empty directories of proc and sysfs should be ignored but
  the test for empty directories was insufficient.  So in my tree
  directories on proc, sysctl and sysfs that will always be empty are
  created specially.  Every other technique is imperfect as an ordinary
  directory can have entries added even after a readdir returns and
  shows that the directory is empty.  Special creation of directories
  for mount points makes the code in the kernel a smidge clearer about
  it's purpose.  I asked container developers from the various container
  projects to help test this and no holes were found in the set of mount
  points on proc and sysfs that are created specially.

  This set of changes also starts enforcing the mount flags of fresh
  mounts of proc and sysfs are consistent with the existing mount of
  proc and sysfs.  I expected this to be the boring part of the work but
  unfortunately unprivileged userspace winds up mounting fresh copies of
  proc and sysfs with noexec and nosuid clear when root set those flags
  on the previous mount of proc and sysfs.  So for now only the atime,
  read-only and nodev attributes which userspace happens to keep
  consistent are enforced.  Dealing with the noexec and nosuid
  attributes remains for another time.

  This set of changes also addresses an issue with how open file
  descriptors from /proc/<pid>/ns/* are displayed.  Recently readlink of
  /proc/<pid>/fd has been triggering a WARN_ON that has not been
  meaningful since it was added (as all of the code in the kernel was
  converted) and is not now actively wrong.

  There is also a short list of issues that have not been fixed yet that
  I will mention briefly.

  It is possible to rename a directory from below to above a bind mount.
  At which point any directory pointers below the renamed directory can
  be walked up to the root directory of the filesystem.  With user
  namespaces enabled a bind mount of the bind mount can be created
  allowing the user to pick a directory whose children they can rename
  to outside of the bind mount.  This is challenging to fix and doubly
  so because all obvious solutions must touch code that is in the
  performance part of pathname resolution.

  As mentioned above there is also a question of how to ensure that
  developers by accident or with purpose do not introduce exectuable
  files on sysfs and proc and in doing so introduce security regressions
  in the current userspace that will not be immediately obvious and as
  such are likely to require breaking userspace in painful ways once
  they are recognized"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  vfs: Remove incorrect debugging WARN in prepend_path
  mnt: Update fs_fully_visible to test for permanently empty directories
  sysfs: Create mountpoints with sysfs_create_mount_point
  sysfs: Add support for permanently empty directories to serve as mount points.
  kernfs: Add support for always empty directories.
  proc: Allow creating permanently empty directories that serve as mount points
  sysctl: Allow creating permanently empty directories that serve as mountpoints.
  fs: Add helper functions for permanently empty directories.
  vfs: Ignore unlocked mounts in fs_fully_visible
  mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
  mnt: Refactor the logic for mounting sysfs and proc in a user namespace
2015-07-03 15:20:57 -07:00
Eric W. Biederman
87d2846fcf sysfs: Add support for permanently empty directories to serve as mount points.
Add two functions sysfs_create_mount_point and
sysfs_remove_mount_point that hang a permanently empty directory off
of a kobject or remove a permanently emptpy directory hanging from a
kobject.  Export these new functions so modular filesystems can use
them.

Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:45 -05:00
Vladimir Zapolskiy
eaa5cd9263 fs: sysfs: don't pass count == 0 to bin file readers
If count == 0 bytes are requested by a reader, sysfs_kf_bin_read()
deliberately returns 0 without passing a potentially harmful value to
some externally defined underlying battr->read() function.

However in case of (pos == size && count) the next clause always sets
count to 0 and this value is handed over to battr->read().

The change intends to make obsolete (and remove later) a redundant
sanity check in battr->read(), if it is present, or add more
protection to struct bin_attribute users, who does not care about
input arguments.

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-01 10:17:17 +09:00
Antonio Ospite
ed1dc8a894 sysfs: disambiguate between "error code" and "failure" in comments
The sentence "Returns 0 on success or error" might be misinterpreted as
"the function will always returns 0", make it less ambiguous.

Also, use the word "failure" as the contrary of "success".

Signed-off-by: Antonio Ospite <ao2@ao2.it>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-24 12:31:33 -07:00
Eric W. Biederman
1b852bceb0 mnt: Refactor the logic for mounting sysfs and proc in a user namespace
Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount.  Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-05-13 21:44:11 -05:00
Vivien Didelot
d8bf8c92e8 sysfs: Only accept read/write permissions for file attributes
For sysfs file attributes, only read and write permissions make sense.
Mask provided attribute permissions accordingly and send a warning
to the console if invalid permission bits are set.

This patch is originally from Guenter [1] and includes the fixup
explained in the thread, that is printing permissions in octal format
and limiting the scope of attributes to SYSFS_PREALLOC | 0664.

[1] https://lkml.org/lkml/2015/1/19/599

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-25 13:27:57 +01:00
Guenter Roeck
da4759c73b sysfs: Use only return value from is_visible for the file mode
Up to now, is_visible can only be used to either remove visibility
of a file entirely or to add permissions, but not to reduce permissions.
This makes it impossible, for example, to use DEVICE_ATTR_RW to define
file attributes and reduce permissions to read-only.

This behavior is undesirable and unnecessarily complicates code which
needs to reduce permissions; instead of just returning the desired
permissions, it has to ensure that the permissions in the attribute
variable declaration only reflect the minimal permissions ever needed.

Change semantics of is_visible to only use the permissions returned
from it instead of oring the returned value with the hard-coded
permissions.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-25 13:27:57 +01:00
Linus Torvalds
9682ec9692 Merge tag 'driver-core-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg KH:
 "Really tiny set of patches for this kernel.  Nothing major, all
  described in the shortlog and have been in linux-next for a while"

* tag 'driver-core-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: fix warning when creating a sysfs group without attributes
  firmware_loader: handle timeout via wait_for_completion_interruptible_timeout()
  firmware_loader: abort request if wait_for_completion is interrupted
  firmware: Correct function name in comment
  device: Change dev_<level> logging functions to return void
  device: Fix dev_dbg_once macro
2015-02-15 11:11:47 -08:00
Tejun Heo
dfeb0750b6 kernfs: remove KERNFS_STATIC_NAME
When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
making a separate copy of its name.  It's currently only used for sysfs
attributes whose filenames are required to stay accessible and unchanged.
There are rare exceptions where these names are allocated and formatted
dynamically but for the vast majority of cases they're consts in the
rodata section.

Now that kernfs is converted to use kstrdup_const() and kfree_const(),
there's little point in keeping KERNFS_STATIC_NAME around.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:36 -08:00
Javi Merino
adf305f778 sysfs: fix warning when creating a sysfs group without attributes
When attempting to create a gropu without attrs, the warning prints the
name of the group.  However, the check for name being a NULL pointer is
wrong: it uses the pointer to the name when it's NULL.  Fix it to use
the name if present, otherwise just put an empty string.

Cc: Bruno Prémont <bonbons@linux-vserver.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-03 15:50:31 -08:00
NeilBrown
4ef67a8c95 sysfs/kernfs: make read requests on pre-alloc files use the buffer.
To match the previous patch which used the pre-alloc buffer for
writes, this patch causes reads to use the same buffer.
This is not strictly necessary as the current seq_read() will allocate
on first read, so user-space can trigger the required pre-alloc.  But
consistency is valuable.

The read function is somewhat simpler than seq_read() and, for example,
does not support reading from an offset into the file: reads must be
at the start of the file.

As seq_read() does not use the prealloc buffer, ->seq_show is
incompatible with ->prealloc and caused an EINVAL return from open().
sysfs code which calls into kernfs always chooses the correct function.

As the buffer is shared with writes and other reads, the mutex is
extended to cover the copy_to_user.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-07 10:54:38 -08:00
NeilBrown
2b75869bba sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
md/raid allows metadata management to be performed in user-space.
A various times, particularly on device failure, the metadata needs
to be updated before further writes can be permitted.
This means that the user-space program which updates metadata much
not block on writeout, and so must not allocate memory.

mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
memory allocation issues for user-memory, but that does not help
kernel memory.
Several kernel objects can be pre-allocated.  e.g. files opened before
any writes to the array are permitted.
However some kernel allocation happens in places that cannot be
pre-allocated.
In particular, writes to sysfs files (to tell md that it can now
allow writes to the array) allocate a buffer using GFP_KERNEL.

This patch allows attributes to be marked as "PREALLOC".  In that case
the maximal buffer is allocated when the file is opened, and then used
on each write instead of allocating a new buffer.

As the same buffer is now shared for all writes on the same file
description, the mutex is extended to cover full use of the buffer
including the copy_from_user().

The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
inspected by sysfs_add_file_mode_ns() to determine if the file should be
marked as requiring prealloc.

Despite the comment, we *do* use ->seq_show together with ->prealloc
in this patch.  The next patch fixes that.

Signed-off-by: NeilBrown  <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-07 10:53:25 -08:00
Vladimir Zapolskiy
0936896056 fs: sysfs: return EGBIG on write if offset is larger than file size
According to the user expectations common utilities like dd or sh
redirection operator > should work correctly over binary files from
sysfs. At the moment doing excessive write can not be completed:

  write(1, "\0\0\0\0\0\0\0\0", 8)         = 4
  write(1, "\0\0\0\0", 4)                 = 0
  write(1, "\0\0\0\0", 4)                 = 0
  write(1, "\0\0\0\0", 4)                 = 0
  ...

Fix the problem by returning EFBIG described in man 2 write.

Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-07 10:52:20 -08:00
Jianyu Zhan
26fc9cd200 kernfs: move the last knowledge of sysfs out from kernfs
There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-27 14:33:17 -07:00
Robert ABEL
9f70a40128 sysfs: fix attribute_group bin file path on removal
Cody Schafer already fixed binary file creation for attribute groups, see [1].
This patch makes the appropriate changes for binary file removal
of attribute groups.
[1]: http://lkml.org/lkml/2014/2/27/832

Signed-off-by: Robert ABEL <rabel@cit-ec.uni-bielefeld.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-27 14:33:17 -07:00
Tejun Heo
f5c16f29bf sysfs: make sure read buffer is zeroed
13c589d5b0 ("sysfs: use seq_file when reading regular files")
switched sysfs from custom read implementation to seq_file to enable
later transition to kernfs.  After the change, the buffer passed to
->show() is acquired through seq_get_buf(); unfortunately, this
introduces a subtle behavior change.  Before the commit, the buffer
passed to ->show() was always zero as it was allocated using
get_zeroed_page().  Because seq_file doesn't clear buffers on
allocation and neither does seq_get_buf(), after the commit, depending
on the behavior of ->show(), we may end up exposing uninitialized data
to userland thus possibly altering userland visible behavior and
leaking information.

Fix it by explicitly clearing the buffer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ron <ron@debian.org>
Fixes: 13c589d5b0 ("sysfs: use seq_file when reading regular files")
Cc: stable <stable@vger.kernel.org> # 3.13+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-20 10:15:53 +09:00
Tejun Heo
555724a831 kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs
The kernfs open method - kernfs_fop_open() - inherited extra
permission checks from sysfs.  While the vfs layer allows ignoring the
read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
sysfs explicitly denied open regardless of the cap if the file doesn't
have any of the UGO perms of the requested access or doesn't implement
the requested operation.  It can be debated whether this was a good
idea or not but the behavior is too subtle and dangerous to change at
this point.

After cgroup got converted to kernfs, this extra perm check also got
applied to cgroup breaking libcgroup which opens write-only files with
O_RDWR as root.  This patch gates the extra open permission check with
a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
For sysfs, nothing changes.  For cgroup, root now can perform any
operation regardless of the permissions as it was before kernfs
conversion.  Note that kernfs still fails unimplemented operations
with -EINVAL.

While at it, add comments explaining KERNFS_ROOT flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Wagin <avagin@gmail.com>
Tested-by: Andrey Wagin <avagin@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
Fixes: 2bd59d48eb ("cgroup: convert to kernfs")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13 13:21:40 +02:00
Tejun Heo
33ac1257ff sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
All device_schedule_callback_owner() users are converted to use
device_remove_file_self().  Remove now unused
{sysfs|device}_schedule_callback_owner().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-16 11:56:33 -07:00
Greg Kroah-Hartman
72099304ee Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
This reverts commit d1ba277e79.

As reported by Stephen, this patch breaks linux-next as a ppc patch
suddenly (after 2 years) started using this old api call.  So revert it
for now, it will go away in 3.15-rc2 when we can change the PPC call to
the new api.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-25 20:54:57 -07:00
Greg Kroah-Hartman
13df797743 Merge 3.14-rc5 into driver-core-next
We want the fixes in here.
2014-03-02 20:09:08 -08:00
Li Zefan
fed95bab8d sysfs: fix namespace refcnt leak
As mount() and kill_sb() is not a one-to-one match, we shoudn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
 ---
 fs/kernfs/mount.c      | 8 +++++++-
 fs/sysfs/mount.c       | 5 +++--
 include/linux/kernfs.h | 9 +++++----
 3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-25 07:37:52 -08:00
Cody P Schafer
aabaf4c205 sysfs: create bin_attributes under the requested group
bin_attributes created/updated in create_files() (such as those listed
via (struct device).attribute_groups) were not placed under the
specified group, and instead appeared in the base kobj directory.

Fix this by making bin_attributes use creating code similar to normal
attributes.

A quick grep shows that no one is using bin_attrs in a named attribute
group yet, so we can do this without breaking anything in usespace.

Note that I do not add is_visible() support to
bin_attributes, though that could be done as well.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-15 12:14:55 -08:00
Tejun Heo
ba341d55a4 kernfs: add CONFIG_KERNFS
As sysfs was kernfs's only user, kernfs has been piggybacking on
CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
soon.  Introduce a separate config option CONFIG_KERNFS which is to be
selected by kernfs users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:08:57 -08:00
Tejun Heo
3eef34ad7d kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends
kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name()		- format the node's name into the specified buffer
* kernfs_path()		- format the node's path into the specified buffer
* pr_cont_kernfs_name()	- pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path()	- pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent()	- pin and return a node's parent

All can be called under any context.  The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
    static inline making it cause a lot of build warnings.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:05:35 -08:00
Tejun Heo
d35258ef70 kernfs: allow nodes to be created in the deactivated state
Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes.  In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.

This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API.  Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about lockding dependency on active_ref draining.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00