There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kernfs open method - kernfs_fop_open() - inherited extra
permission checks from sysfs. While the vfs layer allows ignoring the
read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
sysfs explicitly denied open regardless of the cap if the file doesn't
have any of the UGO perms of the requested access or doesn't implement
the requested operation. It can be debated whether this was a good
idea or not but the behavior is too subtle and dangerous to change at
this point.
After cgroup got converted to kernfs, this extra perm check also got
applied to cgroup breaking libcgroup which opens write-only files with
O_RDWR as root. This patch gates the extra open permission check with
a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
For sysfs, nothing changes. For cgroup, root now can perform any
operation regardless of the permissions as it was before kernfs
conversion. Note that kernfs still fails unimplemented operations
with -EINVAL.
While at it, add comments explaining KERNFS_ROOT flags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Wagin <avagin@gmail.com>
Tested-by: Andrey Wagin <avagin@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
Fixes: 2bd59d48eb ("cgroup: convert to kernfs")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While updating how mmap enabled kernfs files are handled by lockdep,
9b2db6e189 ("sysfs: bail early from kernfs_file_mmap() to avoid
spurious lockdep warning") inadvertently dropped error return check
from kernfs_file_mmap(). The intention was just dropping "if
(ops->mmap)" check as the control won't reach the point if the mmap
callback isn't implemented, but I mistakenly removed the error return
check together with it.
This led to Xorg crash on i810 which was reported and bisected to the
commit and then to the specific change by Tobias.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-bisected-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
References: http://lkml.kernel.org/g/533D01BD.1010200@googlemail.com
Cc: stable <stable@vger.kernel.org> # 3.14
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently kernfs_link_sibling() increates parent->dir.subdirs before
adding the node into parent's chidren rb tree.
Because it is possible that kernfs_link_sibling() couldn't find
a suitable slot and bail out, this leads to a mismatch between
elevated subdir count with actual children node numbers.
This patches fix this problem, by moving the subdir accouting
after the actual addtion happening.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_notify() is used to indicate either new data is available or
the content of a file has changed. It currently only triggers poll
which may not be the most convenient to monitor especially when there
are a lot to monitor. Let's hook it up to fsnotify too so that the
events can be monitored via inotify too.
fsnotify_modify() requires file * but kernfs_notify() doesn't have any
specific file associated; however, we can walk all super_blocks
associated with a kernfs_root and as kernfs always associate one ino
with inode and one dentry with an inode, it's trivial to look up the
dentry associated with a given kernfs_node. As any active monitor
would pin dentry, just looking up existing dentry is enough. This
patch looks up the dentry associated with the specified kernfs_node
and generates events equivalent to fsnotify_modify().
Note that as fsnotify doesn't provide fsnotify_modify() equivalent
which can be called with dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify(). It might be better to add a wrapper
in fsnotify.h instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.
Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_iattrs is allocated lazily when operations which require it
take place; unfortunately, the lazy allocation and returning weren't
properly synchronized and when there are multiple concurrent
operations, it might end up returning kernfs_iattrs which hasn't
finished initialization yet or different copies to different callers.
Fix it by synchronizing with a mutex. This can be smarter with memory
barriers but let's go there if it actually turns out to be necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 3.14
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge first patch-bomb from Andrew Morton:
- Various misc bits
- kmemleak fixes
- small befs, codafs, cifs, efs, freexxfs, hfsplus, minixfs, reiserfs things
- fanotify
- I appear to have become SuperH maintainer
- ocfs2 updates
- direct-io tweaks
- a bit of the MM queue
- printk updates
- MAINTAINERS maintenance
- some backlight things
- lib/ updates
- checkpatch updates
- the rtc queue
- nilfs2 updates
- Small Documentation/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (237 commits)
Documentation/SubmittingPatches: remove references to patch-scripts
Documentation/SubmittingPatches: update some dead URLs
Documentation/filesystems/ntfs.txt: remove changelog reference
Documentation/kmemleak.txt: updates
fs/reiserfs/super.c: add __init to init_inodecache
fs/reiserfs: move prototype declaration to header file
fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
nilfs2: update project's web site in nilfs2.txt
nilfs2: update MAINTAINERS file entries fix
nilfs2: verify metadata sizes read from disk
nilfs2: add FITRIM ioctl support for nilfs2
nilfs2: add nilfs_sufile_trim_fs to trim clean segs
nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
nilfs2: add nilfs_sufile_set_suinfo to update segment usage
nilfs2: add struct nilfs_suinfo_update and flags
nilfs2: update MAINTAINERS file entries
fs/coda/inode.c: add __init to init_inodecache()
BEFS: logging cleanup
...
Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:
- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.
- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.
- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...
The hash values 0 and 1 are reserved for magic directory entries, but
the code only prevents names hashing to 0. This patch fixes the test
to also prevent hash value 1.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As mount() and kill_sb() is not a one-to-one match, we shoudn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.
v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.
v3:
- Make the new argument as second-to-last arg, suggested by Tejun.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/kernfs/mount.c | 8 +++++++-
fs/sysfs/mount.c | 5 +++--
include/linux/kernfs.h | 9 +++++----
3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection. Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends. Replace cgroup->name and all related code with kernfs
name/path constructs.
* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
of kernfs counterparts, which involves semantic changes.
pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
* cgroup->name handling dropped from cgroup_rename().
* All users of cgroup_name/path() updated to the new semantics. Users
which were formatting the string just to printk them are converted
to use pr_cont_cgroup_name/path() instead, which simplifies things
quite a bit. As cgroup_name() no longer requires RCU read lock
around it, RCU lockings which were protecting only cgroup_name() are
removed.
v2: Comment above oom_info_lock updated as suggested by Michal.
v3: dummy_top doesn't have a kn associated and
pr_cont_cgroup_name/path() ended up calling the matching kernfs
functions with NULL kn leading to oops. Test for NULL kn and
print "/" if so. This issue was reported by Fengguang Wu.
v4: Rebased on top of 0ab02ca8f8 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
3eef34ad7d ("kernfs: implement kernfs_get_parent(),
kernfs_name/path() and friends") restructured kernfs_rename_ns() such
that new name assignment happens under kernfs_rename_lock;
unfortunately, it mistakenly passed NULL to kernfs_name_hash() to
calculate the new hash if the name hasn't changed, which can lead to
oops.
Fix it by using kn->name and kn->ns when calculating the new hash.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter dan.carpenter@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As sysfs was kernfs's only user, kernfs has been piggybacking on
CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
soon. Introduce a separate config option CONFIG_KERNFS which is to be
selected by kernfs users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.
Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.
* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent
All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.
v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Implement helpers to determine node from dentry and root from
super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL
namespace. These generally make sense and will be used by cgroup.
v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed.
Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A write to a kernfs_node is buffered through a kernel buffer. Writes
<= PAGE_SIZE are performed atomically, while larger ones are executed
in PAGE_SIZE chunks. While this is enough for sysfs, cgroup which is
scheduled to be converted to use kernfs needs a bit more control over
it.
This patch adds kernfs_ops->atomic_write_len. If not set (zero), the
behavior stays the same. If set, writes upto the size are executed
atomically and larger writes are rejected with -E2BIG.
A different implementation strategy would be allowing configuring
chunking size while making the original write size available to the
write method; however, such strategy, while being more complicated,
doesn't really buy anything. If the write implementation has to
handle chunking, the specific chunk size shouldn't matter all that
much.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes. In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.
This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API. Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about lockding dependency on active_ref draining.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were
missing kernfs_active() tests before using the found kernfs_node. As
deactivated state is currently visible only while a node is being
removed, this doesn't pose an actual problem. e.g. lookup succeeding
on a deactivated node doesn't harm anything as the eventual file
operations are gonna fail and those failures are indistinguishible
from the cases in which the lookups had happened before the node was
deactivated.
However, we're gonna allow new nodes to be created deactivated and
then activated explicitly by the kernfs user when it sees fit. This
is to support atomically making multiple nodes visible to userland and
thus those nodes must not be visible to userland before activated.
Let's plug the lookup and readdir holes so that deactivated nodes are
invisible to userland.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>