kernel

mirror of https://github.com/ukui/kernel.git synced 2026-03-09 10:07:04 -07:00

Author	SHA1	Message	Date
Linus Torvalds	3aa2fc1667	Merge tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core update for 4.7-rc1. Mostly just debugfs changes, the long-known and messy races with removing debugfs files should be fixed thanks to the great work of Nicolai Stange. We also have some isa updates in here (the x86 maintainers told me to take it through this tree), a new warning when we run out of dynamic char major numbers, and a few other assorted changes, details in the shortlog. All have been in linux-next for some time with no reported issues" * tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits) Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case" gpio: ws16c48: Utilize the ISA bus driver gpio: 104-idio-16: Utilize the ISA bus driver gpio: 104-idi-48: Utilize the ISA bus driver gpio: 104-dio-48e: Utilize the ISA bus driver watchdog: ebc-c384_wdt: Utilize the ISA bus driver iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros iio: stx104: Add X86 dependency to STX104 Kconfig option Documentation: Add ISA bus driver documentation isa: Implement the max_num_isa_dev macro isa: Implement the module_isa_driver macro pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS isa: Decouple X86_32 dependency from the ISA Kconfig option driver-core: use 'dev' argument in dev_dbg_ratelimited stub base: dd: don't remove driver_data in -EPROBE_DEFER case kernfs: Move faulting copy_user operations outside of the mutex devcoredump: add scatterlist support debugfs: unproxify files created through debugfs_create_u32_array() debugfs: unproxify files created through debugfs_create_blob() debugfs: unproxify files created through debugfs_create_bool() ...	2016-05-20 21:26:15 -07:00
Serge E. Hallyn	4f41fc5962	cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup \| grep freezer)" cat /proc/self/mountinfo \| grep freezer \| awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2016-05-09 12:15:03 -04:00
Chris Wilson	e4234a1fc3	kernfs: Move faulting copy_user operations outside of the mutex A fault in a user provided buffer may lead anywhere, and lockdep warns that we have a potential deadlock between the mm->mmap_sem and the kernfs file mutex: [ 82.811702] ====================================================== [ 82.811705] [ INFO: possible circular locking dependency detected ] [ 82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted [ 82.811711] ------------------------------------------------------- [ 82.811714] kms_setmode/5859 is trying to acquire lock: [ 82.811717] (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270 [ 82.811731] but task is already holding lock: [ 82.811734] (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0 [ 82.811745] which lock already depends on the new lock. [ 82.811749] the existing dependency chain (in reverse order) is: [ 82.811752] -> #3 (&mm->mmap_sem){++++++}: [ 82.811761] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0 [ 82.811766] [<ffffffff8118bc65>] __might_fault+0x75/0xa0 [ 82.811771] [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180 [ 82.811787] [<ffffffff811d1023>] __vfs_write+0x23/0xe0 [ 82.811792] [<ffffffff811d1d74>] vfs_write+0xa4/0x190 [ 82.811797] [<ffffffff811d2c14>] SyS_write+0x44/0xb0 [ 82.811801] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73 [ 82.811807] -> #2 (s_active#6){++++.+}: [ 82.811814] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0 [ 82.811819] [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0 [ 82.811823] [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0 [ 82.811828] [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20 [ 82.811832] [<ffffffff815318d4>] device_del+0x124/0x250 [ 82.811837] [<ffffffff81531a19>] device_unregister+0x19/0x60 [ 82.811841] [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0 [ 82.811846] [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70 [ 82.811851] [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0 [ 82.811856] [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10 [ 82.811860] [<ffffffff810786de>] cpu_notify+0x1e/0x40 [ 82.811865] [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20 [ 82.811869] [<ffffffff81078ac3>] _cpu_down+0x233/0x340 [ 82.811874] [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350 [ 82.811878] [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50 [ 82.811883] [<ffffffff810d3903>] pm_suspend+0x543/0x8d0 [ 82.811888] [<ffffffff810d1b77>] state_store+0x77/0xe0 [ 82.811892] [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20 [ 82.811897] [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50 [ 82.811902] [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180 [ 82.811906] [<ffffffff811d1023>] __vfs_write+0x23/0xe0 [ 82.811910] [<ffffffff811d1d74>] vfs_write+0xa4/0x190 [ 82.811914] [<ffffffff811d2c14>] SyS_write+0x44/0xb0 [ 82.811918] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73 [ 82.811923] -> #1 (cpu_hotplug.lock){+.+.+.}: [ 82.811929] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0 [ 82.811933] [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0 [ 82.811940] [<ffffffff810784c1>] get_online_cpus+0x61/0x80 [ 82.811944] [<ffffffff811170eb>] stop_machine+0x1b/0xe0 [ 82.811949] [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915] [ 82.812009] [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915] [ 82.812045] [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915] [ 82.812081] [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915] [ 82.812117] [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915] [ 82.812154] [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915] [ 82.812192] [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915] [ 82.812232] [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915] [ 82.812278] [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915] [ 82.812318] [<ffffffff81512474>] drm_dev_register+0xa4/0xb0 [ 82.812323] [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0 [ 82.812328] [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915] [ 82.812360] [<ffffffff8143f907>] pci_device_probe+0x87/0xf0 [ 82.812366] [<ffffffff81535f89>] driver_probe_device+0x229/0x450 [ 82.812371] [<ffffffff81536233>] __driver_attach+0x83/0x90 [ 82.812375] [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0 [ 82.812380] [<ffffffff81535879>] driver_attach+0x19/0x20 [ 82.812384] [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290 [ 82.812388] [<ffffffff81536e9b>] driver_register+0x5b/0xe0 [ 82.812393] [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60 [ 82.812398] [<ffffffff81514866>] drm_pci_init+0xd6/0x100 [ 82.812402] [<ffffffffa027c094>] 0xffffffffa027c094 [ 82.812406] [<ffffffff810003de>] do_one_initcall+0xae/0x1d0 [ 82.812412] [<ffffffff811595a0>] do_init_module+0x5b/0x1cb [ 82.812417] [<ffffffff81106160>] load_module+0x1c20/0x2480 [ 82.812422] [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0 [ 82.812428] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73 [ 82.812433] -> #0 (&dev->struct_mutex){+.+.+.}: [ 82.812439] [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0 [ 82.812443] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0 [ 82.812456] [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270 [ 82.812460] [<ffffffff81196a14>] mmap_region+0x334/0x580 [ 82.812466] [<ffffffff81196fc4>] do_mmap+0x364/0x410 [ 82.812470] [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0 [ 82.812474] [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220 [ 82.812479] [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20 [ 82.812484] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73 [ 82.812489] other info that might help us debug this: [ 82.812493] Chain exists of: &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem [ 82.812502] Possible unsafe locking scenario: [ 82.812506] CPU0 CPU1 [ 82.812508] ---- ---- [ 82.812510] lock(&mm->mmap_sem); [ 82.812514] lock(s_active#6); [ 82.812519] lock(&mm->mmap_sem); [ 82.812522] lock(&dev->struct_mutex); [ 82.812526] * DEADLOCK * [ 82.812531] 1 lock held by kms_setmode/5859: [ 82.812533] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0 [ 82.812541] stack backtrace: [ 82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1 [ 82.812550] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015 [ 82.812553] 0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270 [ 82.812560] ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90 [ 82.812566] ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350 [ 82.812573] Call Trace: [ 82.812578] [<ffffffff813f8505>] dump_stack+0x67/0x92 [ 82.812582] [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310 [ 82.812586] [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0 [ 82.812590] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0 [ 82.812594] [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270 [ 82.812599] [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270 [ 82.812603] [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270 [ 82.812608] [<ffffffff81196a14>] mmap_region+0x334/0x580 [ 82.812612] [<ffffffff81196fc4>] do_mmap+0x364/0x410 [ 82.812616] [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0 [ 82.812629] [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220 [ 82.812633] [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20 [ 82.812637] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73 Highly unlikely though this scenario is, we can avoid the issue entirely by moving the copy operation from out under the kernfs_get_active() tracking by assigning the preallocated buffer its own mutex. The temporary buffer allocation doesn't require mutex locking as it is entirely local. The locked section was extended by the addition of the preallocated buf to speed up md user operations in commit `2b75869bba` Author: NeilBrown <neilb@suse.de> Date: Mon Oct 13 16:41:28 2014 +1100 sysfs/kernfs: allow attributes to request write buffer be pre-allocated. Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-04-30 10:05:05 -07:00
Aditya Kali	fb3c831565	kernfs: define kernfs_node_dentry Add a new kernfs api is added to lookup the dentry for a particular kernfs path. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2016-02-16 13:04:58 -05:00
Aditya Kali	9f6df573a4	kernfs: Add API to generate relative kernfs path The new function kernfs_path_from_node() generates and returns kernfs path of a given kernfs_node relative to a given parent kernfs_node. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2016-02-16 13:04:58 -05:00
Tejun Heo	bd96f76a24	kernfs: implement kernfs_walk_and_get() Implement kernfs_walk_and_get() which is similar to kernfs_find_and_get() but can walk a path instead of just a name. v2: Use strlcpy() instead of strlen() + memcpy() as suggested by David. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: David Miller <davem@davemloft.net>	2015-11-20 15:55:52 -05:00
Tejun Heo	9acee9c551	kernfs: implement kernfs_path_len() Add a function to determine the path length of a kernfs node. This for now will be used by writeback tracepoint updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-18 15:49:15 -07:00
Linus Torvalds	0cbee99269	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace updates from Eric Biederman: "Long ago and far away when user namespaces where young it was realized that allowing fresh mounts of proc and sysfs with only user namespace permissions could violate the basic rule that only root gets to decide if proc or sysfs should be mounted at all. Some hacks were put in place to reduce the worst of the damage could be done, and the common sense rule was adopted that fresh mounts of proc and sysfs should allow no more than bind mounts of proc and sysfs. Unfortunately that rule has not been fully enforced. There are two kinds of gaps in that enforcement. Only filesystems mounted on empty directories of proc and sysfs should be ignored but the test for empty directories was insufficient. So in my tree directories on proc, sysctl and sysfs that will always be empty are created specially. Every other technique is imperfect as an ordinary directory can have entries added even after a readdir returns and shows that the directory is empty. Special creation of directories for mount points makes the code in the kernel a smidge clearer about it's purpose. I asked container developers from the various container projects to help test this and no holes were found in the set of mount points on proc and sysfs that are created specially. This set of changes also starts enforcing the mount flags of fresh mounts of proc and sysfs are consistent with the existing mount of proc and sysfs. I expected this to be the boring part of the work but unfortunately unprivileged userspace winds up mounting fresh copies of proc and sysfs with noexec and nosuid clear when root set those flags on the previous mount of proc and sysfs. So for now only the atime, read-only and nodev attributes which userspace happens to keep consistent are enforced. Dealing with the noexec and nosuid attributes remains for another time. This set of changes also addresses an issue with how open file descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of /proc/<pid>/fd has been triggering a WARN_ON that has not been meaningful since it was added (as all of the code in the kernel was converted) and is not now actively wrong. There is also a short list of issues that have not been fixed yet that I will mention briefly. It is possible to rename a directory from below to above a bind mount. At which point any directory pointers below the renamed directory can be walked up to the root directory of the filesystem. With user namespaces enabled a bind mount of the bind mount can be created allowing the user to pick a directory whose children they can rename to outside of the bind mount. This is challenging to fix and doubly so because all obvious solutions must touch code that is in the performance part of pathname resolution. As mentioned above there is also a question of how to ensure that developers by accident or with purpose do not introduce exectuable files on sysfs and proc and in doing so introduce security regressions in the current userspace that will not be immediately obvious and as such are likely to require breaking userspace in painful ways once they are recognized" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Remove incorrect debugging WARN in prepend_path mnt: Update fs_fully_visible to test for permanently empty directories sysfs: Create mountpoints with sysfs_create_mount_point sysfs: Add support for permanently empty directories to serve as mount points. kernfs: Add support for always empty directories. proc: Allow creating permanently empty directories that serve as mount points sysctl: Allow creating permanently empty directories that serve as mountpoints. fs: Add helper functions for permanently empty directories. vfs: Ignore unlocked mounts in fs_fully_visible mnt: Modify fs_fully_visible to deal with locked ro nodev and atime mnt: Refactor the logic for mounting sysfs and proc in a user namespace	2015-07-03 15:20:57 -07:00
Eric W. Biederman	ea015218f2	kernfs: Add support for always empty directories. Add a new function kernfs_create_empty_dir that can be used to create directory that can not be modified. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2015-07-01 10:36:43 -05:00
Tejun Heo	fb02915f47	kernfs: make kernfs_get_inode() public Move kernfs_get_inode() prototype from fs/kernfs/kernfs-internal.h to include/linux/kernfs.h. It obtains the matching inode for a kernfs_node. It will be used by cgroup for inode based permission checks for now but is generally useful. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-06-18 16:54:28 -04:00
Tejun Heo	dfeb0750b6	kernfs: remove KERNFS_STATIC_NAME When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid making a separate copy of its name. It's currently only used for sysfs attributes whose filenames are required to stay accessible and unchanged. There are rare exceptions where these names are allocated and formatted dynamically but for the vast majority of cases they're consts in the rodata section. Now that kernfs is converted to use kstrdup_const() and kfree_const(), there's little point in keeping KERNFS_STATIC_NAME around. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-13 21:21:36 -08:00
NeilBrown	2b75869bba	sysfs/kernfs: allow attributes to request write buffer be pre-allocated. md/raid allows metadata management to be performed in user-space. A various times, particularly on device failure, the metadata needs to be updated before further writes can be permitted. This means that the user-space program which updates metadata much not block on writeout, and so must not allocate memory. mlockall(MCL_CURRENT\|MCL_FUTURE) and pre-allocation can avoid all memory allocation issues for user-memory, but that does not help kernel memory. Several kernel objects can be pre-allocated. e.g. files opened before any writes to the array are permitted. However some kernel allocation happens in places that cannot be pre-allocated. In particular, writes to sysfs files (to tell md that it can now allow writes to the array) allocate a buffer using GFP_KERNEL. This patch allows attributes to be marked as "PREALLOC". In that case the maximal buffer is allocated when the file is opened, and then used on each write instead of allocating a new buffer. As the same buffer is now shared for all writes on the same file description, the mutex is extended to cover full use of the buffer including the copy_from_user(). The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is inspected by sysfs_add_file_mode_ns() to determine if the file should be marked as requiring prealloc. Despite the comment, we do use ->seq_show together with ->prealloc in this patch. The next patch fixes that. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-11-07 10:53:25 -08:00
Linus Torvalds	40f6123737	Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Mostly fixes for the fallouts from the recent cgroup core changes. The decoupled nature of cgroup dynamic hierarchy management (hierarchies are created dynamically on mount but may or may not be reused once unmounted depending on remaining usages) led to more ugliness being added to kernfs. Hopefully, this is the last of it" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: break kernfs active protection in cpuset_write_resmask() cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() kernfs: introduce kernfs_pin_sb() cgroup: fix mount failure in a corner case cpuset,mempolicy: fix sleeping function called from invalid context cgroup: fix broken css_has_online_children()	2014-07-10 11:38:23 -07:00
Tejun Heo	ecca47ce82	kernfs: kernfs_notify() must be useable from non-sleepable contexts `d911d98748` ("kernfs: make kernfs_notify() trigger inotify events too") added fsnotify triggering to kernfs_notify() which requires a sleepable context. There are already existing users of kernfs_notify() which invoke it from an atomic context and in general it's silly to require a sleepable context for triggering a notification. The following is an invalid context bug triggerd by md invoking sysfs_notify() from IO completion path. BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1 2 locks held by swapper/1/0: #0: (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk] #1: (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240 irq event stamp: 33518 hardirqs last enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230 hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72 softirqs last enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50 softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c 0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000 0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2 Call Trace: <IRQ> [<ffffffff81807b4c>] dump_stack+0x4d/0x66 [<ffffffff810d4f14>] __might_sleep+0x184/0x240 [<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440 [<ffffffff812d76a0>] kernfs_notify+0x90/0x150 [<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240 [<ffffffffa00de863>] close_write+0x93/0xb0 [raid1] [<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1] [<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1] [<ffffffff813acb8b>] bio_endio+0x6b/0xa0 [<ffffffff813b46c4>] blk_update_request+0x94/0x420 [<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70 [<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk] [<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120 [<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30 [<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk] [<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring] [<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340 [<ffffffff8111647d>] handle_irq_event+0x3d/0x60 [<ffffffff81119436>] handle_edge_irq+0x66/0x130 [<ffffffff8101c3e4>] handle_irq+0x84/0x150 [<ffffffff818146ad>] do_IRQ+0x4d/0xe0 [<ffffffff818122f2>] common_interrupt+0x72/0x72 <EOI> [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10 [<ffffffff81025454>] default_idle+0x24/0x230 [<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20 [<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0 [<ffffffff8104df1b>] start_secondary+0x25b/0x300 This patch fixes it by punting the notification delivery through a work item. This ends up adding an extra pointer to kernfs_elem_attr enlarging kernfs_node by a pointer, which is not ideal but not a very big deal either. If this turns out to be an actual issue, we can move kernfs_elem_attr->size to kernfs_node->iattr later. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-07-02 09:32:09 -07:00
Li Zefan	4e26445faa	kernfs: introduce kernfs_pin_sb() kernfs_pin_sb() tries to get a refcnt of the superblock. This will be used by cgroupfs. v2: - make kernfs_pin_sb() return the superblock. - drop kernfs_drop_sb(). tj: Updated the comment a bit. [ This is a prerequisite for a bugfix. ] Cc: <stable@vger.kernel.org> # 3.15 Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-30 10:16:25 -04:00
Jianyu Zhan	26fc9cd200	kernfs: move the last knowledge of sysfs out from kernfs There is still one residue of sysfs remaining: the sb_magic SYSFS_MAGIC. However this should be kernfs user specific, so this patch moves it out. Kerrnfs user should specify their magic number while mouting. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-05-27 14:33:17 -07:00
Greg Kroah-Hartman	cbfef53360	Merge 3.15-rc6 into driver-core-next We want the kernfs fixes in this branch as well for testing. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-05-23 10:13:53 +09:00
Tejun Heo	555724a831	kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs The kernfs open method - kernfs_fop_open() - inherited extra permission checks from sysfs. While the vfs layer allows ignoring the read/write permissions checks if the issuer has CAP_DAC_OVERRIDE, sysfs explicitly denied open regardless of the cap if the file doesn't have any of the UGO perms of the requested access or doesn't implement the requested operation. It can be debated whether this was a good idea or not but the behavior is too subtle and dangerous to change at this point. After cgroup got converted to kernfs, this extra perm check also got applied to cgroup breaking libcgroup which opens write-only files with O_RDWR as root. This patch gates the extra open permission check with a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs. For sysfs, nothing changes. For cgroup, root now can perform any operation regardless of the permissions as it was before kernfs conversion. Note that kernfs still fails unimplemented operations with -EINVAL. While at it, add comments explaining KERNFS_ROOT flags. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrey Wagin <avagin@gmail.com> Tested-by: Andrey Wagin <avagin@gmail.com> Cc: Li Zefan <lizefan@huawei.com> References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com Fixes: `2bd59d48eb` ("cgroup: convert to kernfs") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-05-13 13:21:40 +02:00
Tejun Heo	7d568a8383	kernfs: implement kernfs_root->supers list Currently, there's no way to find out which super_blocks are associated with a given kernfs_root. Let's implement it - the planned inotify extension to kernfs_notify() needs it. Make kernfs_super_info point back to the super_block and chain it at kernfs_root->supers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-04-25 11:43:31 -07:00
Tejun Heo	b7ce40cff0	kernfs: cache atomic_write_len in kernfs_open_file While implementing atomic_write_len, `4d3773c4bb` ("kernfs: implement kernfs_ops->atomic_write_len") moved data copy from userland inside kernfs_get_active() and kernfs_open_file->mutex so that kernfs_ops->atomic_write_len can be accessed before copying buffer from userland; unfortunately, this could lead to locking order inversion involving mmap_sem if copy_from_user() takes a page fault. ====================================================== [ INFO: possible circular locking dependency detected ] 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G W ------------------------------------------------------- trinity-c236/10658 is trying to acquire lock: (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 but task is already holding lock: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){++++++}: [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<mm/memory.c:4188>] might_fault+0x7e/0xb0 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 -> #0 (&of->mutex#2){+.+.+.}: [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mm->mmap_sem); lock(&of->mutex#2); lock(&mm->mmap_sem); lock(&of->mutex#2); * DEADLOCK * 1 lock held by trinity-c236/10658: #0: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0 stack backtrace: CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000 0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8 ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8 Call Trace: [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560 [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430 [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0 [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0 [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 Fix it by caching atomic_write_len in kernfs_open_file during open so that it can be determined without accessing kernfs_ops in kernfs_fop_write(). This restores the structure of kernfs_fop_write() before `4d3773c4bb` with updated @len determination logic. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> References: http://lkml.kernel.org/g/53113485.2090407@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-03-08 22:08:29 -08:00
Greg Kroah-Hartman	13df797743	Merge 3.14-rc5 into driver-core-next We want the fixes in here.	2014-03-02 20:09:08 -08:00
Li Zefan	fed95bab8d	sysfs: fix namespace refcnt leak As mount() and kill_sb() is not a one-to-one match, we shoudn't get ns refcnt unconditionally in sysfs_mount(), and instead we should get the refcnt only when kernfs_mount() allocated a new superblock. v2: - Changed the name of the new argument, suggested by Tejun. - Made the argument optional, suggested by Tejun. v3: - Make the new argument as second-to-last arg, suggested by Tejun. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> --- fs/kernfs/mount.c \| 8 +++++++- fs/sysfs/mount.c \| 5 +++-- include/linux/kernfs.h \| 9 +++++---- 3 files changed, 15 insertions(+), 7 deletions(-) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-25 07:37:52 -08:00
Tejun Heo	ba341d55a4	kernfs: add CONFIG_KERNFS As sysfs was kernfs's only user, kernfs has been piggybacking on CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very soon. Introduce a separate config option CONFIG_KERNFS which is to be selected by kernfs users. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 16:08:57 -08:00
Tejun Heo	3eef34ad7d	kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends kernfs_node->parent and ->name are currently marked as "published" indicating that kernfs users may access them directly; however, those fields may get updated by kernfs_rename[_ns]() and unrestricted access may lead to erroneous values or oops. Protect ->parent and ->name updates with a irq-safe spinlock kernfs_rename_lock and implement the following accessors for these fields. * kernfs_name() - format the node's name into the specified buffer * kernfs_path() - format the node's path into the specified buffer * pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer) * pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer) * kernfs_get_parent() - pin and return a node's parent All can be called under any context. The recursive sysfs_pathname() in fs/sysfs/dir.c is replaced with kernfs_path() and sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of dereferencing parent directly. v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing static inline making it cause a lot of build warnings. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 16:05:35 -08:00
Tejun Heo	0c23b2259a	kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename() Implement helpers to determine node from dentry and root from super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL namespace. These generally make sense and will be used by cgroup. v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 16:00:41 -08:00

1 2 3 4

76 Commits