11a3122f6c "block: strip out locking optimization in put_io_context()"
removed ioc_lock depth lockdep annoation along with locking
optimization; however, while recursing from put_io_context() is no
longer possible, ioc_release_fn() may still end up putting the last
reference of another ioc through elevator, which wlil grab ioc->lock
triggering spurious (as the ioc is always different one) A-A deadlock
warning.
As this can only happen one time from ioc_release_fn(), using non-zero
subclass from ioc_release_fn() is enough. Use subclass 1.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We create "bsg" link if q->kobj.sd is not NULL, so remove it only
when the same condition is true.
Fixes:
WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
sysfs: can not remove 'bsg', no directory
Call Trace:
[<c0429683>] warn_slowpath_common+0x6a/0x7f
[<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
[<c042970b>] warn_slowpath_fmt+0x2b/0x2f
[<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
[<c053969a>] sysfs_remove_link+0x20/0x23
[<c05d88f1>] bsg_unregister_queue+0x40/0x6d
[<c0692263>] __scsi_remove_device+0x31/0x9d
[<c069149f>] scsi_forget_host+0x41/0x52
[<c0689fa9>] scsi_remove_host+0x71/0xe0
[<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
[<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
[<c06c29de>] usb_unbind_interface+0x4e/0x109
[<c067a80f>] __device_release_driver+0x6b/0xa6
[<c067a861>] device_release_driver+0x17/0x22
[<c067a46a>] bus_remove_device+0xd6/0xe6
[<c06785e2>] device_del+0xf2/0x137
[<c06c101f>] usb_disable_device+0x94/0x1a0
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.
For regular merges, bypass ensures elvpriv count to reach zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion. Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.
elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.
This is to prepare for plug merge update and doesn't introduce any
behavior change.
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.
While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq_slice_expired will change saved_workload_slice. It should be called
first so saved_workload_slice is correctly set to 0 after workload type
is changed.
This fixes the code order changed by 54b466e44b.
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...
Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)
This reverts commit 274193224c.
We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
will pass the command to the underlying block device. This is
well-known, but it is also a large security problem when (via Unix
permissions, ACLs, SELinux or a combination thereof) a program or user
needs to be granted access only to part of the disk.
This patch lets partitions forward a small set of harmless ioctls;
others are logged with printk so that we can see which ioctls are
actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred.
Of course it was being sent to a (partition on a) hard disk, so it would
have failed with ENOTTY and the patch isn't changing anything in
practice. Still, I'm treating it specially to avoid spamming the logs.
In principle, this restriction should include programs running with
CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and
/dev/sdb, it still should not be able to read/write outside the
boundaries of /dev/sda2 independent of the capabilities. However, for
now programs with CAP_SYS_RAWIO will still be allowed to send the
ioctls. Their actions will still be logged.
This patch does not affect the non-libata IDE driver. That driver
however already tests for bd != bd->bd_contains before issuing some
ioctl; it could be restricted further to forbid these ioctls even for
programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Make it also print the command name when warning - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce an ioctl which permits applications to query whether a block
device is rotational.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.
This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.
Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.
ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument"). It should be translated as ENOTTY
("Inappropriate ioctl for device").
That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad. We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually. In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.
Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 5e081591 "block: warn if tag is greater than real_max_depth"
cleaned up blk_queue_end_tag() to warn when the tag is truly invalid
(greater than real_max_depth). However, it changed behavior in the tag <
max_depth case to not end the request. Leading to triggering of
BUG_ON(blk_queued_rq(rq)) in the request completion path:
http://marc.info/?l=linux-kernel&m=132204370518629&w=2
In order to allow blk_queue_resize_tags() to shrink the tag space
blk_queue_end_tag() must always complete tags with a value less than
real_max_depth regardless of the current max_depth. The comment about
"handling the shrink case" seems to be what prompted changes in this
space, so remove it and BUG on all invalid tags (made even simpler by
Matthew's suggestion to use an unsigned compare).
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Reported-by: Meelis Roos <mroos@ut.ee>
Reported-by: Ed Nadolski <edmund.nadolski@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>