Commit Graph

1221 Commits

Author SHA1 Message Date
Brian King
be14eb6191 block: Range check cpu in blk_cpu_to_group
While testing CPU DLPAR, the following problem was discovered.
We were DLPAR removing the first CPU, which in this case was
logical CPUs 0-3. CPUs 0-2 were already marked offline and
we were in the process of offlining CPU 3. After marking
the CPU inactive and offline in cpu_disable, but before the
cpu was completely idle (cpu_die), we ended up in __make_request
on CPU 3. There we looked at the topology map to see which CPU
to complete the I/O on and found no CPUs in the cpu_sibling_map.
This resulted in the block layer setting the completion cpu
to be NR_CPUS, which then caused an oops when we tried to
complete the I/O.

Fix this by sanity checking the value we return from blk_cpu_to_group
to be a valid cpu value.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 09:03:21 +02:00
Jens Axboe
5dd531a03a block: add function call to switch the IO scheduler from a driver
Currently drivers must do an elevator_exit() + elevator_init()
to switch IO schedulers. There are a few problems with this:

- Since commit 1abec4fdbb,
  elevator_init() requires a zeroed out q->elevator
  pointer. The two existing in-kernel users don't do that.

- It will only work at initialization time, since using the
  above two-staged construct does not properly quisce the queue.

So add elevator_change() which takes care of this, and convert
the elv_iosched_store() sysfs interface to use this helper as well.

Reported-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reported-by: Kevin Vigor <kevin@vigor.nu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 13:52:19 +02:00
Jiri Slaby
5e00d1b5b4 BLOCK: fix bio.bi_rw handling
Return of the bi_rw tests is no longer bool after commit 74450be1. But
results of such tests are stored in bools. This doesn't fit in there
for some compilers (gcc 4.5 here), so either use !! magic to get real
bools or use ulong where the result is assigned somewhere.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 12:33:10 +02:00
Xiaotian Feng
c87ffbb812 block: put dev->kobj in blk_register_queue fail path
kernel needs to kobject_put on dev->kobj if elv_register_queue fails.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 12:30:29 +02:00
Vivek Goyal
c4e7893ebc cfq-iosched: blktrace print per slice sector stats
o Divyesh had gotten rid of this code in the past. I want to re-introduce it
  back as it helps me a lot during debugging.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Divyesh Shah <dpshah@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 12:25:03 +02:00
Vivek Goyal
80bdf0c78f cfq-iosched: Implement tunable group_idle
o Implement a new tunable group_idle, which allows idling on the group
  instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
  on the individual queues but idle on the group. This way on fast storage
  we can get fairness between groups at the same time overall throughput
  improves.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 12:24:26 +02:00
Vivek Goyal
02b35081fc cfq-iosched: Do group share accounting in IOPS when slice_idle=0
o Implement another CFQ mode where we charge group in terms of number
  of requests dispatched instead of measuring the time. Measuring in terms
  of time is not possible when we are driving deeper queue depths and there
  are requests from multiple cfq queues in the request queue.

o This mode currently gets activated if one sets slice_idle=0 and associated
  disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
  most of the queues will dispatch 1 or more requests and then cfq queue
  expiry happens and we don't have a way to measure time. So start providing
  fairness in terms of IOPS.

o Currently IOPS mode works only with cfq group scheduling. CFQ is following
  different scheduling algorithms for queue and group scheduling. These IOPS
  stats are used only for group scheduling hence in non-croup mode nothing
  should change.

o For CFQ group scheduling one can disable slice idling so that we don't idle
  on queue and drive deeper request queue depths (achieving better throughput),
  at the same time group idle is enabled so one should get service
  differentiation among groups.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 12:23:53 +02:00
Vivek Goyal
b6508c1618 cfq-iosched: Do not idle if slice_idle=0
Do not idle either on cfq queue or service tree if slice_idle=0. User does
not want any queue or service tree idling. Currently even if slice_idle=0,
we were waiting for request to finish before expiring the queue and that
can lead to lower queue depths.

Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 12:23:33 +02:00
Ciju Rajan K
96aa1b419d blkio: Fix return code for mkdir calls
If the cgroup hierarchy for blkio control groups is deeper than two
levels, kernel should not allow the creation of further levels. mkdir
system call does not except EINVAL as a return value. This patch
replaces EINVAL with more appropriate EPERM

Signed-off-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-23 10:56:30 +02:00
Adrian Hunter
8d57a98ccd block: add secure discard
Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:30 -07:00
Dmitry Monakhov
18edc8eaa6 blkdev: fix blkdev_issue_zeroout return value
- If function called without barrier option retvalue is incorrect

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-08 12:31:08 -04:00
ike Snitzer
3383977fad block: update request stacking methods to support discards
Propagate REQ_DISCARD in cmd_flags when cloning a discard request.
Skip blk_rq_check_limits's existing checks for discard requests because
discard limits will have already been checked in blkdev_issue_discard.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-08 12:11:33 -04:00
FUJITA Tomonori
16f2319fd6 block: set up rq->rq_disk properly for flush requests
q->bar_rq.rq_disk is NULL. Use the rq_disk of the original request
instead.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:52:41 +02:00
FUJITA Tomonori
28e18d0188 block: set REQ_TYPE_FS on flush requests
the block layer doesn't set rq->cmd_type on flush requests. By
definition, it should be REQ_TYPE_FS (the lower layers build a command
and interpret the result of it, that is, the block layer doesn't know
the details).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:52:40 +02:00
Jens Axboe
10d1f9e2cc block: fix problem with sending down discard that isn't of correct granularity
If the queue doesn't have a limit set, or it just set UINT_MAX like
we default to, we coud be sending down a discard request that isn't
of the correct granularity if the block size is > 512b.

Fix this by adjusting max_discard_sectors down to the proper
alignment.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:26:33 +02:00
Dave Chinner
f10d9f617a blkdev: check for valid request queue before issuing flush
Issuing a blkdev_issue_flush() on an unconfigured loop device causes a panic as
q->make_request_fn is not configured. This can occur when trying to mount the
unconfigured loop device as an XFS filesystem. There are no guards that catch
the bio before the request function is called because we don't add a payload to
the bio. Instead, manually check this case as soon as we have a pointer to the
queue to flush.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:26:29 +02:00
Arnd Bergmann
15392efb9d block: remove BKL from partition ioctls
The blkpg_ioctl and blkdev_reread_part access fields of
the bdev and gendisk structures, yet they always do so
under the protection of bdev->bd_mutex, which seems
sufficient.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
cked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:26:08 +02:00
Arnd Bergmann
6de4370310 block: remove BKL from BLKROSET and BLKFLSBUF
We only call the functions set_device_ro(),
invalidate_bdev(), sync_filesystem() and sync_blockdev()
while holding the BKL in these commands. All
of these are also done in other code paths without
the BKL, which leads me to the conclusion that
the BKL is not needed here either.

The reason we hold it here is that it was originally
pushed down into the ioctl function from vfs_ioctl.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:26:08 +02:00
Arnd Bergmann
62c2a7d969 block: push BKL into blktrace ioctls
The blktrace driver currently needs the BKL, but
we should not need to take that in the block layer,
so just push it down into the driver itself.

It is quite likely that the BKL is not actually
required in blktrace code and could be removed
in a follow-on patch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:26:08 +02:00
Arnd Bergmann
8a6cfeb6de block: push down BKL into .locked_ioctl
As a preparation for the removal of the big kernel
lock in the block layer, this removes the BKL
from the common ioctl handling code, moving it
into every single driver still using it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:25:00 +02:00
FUJITA Tomonori
00fff26539 block: remove q->prepare_flush_fn completely
This removes q->prepare_flush_fn completely (changes the
blk_queue_ordered API).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:24:15 +02:00
FUJITA Tomonori
b6a903151d block: permit PREFLUSH and POSTFLUSH without prepare_flush_fn
This is preparation for removing q->prepare_flush_fn.

Temporarily, blk_queue_ordered() permits QUEUE_ORDERED_DO_PREFLUSH and
QUEUE_ORDERED_DO_POSTFLUSH without prepare_flush_fn.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:23:56 +02:00
FUJITA Tomonori
8749534fe6 block: introduce REQ_FLUSH flag
SCSI-ml needs a way to mark a request as flush request in
q->prepare_flush_fn because it needs to identify them later (e.g. in
q->request_fn or prep_rq_fn).

queue_flush sets REQ_HARDBARRIER in rq->cmd_flags however the block
layer also sends normal REQ_TYPE_FS requests with REQ_HARDBARRIER. So
SCSI-ml can't use REQ_HARDBARRIER to identify flush requests.

We could change the block layer to clear REQ_HARDBARRIER bit before
sending non flush requests to the lower layers. However, intorudcing
the new flag looks cleaner (surely easier).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:23:53 +02:00
James Bottomley
28018c242a block: implement an unprep function corresponding directly to prep
Reviewed-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:23:47 +02:00
Jens Axboe
3ffb52e73b block: fixup missing conversion from BIO_RW_DISCARD to REQ_DISCARD
Didn't cause a merge conflict, so fixed this one up manually
post merge.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:23:41 +02:00