Commit Graph

564 Commits

Author SHA1 Message Date
Jens Axboe
d6de8be711 cfq-iosched: fix RCU problem in cfq_cic_lookup()
cfq_cic_lookup() needs to properly protect ioc->ioc_data before
dereferencing it and also exclude updaters of ioc->ioc_data as well.

Also add a number of comments documenting why the existing RCU usage
is OK.

Thanks a lot to "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> for
review and comments!

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:28 +02:00
Jens Axboe
64565911cd block: make blktrace use per-cpu buffers for message notes
Currently it uses a single static char array, but that risks
being corrupted when multiple users issue message notes at the
same time. Make the buffers dynamically allocated when the trace
is setup and make them per-cpu instead.

The default max message size of 1k is also very large, the
interface is mainly for small text notes. So shrink it to 128 bytes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Alan D. Brunelle
4722dc52a8 Added in elevator switch message to blktrace stream
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Alan D. Brunelle
9d5f09a424 Added in MESSAGE notes for blktraces
Allows messages to be inserted into blktrace streams.

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Richard Kennedy
be754d2c21 block: reorder cfq_queue to save space on 64bit builds
saves 8 bytes of padding & increases objects/slab from 30 to 32 on my
AMD64 config

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Zhang, Yanmin
05caf8dbc1 block: Move the second call to get_request to the end of the loop
In function get_request_wait, the second call to get_request could be
moved to the end of the while loop, because if the first call to
get_request fails, the second call will fail without sleep.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Neil Brown
e7e72bf641 Remove blkdev warning triggered by using md
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.

For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q->__queue_lock.  So always initialise that lock when allocated.

With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.

Thanks to Dan Williams for review and fixing the silly bugs.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:15 -07:00
Kay Sievers
30f2f0eb4b block: do_mounts - accept root=<non-existant partition>
Some devices, like md, may create partitions only at first access,
so allow root= to be set to a valid non-existant partition of an
existing disk. This applies only to non-initramfs root mounting.

This fixes a regression from 2.6.24 which did allow this to happen and
broke some users machines :(

Acked-by: Neil Brown <neilb@suse.de>
Tested-by: Joao Luis Meloni Assirati <assirati@nonada.if.usp.br>
Cc: stable <stable@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-14 10:37:57 -07:00
Jean Delvare
f36f21ecca Fix misuses of bdevname()
bdevname() fills the buffer that it is given as a parameter, so calling
strcpy() or snprintf() on the returned value is redundant (and probably not
guaranteed to work - I don't think strcpy and snprintf support overlapping
buffers.)

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:26 -07:00
Jens Axboe
28f13702f0 block: avoid duplicate calls to get_part() in disk stat code
get_part() is fairly expensive, as it O(N) loops over partitions
to find the right one. In lots of normal IO paths we end up looking
up the partition twice, to make matters even worse. Change the
stat add code to accept a passed in partition instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 10:15:46 +02:00
Jens Axboe
6d63c27557 cfq-iosched: make io priorities inherit CPU scheduling class as well as nice
We currently set all processes to the best-effort scheduling class,
regardless of what CPU scheduling class they belong to. Improve that
so that we correctly track idle and rt scheduling classes as well.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 09:51:23 +02:00
Jens Axboe
dbaf2c003e block: optimize generic_unplug_device()
Original patch from Mikulas Patocka <mpatocka@redhat.com>

Mike Anderson was doing an OLTP benchmark on a computer with 48 physical
disks mapped to one logical device via device mapper.

He found that there was a slowdown on request_queue->lock in function
generic_unplug_device. The slowdown is caused by the fact that when some
code calls unplug on the device mapper, device mapper calls unplug on all
physical disks. These unplug calls take the lock, find that the queue is
already unplugged, release the lock and exit.

With the below patch, performance of the benchmark was increased by 18%
(the whole OLTP application, not just block layer microbenchmarks).

So I'm submitting this patch for upstream. I think the patch is correct,
because when more threads call simultaneously plug and unplug, it is
unspecified, if the queue is or isn't plugged (so the patch can't make
this worse). And the caller that plugged the queue should unplug it
anyway. (if it doesn't, there's 3ms timeout).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 09:48:17 +02:00
Jens Axboe
2cdf79cafb block: get rid of likely/unlikely predictions in merge logic
They tend to depend a lot on the workload, so not a clear-cut
likely or unlikely fit.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 09:33:55 +02:00
Jens Axboe
07416d29bc cfq-iosched: fix RCU race in the cfq io_context destructor handling
put_io_context() drops the RCU read lock before calling into cfq_dtor(),
however we need to hold off freeing there before grabbing and
dereferencing the first object on the list.

So extend the rcu_read_lock() scope to cover the calling of cfq_dtor(),
and optimize cfq_free_io_context() to use a new variant for
call_for_each_cic() that assumes the RCU read lock is already held.

Hit in the wild by Alexey Dobriyan <adobriyan@gmail.com>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 09:28:57 +02:00
Jens Axboe
aa94b5371f block: adjust tagging function queue bit locking
For most initialization purposes, calling blk_queue_init_tags() without
the queue lock held is OK. Only if called for resizing an existing map
must the lock be held. Ditto for tag cleanup, the maps are reference
counted.

So switch the general queue flag setting to the unlocked variant, but
retain the locked variant for resizing.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 09:27:43 +02:00
Jens Axboe
bf0f97025c block: sysfs store function needs to grab queue_lock and use queue_flag_*()
Concurrency isn't a big deal here since we have requests in flight
at this point, but do the locked variant to set a better example.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07 09:09:39 +02:00
Linus Torvalds
d626e3bf72 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6:
  [SCSI] aic94xx: fix section mismatch
  [SCSI] u14-34f: Fix 32bit only problem
  [SCSI] dpt_i2o: sysfs code
  [SCSI] dpt_i2o: 64 bit support
  [SCSI] dpt_i2o: move from virt_to_bus/bus_to_virt to dma_alloc_coherent
  [SCSI] dpt_i2o: use standard __init / __exit code
  [SCSI] megaraid_sas: fix suspend/resume sections
  [SCSI] aacraid: Add Power Management support
  [SCSI] aacraid: Fix jbod operations scan issues
  [SCSI] aacraid: Fix warning about macro side-effects
  [SCSI] add support for variable length extended commands
  [SCSI] Let scsi_cmnd->cmnd use request->cmd buffer
  [SCSI] bsg: add large command support
  [SCSI] aacraid: Fix down_interruptible() to check the return value correctly
  [SCSI] megaraid_sas; Update the Version and Changelog
  [SCSI] ibmvscsi: Handle non SCSI error status
  [SCSI] bug fix for free list handling
  [SCSI] ipr: Rename ipr's state scsi host attribute to prevent collisions
  [SCSI] megaraid_mbox: fix Dell CERC firmware problem
2008-05-02 13:52:35 -07:00
Boaz Harrosh
db4742dd8f [SCSI] add support for variable length extended commands
Add support for variable-length, extended, and vendor specific
CDBs to scsi-ml. It is now possible for initiators and ULD's
to issue these types of commands. LLDs need not change much.
All they need is to raise the .max_cmd_len to the longest command
they support (see iscsi patch).

- clean-up some code paths that did not expect commands to be
  larger than 16, and change cmd_len members' type to short as
  char is not enough.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-05-02 11:33:25 -05:00
FUJITA Tomonori
9f5de6b105 [SCSI] bsg: add large command support
This enables bsg to handle the request length larger than BLK_MAX_CDB
(mainly for the variable length CDB format).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-05-02 10:17:35 -05:00
Harvey Harrison
24c03d47d0 block: remove remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:02 -07:00
Peter Zijlstra
cf0ca9fe5d mm: bdi: export BDI attributes in sysfs
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.

In particular this properly exposes the read-ahead window for all relevant
users and /sys/block/<block>/queue/read_ahead_kb should be deprecated.

With patient help from Kay Sievers and Greg KH

[mszeredi@suse.cz]

 - split off NFS and FUSE changes into separate patches
 - document new sysfs attributes under Documentation/ABI
 - do bdi_class_init as a core_initcall, otherwise the "default" BDI
   won't be initialized
 - remove bdi_init_fmt macro, it's not used very much

[akpm@linux-foundation.org: fix ia64 warning]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Greg KH <greg@kroah.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:49 -07:00
Alan D. Brunelle
ac9fafa124 block: Skip I/O merges when disabled
The block I/O + elevator + I/O scheduler code spend a lot of time trying
to merge I/Os -- rightfully so under "normal" circumstances. However,
if one were to know that the incoming I/O stream was /very/ random in
nature, the cycles are wasted.

This patch adds a per-request_queue tunable that (when set) disables
merge attempts (beyond the simple one-hit cache check), thus freeing up
a non-trivial amount of CPU cycles.

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
d7e3c3249e block: add large command support
This patch changes rq->cmd from the static array to a pointer to
support large commands.

We rarely handle large commands. So for optimization, a struct request
still has a static array for a command. rq_init sets rq->cmd pointer
to the static array.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
d34c87e4ba block: replace sizeof(rq->cmd) with BLK_MAX_CDB
This is a preparation for changing rq->cmd from the static array to a
pointer.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
2a4aa30c5f block: rename and export rq_init()
This rename rq_init() blk_rq_init() and export it. Any path that hands
the request to the block layer needs to call it to initialize the
request.

This is a preparation for large command support, which needs to
initialize the request in a proper way (that is, just doing a memset()
will not work).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00