Block layer used to merge requests and bios with different failfast
settings. This caused regular IOs to fail prematurely when they were
merged into failfast requests for readahead.
Niel Lambrechts could trigger the problem semi-reliably on ext4 when
resuming from STR. ext4 uses readahead when reading inodes and
combined with the deterministic extra SATA PHY exception cycle during
resume on the specific configuration, non-readahead inode read would
fail causing ext4 errors. Please read the following thread for
details.
http://lkml.org/lkml/2009/5/23/21
This patch makes block layer reject merging if the failfast settings
don't match. This is correct but likely to lower IO performance by
preventing regular IOs from mingling into surrounding readahead
requests. Changes to allow such mixed merges and handle errors
correctly will be added later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
With the changes for falling back to an oom_cfqq, we never fail
to find/allocate a queue in cfq_get_queue(). So remove the check.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The next_ordered flag is only meaningful for devices that use __make_request.
So move the test against next_ordered out of generic code and in to
__make_request
Since this test was added, barriers have not worked on md or any
devices that don't use __make_request and so don't bother to set
next_ordered. (dm explicitly sets something other than
QUEUE_ORDERED_NONE since
commit 99360b4c18
but notes in the comments that it is otherwise meaningless).
Cc: Ken Milmore <ken.milmore@googlemail.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The initial patches to support this through sysfs export were broken
and have been if 0'ed out in any release. So lets just kill the code
and reclaim some space in struct request_queue, if anyone would later
like to fixup the sysfs bits, the git history can easily restore
the removed bits.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs. Each bip slab
has an embedded bio_vec array at the end. This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version. Only the largest bip slab is backed by a mempool. The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
Setup an emergency fallback cfqq that we allocate at IO scheduler init
time. If the slab allocation fails in cfq_find_alloc_queue(), we'll just
punt IO to that cfqq instead. This ensures that cfq_find_alloc_queue()
never fails without having to ensure free memory.
On cfqq lookup, always try to allocate a new cfqq if the given cfq io
context has the oom_cfqq assigned. This ensures that we only temporarily
punt to this shared queue.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We're going to be needing that init code outside of that function
to get rid of the __GFP_NOFAIL in cfqq allocation.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The SMP handler (sas_smp_request) was fixed to use the block API
properly, so we don't need this workaround to avoid blk_put_request()
warning.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Warning(block/blk-settings.c:108): No description found for parameter 'lim'
Warning(block/blk-settings.c:108): Excess function parameter 'limits' description in 'blk_set_default_limits'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".
Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
debugfs: use specified mode to possibly mark files read/write only
debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
xen: remove driver_data direct access of struct device from more drivers
usb: gadget: at91_udc: remove driver_data direct access of struct device
uml: remove driver_data direct access of struct device
block/ps3: remove driver_data direct access of struct device
s390: remove driver_data direct access of struct device
parport: remove driver_data direct access of struct device
parisc: remove driver_data direct access of struct device
of_serial: remove driver_data direct access of struct device
mips: remove driver_data direct access of struct device
ipmi: remove driver_data direct access of struct device
infiniband: ehca: remove driver_data direct access of struct device
ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
hvcs: remove driver_data direct access of struct device
xen block: remove driver_data direct access of struct device
thermal: remove driver_data direct access of struct device
scsi: remove driver_data direct access of struct device
pcmcia: remove driver_data direct access of struct device
PCIE: remove driver_data direct access of struct device
...
Manually fix up trivial conflicts due to different direct driver_data
direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
When porting blktrace to tracepoints, we changed to trace/block.h
for trace prober declarations.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DM reuses the request queue when swapping in a new device table
Introduce blk_set_default_limits() which can be used to reset the the
queue_limits prior to stacking devices.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I noticed a blank line in blktrace output. This patch fixes that.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds support to the BSG driver to report the proper device name to
userspace for the bsg devices.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This adds support for block drivers to report their requested nodename
to userspace. It also updates a number of block drivers to provide the
needed subdirectory and device name to be used for them.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (28 commits)
ide-tape: fix debug call
alim15x3: Remove historical hacks, re-enable init_hwif for PowerPC
ide-dma: don't reset request fields on dma_timeout_retry()
ide: drop rq->data handling from ide_map_sg()
ide-atapi: kill unused fields and callbacks
ide-tape: simplify read/write functions
ide-tape: use byte size instead of sectors on rw issue functions
ide-tape: unify r/w init paths
ide-tape: kill idetape_bh
ide-tape: use standard data transfer mechanism
ide-tape: use single continuous buffer
ide-atapi,tape,floppy: allow ->pc_callback() to change rq->data_len
ide-tape,floppy: fix failed command completion after request sense
ide-pm: don't abuse rq->data
ide-cd,atapi: use bio for internal commands
ide-atapi: convert ide-{floppy,tape} to using preallocated sense buffer
ide-cd: convert to using generic sense request
ide: add helpers for preparing sense requests
ide-cd: don't abuse rq->buffer
ide-atapi: don't abuse rq->buffer
...
This patch adds the following 2 interfaces for request-stacking drivers:
- blk_rq_prep_clone(struct request *clone, struct request *orig,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio*, void *),
void *data)
* Clones bios in the original request to the clone request
(bio_ctr is called for each cloned bios.)
* Copies attributes of the original request to the clone request.
The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
copied.
- blk_rq_unprep_clone(struct request *clone)
* Frees cloned bios from the clone request.
Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.
To allocate request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.
For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.
NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request. Pages are shared between original bios and cloned bios.
So caller must not complete the original request before the clone
request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
Currently io_context has an atomic_t(32-bit) as refcount. In the case of
cfq, for each device against whcih a task does I/O, a reference to the
io_context would be taken. And when there are multiple process sharing
io_contexts(CLONE_IO) would also have a reference to the same io_context.
Theoretically the possible maximum number of processes sharing the same
io_context + the number of disks/cfq_data referring to the same io_context
can overflow the 32-bit counter on a very high-end machine.
Even though it is an improbable case, let us make it atomic_long_t.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>