Commit Graph

365 Commits

Author SHA1 Message Date
Linus Torvalds
c1101cbc7d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "This is the bulk of the s390 patches for the 3.11 merge window.

  Notable enhancements are: the block timeout patches for dasd from
  Hannes, and more work on the PCI support front.  In addition some
  cleanup and the usual bug fixing."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
  s390/dasd: Fail all requests when DASD_FLAG_ABORTIO is set
  s390/dasd: Add 'timeout' attribute
  block: check for timeout function in blk_rq_timed_out()
  block/dasd: detailed I/O errors
  s390/dasd: Reduce amount of messages for specific errors
  s390/dasd: Implement block timeout handling
  s390/dasd: process all requests in the device tasklet
  s390/dasd: make number of retries configurable
  s390/dasd: Clarify comment
  s390/hwsampler: Updated misleading member names in hws_data_entry
  s390/appldata_net_sum: do not use static data
  s390/appldata_mem: do not use static data
  s390/vmwatchdog: do not use static data
  s390/airq: simplify adapter interrupt code
  s390/pci: remove per device debug attribute
  s390/dma: remove gratuitous brackets
  s390/facility: decompose test_facility()
  s390/sclp: remove duplicated include from sclp_ctl.c
  s390/irq: store interrupt information in pt_regs
  s390/drivers: Cocci spatch "ptr_ret.spatch"
  ...
2013-07-03 11:08:24 -07:00
Linus Torvalds
f317ff9eed Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "Surprisingly, Lai and I didn't break too many things implementing
  custom pools and stuff last time around and there aren't any follow-up
  changes necessary at this point.

  The only change in this pull request is Viresh's patches to make some
  per-cpu workqueues to behave as unbound workqueues dependent on a boot
  param whose default can be configured via a config option.  This leads
  to higher processing overhead / lower bandwidth as more work items are
  bounced across CPUs; however, it can lead to noticeable powersave in
  certain configurations - ~10% w/ idlish constant workload on a
  big.LITTLE configuration according to Viresh.

  This is because per-cpu workqueues interfere with how the scheduler
  perceives whether or not each CPU is idle by forcing pinned tasks on
  them, which makes the scheduler's power-aware scheduling decisions
  less effective.

  Its effectiveness is likely less pronounced on homogenous
  configurations and this type of optimization can probably be made
  automatic; however, the changes are pretty minimal and the affected
  workqueues are clearly marked, so it's an easy gain for some
  configurations for the time being with pretty unintrusive changes."

* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  fbcon: queue work on power efficient wq
  block: queue work on power efficient wq
  PHYLIB: queue work on system_power_efficient_wq
  workqueue: Add system wide power_efficient workqueues
  workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
2013-07-02 19:53:30 -07:00
Hannes Reinecke
d1ffc1f866 block/dasd: detailed I/O errors
The DASD driver is using FASTFAIL as an equivalent to the
transport errors in SCSI. And the 'steal lock' function maps
roughly to a reservation error. So we should be returning the
appropriate error codes when completing a request.

Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-07-01 17:31:22 +02:00
Aaron Lu
c60855cdb9 blkpm: avoid sleep when holding queue lock
In blk_post_runtime_resume, an autosuspend request will be initiated for
the device. Since we are holding the queue lock, we can't sleep and thus
we should use the async version to initiate an autosuspend, i.e.
pm_request_suspend instead of pm_runtime_suspend, which might sleep.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-17 10:00:43 +02:00
Viresh Kumar
695588f945 block: queue work on power efficient wq
Block layer uses workqueues for multiple purposes. There is no real dependency
of scheduling these on the cpu which scheduled them.

On a idle system, it is observed that and idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14 10:50:07 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Linus Torvalds
0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Jens Axboe
705cd0ea1c Merge branch 'for-jens' of http://evilpiepirate.org/git/linux-bcache into for-3.10/core
This contains Kents prep work for the immutable bio_vecs.
2013-03-24 21:38:59 -06:00
Kent Overstreet
f73a1c7d11 block: Add bio_end_sector()
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2013-03-23 14:15:29 -07:00
Kent Overstreet
f79ea41614 block: Refactor blk_update_request()
Converts it to use bio_advance(), simplifying it quite a bit in the
process.

Note that req_bio_endio() now always calls bio_advance() - which means
it always loops over the biovec, not just on partial completions. Don't
expect it to affect performance, but worth noting.

Tested it by forcing partial updates, and dumping before and after on
various bio/bvec fields when doing a partial update.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
2013-03-23 14:15:28 -07:00
Lin Ming
c8158819d5 block: implement runtime pm strategy
When a request is added:
    If device is suspended or is suspending and the request is not a
    PM request, resume the device.

When the last request finishes:
    Call pm_runtime_mark_last_busy().

When pick a request:
    If device is resuming/suspending, then only PM request is allowed
    to go.

The idea and API is designed by Alan Stern and described here:
http://marc.info/?l=linux-scsi&m=133727953625963&w=2

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:22:15 -06:00
Lin Ming
6c95466758 block: add runtime pm helpers
Add runtime pm helper functions:

void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
  - Initialization function for drivers to call.

int blk_pre_runtime_suspend(struct request_queue *q)
  - If any requests are in the queue, mark last busy and return -EBUSY.
    Otherwise set q->rpm_status to RPM_SUSPENDING and return 0.

void blk_post_runtime_suspend(struct request_queue *q, int err)
  - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED.
    Otherwise set it to RPM_ACTIVE and mark last busy.

void blk_pre_runtime_resume(struct request_queue *q)
  - Set q->rpm_status to RPM_RESUMING.

void blk_post_runtime_resume(struct request_queue *q, int err)
  - If the resume succeeded then set q->rpm_status to RPM_ACTIVE
    and call __blk_run_queue, then mark last busy and autosuspend.
    Otherwise set q->rpm_status to RPM_SUSPENDED.

The idea and API is designed by Alan Stern and described here:
http://marc.info/?l=linux-scsi&m=133727953625963&w=2

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:22:15 -06:00
Linus Torvalds
ee89f81252 Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block
Pull block IO core bits from Jens Axboe:
 "Below are the core block IO bits for 3.9.  It was delayed a few days
  since my workstation kept crashing every 2-8h after pulling it into
  current -git, but turns out it is a bug in the new pstate code (divide
  by zero, will report separately).  In any case, it contains:

   - The big cfq/blkcg update from Tejun and and Vivek.

   - Additional block and writeback tracepoints from Tejun.

   - Improvement of the should sort (based on queues) logic in the plug
     flushing.

   - _io() variants of the wait_for_completion() interface, using
     io_schedule() instead of schedule() to contribute to io wait
     properly.

   - Various little fixes.

  You'll get two trivial merge conflicts, which should be easy enough to
  fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
  block: remove redundant check to bd_openers()
  block: use i_size_write() in bd_set_size()
  cfq: fix lock imbalance with failed allocations
  drivers/block/swim3.c: fix null pointer dereference
  block: don't select PERCPU_RWSEM
  block: account iowait time when waiting for completion of IO request
  sched: add wait_for_completion_io[_timeout]
  writeback: add more tracepoints
  block: add block_{touch|dirty}_buffer tracepoint
  buffer: make touch_buffer() an exported function
  block: add @req to bio_{front|back}_merge tracepoints
  block: add missing block_bio_complete() tracepoint
  block: Remove should_sort judgement when flush blk_plug
  block,elevator: use new hashtable implementation
  cfq-iosched: add hierarchical cfq_group statistics
  cfq-iosched: collect stats from dead cfqgs
  cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  block: RCU free request_queue
  blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  ...
2013-02-28 12:52:24 -08:00
Darrick J. Wong
ffecfd1a72 block: optionally snapshot page contents to provide stable pages during write
This provides a band-aid to provide stable page writes on jbd without
needing to backport the fixed locking and page writeback bit handling
schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
page contents instead of waiting.

For those wondering about the ext3 bandage -- fixing the jbd locking
(which was done as part of ext4dev years ago) is a lot of surgery, and
setting PG_writeback on data pages when we actually hold the page lock
dropped ext3 performance by nearly an order of magnitude.  If we're
going to migrate iscsi and raid to use stable page writes, the
complaints about high latency will likely return.  We might as well
centralize their page snapshotting thing to one place.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:20 -08:00
Tejun Heo
8c1cf6bb02 block: add @req to bio_{front|back}_merge tracepoints
bio_{front|back}_merge tracepoints report a bio merging into an
existing request but didn't specify which request the bio is being
merged into.  Add @req to it.  This makes it impossible to share the
event template with block_bio_queue - split it out.

@req isn't used or exported to userland at this point and there is no
userland visible behavior change.  Later changes will make use of the
extra parameter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Tejun Heo
3a366e614d block: add missing block_bio_complete() tracepoint
bio completion didn't kick block_bio_complete TP.  Only dm was
explicitly triggering the TP on IO completion.  This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.

This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.

* Explicit trace_block_bio_complete() invocation removed from dm and
  the trace point is unexported.

* @rq dropped from trace_block_bio_complete().  bios may fly around
  w/o queue associated.  Verifying and accessing the assocaited queue
  belongs to TP probes.

* blktrace now gets both request and bio completions.  Make it ignore
  bio completions if request completion path is happening.

This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.

v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev.  Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Jianpeng Ma
422765c263 block: Remove should_sort judgement when flush blk_plug
In commit 975927b942c932,it add blk_rq_pos to sort rq when flushing.
Although this commit was used for the situation which blk_plug handled
multi devices on the same time like md device.
I think there must be some situations like this but only single
device.
So remove the should_sort judgement.
Because the parameter should_sort is only for this purpose,it can delete
should_sort from blk_plug.

CC: Shaohua Li <shli@kernel.org>
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11 14:46:09 +01:00
NeilBrown
cbae8d45d6 block: export block_unplug tracepoint
This allows stacked devices (like md/raid5) to provide blktrace
tracing, including unplug events.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:49:27 +01:00
Bart Van Assche
24faf6f604 block: Make blk_cleanup_queue() wait until request_fn finished
Some request_fn implementations, e.g. scsi_request_fn(), unlock
the queue lock internally. This may result in multiple threads
executing request_fn for the same queue simultaneously. Keep
track of the number of active request_fn calls and make sure that
blk_cleanup_queue() waits until all active request_fn invocations
have finished. A block driver may start cleaning up resources
needed by its request_fn as soon as blk_cleanup_queue() finished,
so blk_cleanup_queue() must wait for all outstanding request_fn
invocations to finish.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Chanho Min <chanho.min@lge.com>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:33:00 +01:00
Bart Van Assche
704605711e block: Avoid scheduling delayed work on a dead queue
Running a queue must continue after it has been marked dying until
it has been marked dead. So the function blk_run_queue_async() must
not schedule delayed work after blk_cleanup_queue() has marked a queue
dead. Hence add a test for that queue state in blk_run_queue_async()
and make sure that queue_unplugged() invokes that function with the
queue lock held. This avoids that the queue state can change after
it has been tested and before mod_delayed_work() is invoked. Drop
the queue dying test in queue_unplugged() since it is now
superfluous: __blk_run_queue() already tests whether or not the
queue is dead.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:32:30 +01:00
Bart Van Assche
c246e80d86 block: Avoid that request_fn is invoked on a dead queue
A block driver may start cleaning up resources needed by its
request_fn as soon as blk_cleanup_queue() finished, so request_fn
must not be invoked after draining finished. This is important
when blk_run_queue() is invoked without any requests in progress.
As an example, if blk_drain_queue() and scsi_run_queue() run in
parallel, blk_drain_queue() may have finished all requests after
scsi_run_queue() has taken a SCSI device off the starved list but
before that last function has had a chance to run the queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Chanho Min <chanho.min@lge.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:32:01 +01:00
Bart Van Assche
807592a4fa block: Let blk_drain_queue() caller obtain the queue lock
Let the caller of blk_drain_queue() obtain the queue lock to improve
readability of the patch called "Avoid that request_fn is invoked on
a dead queue".

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:30:59 +01:00
Bart Van Assche
3f3299d5c0 block: Rename queue dead flag
QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
stop. After this flag has been set queue draining starts. However,
during the queue draining phase it is still safe to invoke the
queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
flag.

This patch has been generated by running the following command
over the kernel source tree:

git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g'      \
        -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g';                \
sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h;                                       \
sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:30:58 +01:00
Ezequiel Garcia
c304a51bf4 block: use NUMA_NO_NODE instead of -1
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>

Modified by me to cover blk_init_queue() as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-10 10:41:13 +01:00
Jianpeng Ma
975927b942 block: Add blk_rq_pos(rq) to sort rq when plushing
My workload is a raid5 which had 16 disks. And used our filesystem to
write using direct-io mode.

I used the blktrace to find those message:
8,16   0     6647     2.453665504  2579  M   W 7493152 + 8 [md0_raid5]
8,16   0     6648     2.453672411  2579  Q   W 7493160 + 8 [md0_raid5]
8,16   0     6649     2.453672606  2579  M   W 7493160 + 8 [md0_raid5]
8,16   0     6650     2.453679255  2579  Q   W 7493168 + 8 [md0_raid5]
8,16   0     6651     2.453679441  2579  M   W 7493168 + 8 [md0_raid5]
8,16   0     6652     2.453685948  2579  Q   W 7493176 + 8 [md0_raid5]
8,16   0     6653     2.453686149  2579  M   W 7493176 + 8 [md0_raid5]
8,16   0     6654     2.453693074  2579  Q   W 7493184 + 8 [md0_raid5]
8,16   0     6655     2.453693254  2579  M   W 7493184 + 8 [md0_raid5]
8,16   0     6656     2.453704290  2579  Q   W 7493192 + 8 [md0_raid5]
8,16   0     6657     2.453704482  2579  M   W 7493192 + 8 [md0_raid5]
8,16   0     6658     2.453715016  2579  Q   W 7493200 + 8 [md0_raid5]
8,16   0     6659     2.453715247  2579  M   W 7493200 + 8 [md0_raid5]
8,16   0     6660     2.453721730  2579  Q   W 7493208 + 8 [md0_raid5]
8,16   0     6661     2.453721974  2579  M   W 7493208 + 8 [md0_raid5]
8,16   0     6662     2.453728202  2579  Q   W 7493216 + 8 [md0_raid5]
8,16   0     6663     2.453728436  2579  M   W 7493216 + 8 [md0_raid5]
8,16   0     6664     2.453734782  2579  Q   W 7493224 + 8 [md0_raid5]
8,16   0     6665     2.453735019  2579  M   W 7493224 + 8 [md0_raid5]
8,16   0     6666     2.453741401  2579  Q   W 7493232 + 8 [md0_raid5]
8,16   0     6667     2.453741632  2579  M   W 7493232 + 8 [md0_raid5]
8,16   0     6668     2.453748148  2579  Q   W 7493240 + 8 [md0_raid5]
8,16   0     6669     2.453748386  2579  M   W 7493240 + 8 [md0_raid5]
8,16   0     6670     2.453851843  2579  I   W 7493144 + 104 [md0_raid5]
8,16   0        0     2.453853661     0  m   N cfq2579 insert_request
8,16   0     6671     2.453854064  2579  I   W 7493120 + 24 [md0_raid5]
8,16   0        0     2.453854439     0  m   N cfq2579 insert_request
8,16   0     6672     2.453854793  2579  U   N [md0_raid5] 2
8,16   0        0     2.453855513     0  m   N cfq2579 Not idling.st->count:1
8,16   0        0     2.453855927     0  m   N cfq2579 dispatch_insert
8,16   0        0     2.453861771     0  m   N cfq2579 dispatched a request
8,16   0        0     2.453862248     0  m   N cfq2579 activate rq,drv=1
8,16   0     6673     2.453862332  2579  D   W 7493120 + 24 [md0_raid5]
8,16   0        0     2.453865957     0  m   N cfq2579 Not idling.st->count:1
8,16   0        0     2.453866269     0  m   N cfq2579 dispatch_insert
8,16   0        0     2.453866707     0  m   N cfq2579 dispatched a request
8,16   0        0     2.453867061     0  m   N cfq2579 activate rq,drv=2
8,16   0     6674     2.453867145  2579  D   W 7493144 + 104 [md0_raid5]
8,16   0     6675     2.454147608     0  C   W 7493120 + 24 [0]
8,16   0        0     2.454149357     0  m   N cfq2579 complete rqnoidle 0
8,16   0     6676     2.454791505     0  C   W 7493144 + 104 [0]
8,16   0        0     2.454794803     0  m   N cfq2579 complete rqnoidle 0
8,16   0        0     2.454795160     0  m   N cfq schedule dispatch

From above messages,we can find rq[W 7493144 + 104] and rq[W
7493120 + 24] do not merge.
Because the bio order is:
  8,16   0     6638     2.453619407  2579  Q   W 7493144 + 8 [md0_raid5]
  8,16   0     6639     2.453620460  2579  G   W 7493144 + 8 [md0_raid5]
  8,16   0     6640     2.453639311  2579  Q   W 7493120 + 8 [md0_raid5]
  8,16   0     6641     2.453639842  2579  G   W 7493120 + 8 [md0_raid5]
The bio(7493144) first and bio(7493120) later.So the subsequent
bios will be divided into two parts.
When flushing plug-list,because elv_attempt_insert_merge only support
backmerge,not supporting frontmerge.
So rq[7493120 + 24] can't merge with rq[7493144 + 104].

From my test,i found those situation can count 25% in our system.
Using this patch, there is no this situation.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
CC:Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-25 21:58:17 +02:00