Commit Graph

400664 Commits

Author SHA1 Message Date
Jens Axboe
e7e2450001 blk-mq: don't disallow request merges for req->special being set
For blk-mq, if a driver has requested per-request payload data
to carry command structures, they are stuffed into req->special.
For an old style request based driver, req->special is used
for the same purpose but indicates that a per-driver request
structure has been prepared for the request already. So for the
old style driver, we do not merge such requests.

As most/all blk-mq drivers will use the payload feature, and
since we have no problem merging on these, make this check
dependent on whether it's a blk-mq enabled driver or not.

Reported-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-29 12:11:47 -06:00
Shaohua Li
92f399c72a blk-mq: mq plug list breakage
We switched to plug mq_list for mq, but some code are still using old list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-29 12:01:03 -06:00
Christoph Hellwig
3228f48be2 blk-mq: fix for flush deadlock
The flush state machine takes in a struct request, which then is
submitted multiple times to the underling driver.  The old block code
requeses the same request for each of those, so it does not have an
issue with tapping into the request pool.  The new one on the other hand
allocates a new request for each of the actualy steps of the flush
sequence. If have already allocated all of the tags for IO, we will
fail allocating the flush request.

Set aside a reserved request just for flushes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-28 13:33:58 -06:00
Christoph Hellwig
280d45f6c3 blk-mq: add blk_mq_stop_hw_queues
Add a helper to iterate over all hw queues and stop them.  This is useful
for driver that implement PM suspend functionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to just call blk_mq_stop_hw_queue() by Jens.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 14:45:58 +01:00
Jens Axboe
f2298c0403 null_blk: multi queue aware block test driver
A driver that simply completes IO it receives, it does no
transfers. Written to fascilitate testing of the blk-mq code.
It supports various module options to use either bio queueing,
rq queueing, or mq mode.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:56:00 +01:00
Jens Axboe
320ae51fee blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
  request units for IO. The block layer provides various helper
  functionalities to let drivers share code, things like tag
  management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
  block layer and IO submitter. Since this bypasses the IO stack,
  driver generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
  be able to uniquely identify a request both in the driver and
  to the hardware. The tagging uses per-cpu caches for freed
  tags, to enable cache hot reuse.

- Timeout handling without tracking request on a per-device
  basis. Basically the driver should be able to get a notification,
  if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
  submission queues. blk-mq can redirect IO completions to the
  desired location.

- Support for per-request payloads. Drivers almost always need
  to associate a request structure with some driver private
  command structure. Drivers can tell blk-mq this at init time,
  and then any request handed to the driver will have the
  required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
  gets neither of these. Even for high IOPS devices, merging
  sequential IO reduces per-command overhead and thus
  increases bandwidth.

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:56:00 +01:00
Shaohua Li
1dddc01af0 percpu_ida: add an API to return free tags
Add an API to return free tags, blk-mq-tag will use it.

Note, this just returns a snapshot of free tags number. blk-mq-tag has
two usages of it. One is for info output for diagnosis. The other is to
quickly check if there are free tags for request dispatch checking.
Neither requires very precise.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:56:00 +01:00
Shaohua Li
7fc2ba17e8 percpu_ida: add percpu_ida_for_each_free
Add a new API to iterate free ids. blk-mq-tag will use it.

Note, this doesn't guarantee to iterate all free ids restrictly. Caller
should be aware of this. blk-mq uses it to do sanity check for request
timedout, so can tolerate the limitation.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Shaohua Li
e26b53d0b2 percpu_ida: make percpu_ida percpu size/batch configurable
Make percpu_ida percpu size/batch configurable. The block-mq-tag will
use it.

After block-mq uses percpu_ida to manage tags, performance is improved.
My test is done in a 2 sockets machine, 12 process cross the 2 sockets.
So if there is lock contention or ipi, should be stressed heavily.
Testing is done for null-blk.

hw_queue_depth	nopatch iops	patch iops
64		~800k/s		~1470k/s
2048		~4470k/s	~4340k/s

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Shaohua Li
098faf5805 percpu_counter: make APIs irq safe
In my usage, sometimes the percpu APIs are called with irq locked,
sometimes not. lockdep complains there is potential deadlock. Let's
always use percpucounter lock in irq safe way. There should be no
performance penality, as all those are slow code path.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Christoph Hellwig
71fe07d040 block: remove request ref_count
This reference count has been around since before git history, but the only
place where it's used is in blk_execute_rq, and ther it is entirely useless
as it is incremented before submitting the request and decremented in the
end_io handler before waking up the submitter thread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Jens Axboe
5953316dbf block: make rq->cmd_flags be 64-bit
We have officially run out of flags in a 32-bit space. Extend it
to 64-bit even on 32-bit archs.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Jens Axboe
c84a83e2aa smp: don't warn about csd->flags having CSD_FLAG_LOCK cleared for !wait
blk-mq reuses the request potentially immediately, since the most
cache hot is always given out first. This means that rq->csd could
be reused between csd->func() being called and csd_unlock() being
called. This isn't a problem, since we never use wait == 1 for
the smp call function. Add CSD_FLAG_WAIT to be able to tell the
difference, retaining the warning for other cases.

Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Jens Axboe
e3daab6ce4 smp: export __smp_call_function_single()
The blk-mq core and the blk-mq null driver uses it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Linus Torvalds
61e6cfa80d Linux 3.12-rc5 2013-10-13 15:41:28 -07:00
Linus Torvalds
73cac03d0c Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog fixes from Wim Van Sebroeck:
 "This will fix a deadlock on the ts72xx_wdt driver, fix bitmasks in the
  kempld_wdt driver and fix a section mismatch in the sunxi_wdt driver"

* git://www.linux-watchdog.org/linux-watchdog:
  watchdog: sunxi: Fix section mismatch
  watchdog: kempld_wdt: Fix bit mask definition
  watchdog: ts72xx_wdt: locking bug in ioctl
2013-10-13 11:41:26 -07:00
Maxime Ripard
1d5898b4f8 watchdog: sunxi: Fix section mismatch
This driver has a section mismatch, for probe and remove functions,
leading to the following warning during the compilation.

WARNING: drivers/watchdog/built-in.o(.data+0x24): Section mismatch in
reference from the variable sunxi_wdt_driver to the function
.init.text:sunxi_wdt_probe()
The variable sunxi_wdt_driver references
the function __init sunxi_wdt_probe()

Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2013-10-13 20:02:03 +02:00
Jingoo Han
4c4e45669d watchdog: kempld_wdt: Fix bit mask definition
STAGE_CFG bits are defined as [5:4] bits. However, '(((x) & 0x30) << 4)'
handles [9:8] bits. Thus, it should be fixed in order to handle
[5:4] bits.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2013-10-13 20:01:57 +02:00
Dan Carpenter
8612ed0d97 watchdog: ts72xx_wdt: locking bug in ioctl
Calling the WDIOC_GETSTATUS & WDIOC_GETBOOTSTATUS and twice will cause a
interruptible deadlock.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2013-10-13 20:01:50 +02:00
Linus Torvalds
3552570a21 Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
 "A small batch of fixes this week, mostly OMAP related.  Nothing stands
  out as particularly controversial.

  Also a fix for a 3.12-rc1 timer regression for Exynos platforms,
  including the Chromebooks"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: exynos: dts: Update 5250 arch timer node with clock frequency
  ARM: OMAP2: RX-51: Add missing max_current to rx51_lp5523_led_config
  ARM: mach-omap2: board-generic: fix undefined symbol
  ARM: dts: Fix pinctrl mask for omap3
  ARM: OMAP3: Fix hardware detection for omap3630 when booted with device tree
  ARM: OMAP2: gpmc-onenand: fix sync mode setup with DT
2013-10-13 09:59:10 -07:00
Yuvaraj Kumar C D
4d594dd302 ARM: exynos: dts: Update 5250 arch timer node with clock frequency
Without the "clock-frequency" property in arch timer node, could able
to see the below crash dump.

[<c0014e28>] (unwind_backtrace+0x0/0xf4) from [<c0011808>] (show_stack+0x10/0x14)
[<c0011808>] (show_stack+0x10/0x14) from [<c036ac1c>] (dump_stack+0x7c/0xb0)
[<c036ac1c>] (dump_stack+0x7c/0xb0) from [<c01ab760>] (Ldiv0_64+0x8/0x18)
[<c01ab760>] (Ldiv0_64+0x8/0x18) from [<c0062f60>] (clockevents_config.part.2+0x1c/0x74)
[<c0062f60>] (clockevents_config.part.2+0x1c/0x74) from [<c0062fd8>] (clockevents_config_and_register+0x20/0x2c)
[<c0062fd8>] (clockevents_config_and_register+0x20/0x2c) from [<c02b8e8c>] (arch_timer_setup+0xa8/0x134)
[<c02b8e8c>] (arch_timer_setup+0xa8/0x134) from [<c04b47b4>] (arch_timer_init+0x1f4/0x24c)
[<c04b47b4>] (arch_timer_init+0x1f4/0x24c) from [<c04b40d8>] (clocksource_of_init+0x34/0x58)
[<c04b40d8>] (clocksource_of_init+0x34/0x58) from [<c049ed8c>] (time_init+0x20/0x2c)
[<c049ed8c>] (time_init+0x20/0x2c) from [<c049b95c>] (start_kernel+0x1e0/0x39c)

THis is because the Exynos u-boot, for example on the Chromebooks, doesn't set
up the CNTFRQ register as expected by arch_timer. Instead, we have to specify
the frequency in the device tree like this.

Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@samsung.com>
[olof: Changed subject, added comment, elaborated on commit message]
Signed-off-by: Olof Johansson <olof@lixom.net>
2013-10-13 09:33:54 -07:00
Olof Johansson
98ead6e001 Merge tag 'fixes-against-v3.12-rc3-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
From Tony Lindgren:

Few fixes for omap3 related hangs and errors that people have
noticed now that people are actually using the device tree
based booting for omap3.

Also one regression fix for timer compile for dra7xx when
omap5 is not selected, and a LED regression fix for n900.

* tag 'fixes-against-v3.12-rc3-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP2: RX-51: Add missing max_current to rx51_lp5523_led_config
  ARM: mach-omap2: board-generic: fix undefined symbol
  ARM: dts: Fix pinctrl mask for omap3
  ARM: OMAP3: Fix hardware detection for omap3630 when booted with device tree
  ARM: OMAP2: gpmc-onenand: fix sync mode setup with DT

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-10-13 09:33:32 -07:00
Linus Torvalds
2d4712b7a6 Merge branch 'parisc-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
 "This patchset includes a bugfix to prevent a kernel crash when memory
  in page zero is accessed by the kernel itself, e.g.  via
  probe_kernel_read().

  Furthermore we now export flush_cache_page() which is needed
  (indirectly) by the lustre filesystem.  The other patches remove
  unused functions and optimizes the page fault handler to only evaluate
  variables if needed, which again protects against possible kernel
  crashes"

* 'parisc-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: let probe_kernel_read() capture access to page zero
  parisc: optimize variable initialization in do_page_fault
  parisc: fix interruption handler to respect pagefault_disable()
  parisc: mark parisc_terminate() noreturn and cold.
  parisc: remove unused syscall_ipi() function.
  parisc: kill SMP single function call interrupt
  parisc: Export flush_cache_page() (needed by lustre)
2013-10-13 09:13:28 -07:00
Linus Torvalds
75c531881b Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine fixes from Vinod Koul:
 "Another week, time to send another fixes request taking time out of
  extended weekend for the festivities in this part of the world.

  We have two fixes from Sergei for rcar driver and one fixing memory
  leak of edma driver by Geyslan"

* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
  dma: edma.c: remove edma_desc leakage
  rcar-hpbdma: add parameter to set_slave() method
  rcar-hpbdma: remove shdma_free_irq() calls
2013-10-13 09:02:03 -07:00
Helge Deller
db080f9c53 parisc: let probe_kernel_read() capture access to page zero
Signed-off-by: Helge Deller <deller@gmx.de>
2013-10-13 17:46:31 +02:00