We can't use krefs since it's apparently restricted to very basic
reference counting.
This reverts commit e4a683c8.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, media presence polling for removeable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supercedes ->media_changed().
It should check whether there's any pending event and return if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called parallelly.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_change() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There's no reason for register_disk() and del_gendisk() to be in
fs/partitions/check.c. Move both to genhd.c. While at it, collapse
unlink_gendisk(), which was artificially in a separate function due to
genhd.c / check.c split, into del_gendisk().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This reverts commit 7681bfeecc.
Conflicts:
include/linux/genhd.h
It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...
Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I'm reposting this patch series as v4 since there have been no additional
comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
are against Linus's tree: 2bfc96a127
(2.6.36-rc3).
Would this patchset be suitable for inclusion in an mm branch?
This changes adds a partition_meta_info struct which itself contains a
union of structures that provide partition table specific metadata.
This change leaves the union empty. The subsequent patch includes an
implementation for CONFIG_EFI_PARTITION-based metadata.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This flag is not used, so best discarded.
Signed-off-by: NeilBrown <neilb@suse.de>
--
Hi Jens,
I came across this recently - these are the only two occurances
of "GENHD_FL_DRIVERFS" in the kernel, so it cannot be needed.
NeilBrown
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This fixes the sparse warning:
fs/ext4/super.c:2390:40: warning: symbol 'i' shadows an earlier one
fs/ext4/super.c:2368:22: originally declared here
Using 'i' in a macro is dubious practice.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
While SSDs track block usage on a per-sector basis, RAID arrays often
have allocation blocks that are bigger. Allow the discard granularity
and alignment to be set and teach the topology stacking logic how to
handle them.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit a9327cac44 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This allows subsytems to provide devtmpfs with non-default permissions
for the device node. Instead of the default mode of 0600, null, zero,
random, urandom, full, tty, ptmx now have a mode of 0666, which allows
non-privileged processes to access standard device nodes in case no
other userspace process applies the expected permissions.
This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complain.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.
This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds support for block drivers to report their requested nodename
to userspace. It also updates a number of block drivers to provide the
needed subdirectory and device name to be used for them.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (29 commits)
ide: re-implement ide_pci_init_one() on top of ide_pci_init_two()
ide: unexport ide_find_dma_mode()
ide: fix PowerMac bootup oops
ide: skip probe if there are no devices on the port (v2)
sl82c105: add printk() logging facility
ide-tape: fix proc warning
ide: add IDE_DFLAG_NIEN_QUIRK device flag
ide: respect quirk_drives[] list on all controllers
hpt366: enable all quirks for devices on quirk_drives[] list
hpt366: sync quirk_drives[] list with pdc202xx_{new,old}.c
ide: remove superfluous SELECT_MASK() call from do_rw_taskfile()
ide: remove superfluous SELECT_MASK() call from ide_driveid_update()
icside: remove superfluous ->maskproc method
ide-tape: fix IDE_AFLAG_* atomic accesses
ide-tape: change IDE_AFLAG_IGNORE_DSC non-atomically
pdc202xx_old: kill resetproc() method
pdc202xx_old: don't call pdc202xx_reset() on IRQ timeout
pdc202xx_old: use ide_dma_test_irq()
ide: preserve Host Protected Area by default (v2)
ide-gd: implement block device ->set_capacity method (v2)
...
* Add ->set_capacity block device method and use it in rescan_partitions()
to attempt enabling native capacity of the device upon detecting the
partition which exceeds device capacity.
* Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling
native capacity during partition scan.
Together with the consecutive patch implementing ->set_capacity method in
ide-gd device driver this allows automatic disabling of Host Protected Area
(HPA) if any partitions overlapping HPA are detected.
Cc: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
To support devices with physical block sizes bigger than 512 bytes we
need to ensure proper alignment. This patch adds support for exposing
I/O topology characteristics as devices are stacked.
logical_block_size is the smallest unit the device can address.
physical_block_size indicates the smallest I/O the device can write
without incurring a read-modify-write penalty.
The io_min parameter is the smallest preferred I/O size reported by
the device. In many cases this is the same as the physical block
size. However, the io_min parameter can be scaled up when stacking
(RAID5 chunk size > physical block size).
The io_opt characteristic indicates the optimal I/O size reported by
the device. This is usually the stripe width for arrays.
The alignment_offset parameter indicates the number of bytes the start
of the device/partition is offset from the device's natural alignment.
Partition tools and MD/DM utilities can use this to pad their offsets
so filesystems start on proper boundaries.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
/proc/diskstats used to show stats for all disks whether they're
zero-sized or not and their non-zero partitions. Commit
074a7aca7a accidentally changed the
behavior such that it doesn't print out zero sized disks. This patch
implements DISK_PITER_INCL_EMPTY_PART0 flag to partition iterator and
uses it in diskstats_show() such that empty part0 is shown in
/proc/diskstats.
Reported and bisectd by Dianel Collins.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Collins <solemnwarning@solemnwarning.no-ip.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>