Commit Graph

144 Commits

Author SHA1 Message Date
Martin K. Petersen
e03a72e136 block: Stop using byte offsets
All callers of the stacking functions use 512-byte sector units rather
than byte offsets.  Simplify the code so the stacking functions take
sectors when specifying data offsets.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-11 14:30:09 +01:00
Karel Zak
87038c2d5b partitions: read whole sector with EFI GPT header
The size of EFI GPT header is not static, but whole sector is
allocated for the header. The HeaderSize field must be greater
than 92 (= sizeof(struct gpt_header) and must be less than or
equal to the logical block size.

It means we have to read whole sector with the header, because the
header crc32 checksum is calculated according to HeaderSize.

For more details see UEFI standard (version 2.3, May 2009):
  - 5.3.1 GUID Format overview, page 93
  - Table 13. GUID Partition Table Header, page 96

Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-23 09:29:58 +01:00
Karel Zak
7d13af3279 partitions: use sector size for EFI GPT
Currently, kernel uses strictly 512-byte sectors for EFI GPT parsing.
That's wrong.

UEFI standard (version 2.3, May 2009, 5.3.1 GUID Format overview, page
95) defines that LBA is always based on the logical block size. It
means bdev_logical_block_size() (aka BLKSSZGET) for Linux.

This patch removes static sector size from EFI GPT parser.

The problem is reproducible with the latest GNU Parted:

 # modprobe scsi_debug dev_size_mb=50 sector_size=4096

  # ./parted /dev/sdb print
  Model: Linux scsi_debug (scsi)
  Disk /dev/sdb: 52.4MB
  Sector size (logical/physical): 4096B/4096B
  Partition Table: gpt

  Number  Start   End     Size    File system  Name     Flags
   1      24.6kB  3002kB  2978kB               primary
   2      3002kB  6001kB  2998kB               primary
   3      6001kB  9003kB  3002kB               primary

  # blockdev --rereadpt /dev/sdb
  # dmesg | tail -1
   sdb: unknown partition table      <---- !!!

with this patch:

  # blockdev --rereadpt /dev/sdb
  # dmesg | tail -1
   sdb: sdb1 sdb2 sdb3

Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-23 09:29:13 +01:00
Martin K. Petersen
86b3728141 block: Expose discard granularity
While SSDs track block usage on a per-sector basis, RAID arrays often
have allocation blocks that are bigger.  Allow the discard granularity
and alignment to be set and teach the topology stacking logic how to
handle them.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-10 11:50:21 +01:00
Nikanth Karthikesan
316d315bff block: Seperate read and write statistics of in_flight requests v2
Commit a9327cac44 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.

But  Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.

So this was reverted by commit 0f78ab9899

The problem was in part_round_stats_single(), I missed the following:
        if (now == part->stamp)
                return;

-       if (part->in_flight) {
+       if (part_in_flight(part)) {
                __part_stat_add(cpu, part, time_in_queue,
                                part_in_flight(part) * (now - part->stamp));
                __part_stat_add(cpu, part, io_ticks, (now - part->stamp));

With this chunk included, the reported regression gets fixed.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-06 20:16:55 +02:00
Jens Axboe
0f78ab9899 Revert "Seperate read and write statistics of in_flight requests"
This reverts commit a9327cac44.

Corrado Zoccolo <czoccolo@gmail.com> reports:

"with 2.6.32-rc1 I started getting the following strange output from
"iostat -kx 2":
Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10,70    0,00    3,16   15,75    0,00   70,38

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda              18,22     0,00    0,67    0,01    14,77     0,02
43,94     0,01   10,53 39043915,03 2629219,87
sdb              60,89     9,68   50,79    3,04  1724,43    50,52
65,95     0,70   13,06 488437,47 2629219,87

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,72    0,00    0,74    0,00    0,00   96,53

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6,68    0,00    0,99    0,00    0,00   92,33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4,40    0,00    0,73    1,47    0,00   93,40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     4,00    0,00    3,00     0,00    28,00
18,67     0,06   19,50 333,33 100,00

Global values for service time and utilization are garbage. For
interval values, utilization is always 100%, and service time is
higher than normal.

I bisected it down to:
[a9327cac44] Seperate read and write
statistics of in_flight requests
and verified that reverting just that commit indeed solves the issue
on 2.6.32-rc1."

So until this is debugged, revert the bad commit.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-04 21:04:38 +02:00
Alexey Dobriyan
83d5cde47d const: make block_device_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:25 -07:00
David Brownell
a4dbd6740d driver model: constify attribute groups
Let attribute group vectors be declared "const".  We'd
like to let most attribute metadata live in read-only
sections... this is a start.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-15 09:50:47 -07:00
Nikanth Karthikesan
a9327cac44 Seperate read and write statistics of in_flight requests
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:52 +02:00
Heiko Carstens
f8c73c790c partitions: fix broken uevent_suppress conversion
git commit f67f129e "Driver core: implement uevent suppress in kobject"
contains this chunk for fs/partitions/check.c:

 	/* suppress uevent if the disk supresses it */
-	if (!ddev->uevent_suppress)
+	if (!dev_get_uevent_suppress(pdev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);

However that should have been

-	if (!ddev->uevent_suppress)
+	if (!dev_get_uevent_suppress(ddev))

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Ming Lei <tom.leiming@gmail.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-12 13:02:09 -07:00
Linus Torvalds
d614aec475 Merge branch 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (29 commits)
  ide: re-implement ide_pci_init_one() on top of ide_pci_init_two()
  ide: unexport ide_find_dma_mode()
  ide: fix PowerMac bootup oops
  ide: skip probe if there are no devices on the port (v2)
  sl82c105: add printk() logging facility
  ide-tape: fix proc warning
  ide: add IDE_DFLAG_NIEN_QUIRK device flag
  ide: respect quirk_drives[] list on all controllers
  hpt366: enable all quirks for devices on quirk_drives[] list
  hpt366: sync quirk_drives[] list with pdc202xx_{new,old}.c
  ide: remove superfluous SELECT_MASK() call from do_rw_taskfile()
  ide: remove superfluous SELECT_MASK() call from ide_driveid_update()
  icside: remove superfluous ->maskproc method
  ide-tape: fix IDE_AFLAG_* atomic accesses
  ide-tape: change IDE_AFLAG_IGNORE_DSC non-atomically
  pdc202xx_old: kill resetproc() method
  pdc202xx_old: don't call pdc202xx_reset() on IRQ timeout
  pdc202xx_old: use ide_dma_test_irq()
  ide: preserve Host Protected Area by default (v2)
  ide-gd: implement block device ->set_capacity method (v2)
  ...
2009-06-12 09:29:42 -07:00
Bartlomiej Zolnierkiewicz
db429e9ec0 partitions: add ->set_capacity block device method
* Add ->set_capacity block device method and use it in rescan_partitions()
  to attempt enabling native capacity of the device upon detecting the
  partition which exceeds device capacity.

* Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling
  native capacity during partition scan.

Together with the consecutive patch implementing ->set_capacity method in
ide-gd device driver this allows automatic disabling of Host Protected Area
(HPA) if any partitions overlapping HPA are detected.

Cc: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:52 +02:00
Bartlomiej Zolnierkiewicz
02c33b123e partitions: warn about the partition exceeding device capacity
The current warning message says only about the kernel's action taken
without mentioning the underlying reason behind it.

Noticed-by: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:51 +02:00
Martin K. Petersen
c72758f337 block: Export I/O topology for block devices and partitions
To support devices with physical block sizes bigger than 512 bytes we
need to ensure proper alignment.  This patch adds support for exposing
I/O topology characteristics as devices are stacked.

  logical_block_size is the smallest unit the device can address.

  physical_block_size indicates the smallest I/O the device can write
  without incurring a read-modify-write penalty.

  The io_min parameter is the smallest preferred I/O size reported by
  the device.  In many cases this is the same as the physical block
  size.  However, the io_min parameter can be scaled up when stacking
  (RAID5 chunk size > physical block size).

  The io_opt characteristic indicates the optimal I/O size reported by
  the device.  This is usually the stripe width for arrays.

  The alignment_offset parameter indicates the number of bytes the start
  of the device/partition is offset from the device's natural alignment.
  Partition tools and MD/DM utilities can use this to pad their offsets
  so filesystems start on proper boundaries.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:55 +02:00
Martin K. Petersen
e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Ingo Molnar
8302294f43 Merge branch 'tracing/core-v2' into tracing-for-linus
Conflicts:
	include/linux/slub_def.h
	lib/Kconfig.debug
	mm/slob.c
	mm/slub.c
2009-04-02 00:49:02 +02:00
Linus Torvalds
21cdbc1378 Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (81 commits)
  [S390] remove duplicated #includes
  [S390] cpumask: use mm_cpumask() wrapper
  [S390] cpumask: Use accessors code.
  [S390] cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits.
  [S390] cpumask: remove cpu_coregroup_map
  [S390] fix clock comparator save area usage
  [S390] Add hwcap flag for the etf3 enhancement facility
  [S390] Ensure that ipl panic notifier is called late.
  [S390] fix dfp elf hwcap/facility bit detection
  [S390] smp: perform initial cpu reset before starting a cpu
  [S390] smp: fix memory leak on __cpu_up
  [S390] ipl: Improve checking logic and remove switch defaults.
  [S390] s390dbf: Remove needless check for NULL pointer.
  [S390] s390dbf: Remove redundant initilizations.
  [S390] use kzfree()
  [S390] BUG to BUG_ON changes
  [S390] zfcpdump: Prevent zcore from beeing built as a kernel module.
  [S390] Use csum_partial in checksum.h
  [S390] cleanup lowcore.h
  [S390] eliminate ipl_device from lowcore
  ...
2009-03-26 16:04:22 -07:00
Stefan Weinhuber
b44b0ab3ba [S390] dasd: add large volume support
The dasd device driver will now support ECKD devices with more then
65520 cylinders.
In the traditional ECKD adressing scheme each track is addressed
by a 16-bit cylinder and 16-bit head number. The new addressing
scheme makes use of the fact that the actual number of heads is
never larger then 15, so 12 bits of the head number can be redefined
to be part of the cylinder address.

Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-03-26 15:24:05 +01:00
Ming Lei
f67f129e51 Driver core: implement uevent suppress in kobject
This patch implements uevent suppress in kobject and removes it
from struct device, based on the following ideas:

1,Uevent sending should be one attribute of kobject, so suppressing it
in kobject layer is more natural than in device layer. By this way,
we can do it for other objects embedded with kobject.

2,It may save several bytes for each instance of struct device.(On my
omap3(32bit ARM) based box, can save 8bytes per device object)

This patch also introduces dev_set|get_uevent_suppress() helpers to
set and query uevent_suppress attribute in case to help kobject
as private part of struct device in future.

[This version is against the latest driver-core patch set of Greg,please
ignore the last version.]

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:26 -07:00
Arnaldo Carvalho de Melo
157f9c00e8 tracing/blktrace: fix up checkpatch reported problems in ftrace plugin patch
Also make sure sparse (make C=2 block/blktrace.o) is happy too.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-26 18:30:01 +01:00
Arnaldo Carvalho de Melo
c71a896154 blktrace: add ftrace plugin
Impact: New way of using the blktrace infrastructure

This drops the requirement of userspace utilities to use the blktrace
facility.

Configuration is done thru sysfs, adding a "trace" directory to the
partition directory where blktrace can be enabled for the associated
request_queue.

The same filters present in the IOCTL interface are present as sysfs
device attributes.

The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
filters.

The other files in this directory: pid, act_mask, start_lba and end_lba
can be used with the same meaning as with the IOCTL interface.

Using the sysfs interface will only setup the request_queue->blk_trace
fields, tracing will only take place when the "blk" tracer is selected
via the ftrace interface, as in the following example:

To see the trace, one can use the /d/tracing/trace file or the
/d/tracign/trace_pipe file, with semantics defined in the ftrace
documentation in Documentation/ftrace.txt.

[root@f10-1 ~]# cat /t/trace
       kjournald-305   [000]  3046.491224:   8,1    A WBS 6367 + 8 <- (8,1) 6304
       kjournald-305   [000]  3046.491227:   8,1    Q   R 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491236:   8,1    G  RB 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491239:   8,1    P  NS [kjournald]
       kjournald-305   [000]  3046.491242:   8,1    I RBS 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491251:   8,1    D  WB 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491610:   8,1    U  WS [kjournald] 1
          <idle>-0     [000]  3046.511914:   8,1    C  RS 6367 + 8 [6367]
[root@f10-1 ~]#

The default line context (prefix) format is the one described in the ftrace
documentation, with the blktrace specific bits using its existing format,
described in blkparse(8).

If one wants to have the classic blktrace formatting, this is possible by
using:

[root@f10-1 ~]# echo blk_classic > /t/trace_options
[root@f10-1 ~]# cat /t/trace
  8,1    0  3046.491224   305  A WBS 6367 + 8 <- (8,1) 6304
  8,1    0  3046.491227   305  Q   R 6367 + 8 [kjournald]
  8,1    0  3046.491236   305  G  RB 6367 + 8 [kjournald]
  8,1    0  3046.491239   305  P  NS [kjournald]
  8,1    0  3046.491242   305  I RBS 6367 + 8 [kjournald]
  8,1    0  3046.491251   305  D  WB 6367 + 8 [kjournald]
  8,1    0  3046.491610   305  U  WS [kjournald] 1
  8,1    0  3046.511914     0  C  RS 6367 + 8 [6367]
[root@f10-1 ~]#

Using the ftrace standard format allows more flexibility, such
as the ability of asking for backtraces via trace_options:

[root@f10-1 ~]# echo noblk_classic > /t/trace_options
[root@f10-1 ~]# echo stacktrace > /t/trace_options

[root@f10-1 ~]# cat /t/trace
       kjournald-305   [000]  3318.826779:   8,1    A WBS 6375 + 8 <- (8,1) 6312
       kjournald-305   [000]  3318.826782:
 <= submit_bio
 <= submit_bh
 <= sync_dirty_buffer
 <= journal_commit_transaction
 <= kjournald
 <= kthread
 <= child_rip
       kjournald-305   [000]  3318.826836:   8,1    Q   R 6375 + 8 [kjournald]
       kjournald-305   [000]  3318.826837:
 <= generic_make_request
 <= submit_bio
 <= submit_bh
 <= sync_dirty_buffer
 <= journal_commit_transaction
 <= kjournald
 <= kthread

Please read the ftrace documentation to use aditional, standardized
tracing filters such as /d/tracing/trace_cpumask, etc.

See also /d/tracing/trace_mark to add comments in the trace stream,
that is equivalent to the /d/block/sdaN/msg interface.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-26 14:40:53 +01:00
Neil Brown
54b0d12769 block: fix bug in ptbl lookup cache
Neil writes:

   Hi Jens,

    I've found a little bug for you.  It was introduced by
        a6f23657d3

        block: add one-hit cache for disk partition lookup

    and has the effect of killing my machine whenever I try to assemble
    an md array :-(
    One of the devices in the array has partitions, and mdadm always
    deletes partitions before putting a whole-device in an array (as it
    can cause confusion).  The next IO to that device locks the machine.
    I don't really understand exactly why it locks up, but it happens in
    disk_map_sector_rcu().  This patch fixes it.

Which is due to a missing clear of the (now) stale partition lookup
data. So clear that when we delete a partition.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-09 21:46:13 +01:00
Kay Sievers
3ada8b7e98 block: struct device - replace bus_id with dev_name(), dev_set_name()
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-06 10:44:43 -08:00
Tejun Heo
55e8e30c38 block/md: fix md autodetection
Block ext devt conversion missed md_autodetect_dev() call in
rescan_partitions() leaving md autodetect unable to see partitions.
Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-18 15:08:56 +01:00
Tejun Heo
ba32929a91 block: make add_partition() return pointer to hd_struct
Make add_partition() return pointer to the new hd_struct on success
and ERR_PTR() value on failure.  This change will be used to fix md
autodetection bug.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-18 15:08:56 +01:00