Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Currently there is not limitation of number of requests in the loop bio
list. This can lead into some nasty situations when the caller spawns
tons of bio requests taking huge amount of memory. This is even more
obvious with discard where blkdev_issue_discard() will submit all bios
for the range and wait for them to finish afterwards. On really big loop
devices and slow backing file system this can lead to OOM situation as
reported by Dave Chinner.
With this patch we will wait in loop_make_request() if the number of
bios in the loop bio list would exceed 'nr_congestion_on'.
We'll wake up the process as we process the bios form the list. Some
threshold hysteresis is in place to avoid high frequency oscillation.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
xfstests has always had random failures of tests due to loop devices
failing to be torn down and hence leaving filesytems that cannot be
unmounted. This causes test runs to immediately stop.
Over the past 6 or 7 years we've added hacks like explicit unmount
-d commands for loop mounts, losetup -d after unmount -d fails, etc,
but still the problems persist. Recently, the frequency of loop
related failures increased again to the point that xfstests 259 will
reliably fail with a stray loop device that was not torn down.
That is despite the fact the test is above as simple as it gets -
loop 5 or 6 times running mkfs.xfs with different paramters:
lofile=$(losetup -f)
losetup $lofile "$testfile"
"$MKFS_XFS_PROG" -b size=512 $lofile >/dev/null || echo "mkfs failed!"
sync
losetup -d $lofile
And losteup -d $lofile is failing with EBUSY on 1-3 of these loops
every time the test is run.
Turns out that blkid is running simultaneously with losetup -d, and
so it sees an elevated reference count and returns EBUSY. But why
is blkid running? It's obvious, isn't it? udev has decided to try
and find out what is on the block device as a result of a creation
notification. And it is racing with mkfs, so might still be scanning
the device when mkfs finishes and we try to tear it down.
So, make losetup -d force autoremove behaviour. That is, when the
last reference goes away, tear down the device. xfstests wants it
*gone*, not causing random teardown failures when we know that all
the operations the tests have specifically run on the device have
completed and are no longer referencing the loop device.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The idr_pre_get() function never returns a value < 0. It returns 0 (no
memory) or 1 (OK).
Reported-by: Silva Paulo <psdasilva@yahoo.com>
[ Rewrote Silva's patch, but attributing it to Silva anyway - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 8268f5a741 ("deny partial write for loop dev fd") tried to fix the
loop device partial read information leak problem. But it changed the
semantics of read behavior. When we read beyond the end of the device we
should get 0 bytes, which is normal behavior, we should not just return
-EIO
Instead of returning -EIO, zero out the bio to avoid information leak in
case of partail read.
Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Dmitry Monakhov <dmonakhov@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.
Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
discard_alignment is not relevant to the loop driver since it is
supposed to be set as a workaround for the old sector 63 alignments.
So set it to zero rather than block size.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The loop driver does not support discard if encryption is enabled,
fix the comment.
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
1) Anyone who has read access to loopdev has permission to call set_status
and may change important parameters such as lo_offset, lo_sizelimit and
so on, which contradicts to read access pattern and definitely equals
to write access pattern.
2) Add lo_offset over i_size check to prevent blkdev_size overflow.
##Testcase_bagin
#dd if=/dev/zero of=./file bs=1k count=1
#losetup /dev/loop0 ./file
/* userspace_application */
struct loop_info64 loinf;
fd = open("/dev/loop0", O_RDONLY);
ioctl(fd, LOOP_GET_STATUS64, &loinf);
/* Set offset to any value which is bigger than i_size, and sizelimit
* to nonzero value*/
loinf.lo_offset = 4096*1024;
loinf.lo_sizelimit = 1024;
ioctl(fd, LOOP_SET_STATUS64, &loinf);
/* After this loop device will have size similar to 0x7fffffffffxxxx */
#blockdev --getsz /dev/loop0
##OUTPUT: 36028797018955968
##Testcase_end
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If read was not fully successful we have to fail whole bio to prevent
information leak of old pages
##Testcase_begin
dd if=/dev/zero of=./file bs=1M count=1
losetup /dev/loop0 ./file -o 4096
truncate -s 0 ./file
# OOps loop offset is now beyond i_size, so read will silently fail.
# So bio's pages would not be cleared, may which result in information leak.
hexdump -C /dev/loop0
##testcase_end
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...
Fic up trivial header file includsion conflict in drivers/block/loop.c
Currently the loop device tries to call directly into write_begin/write_end
instead of going through ->write if it can. This is a fairly nasty shortcut
as write_begin and write_end are only callbacks for the generic write code
and expect to be called with filesystem specific locks held.
This code currently causes various issues for clustered filesystems as it
doesn't take the required cluster locks, and it also causes issues for XFS
as it doesn't properly lock against the swapext ioctl as called by the
defragmentation tools. This in case causes data corruption if
defragmentation hits a busy loop device in the wrong time window, as
reported by RH QA.
The reason why we have this shortcut is that it saves a data copy when
doing a transformation on the loop device, which is the technical term
for using cryptoloop (or an XOR transformation). Given that cryptoloop
has been deprecated in favour of dm-crypt my opinion is that we should
simply drop this shortcut instead of finding complicated ways to to
introduce a formal interface for this shortcut.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the loop device is associated (lo->lo_state == Lo_bound), it will have
a valid bdev pointed to by lo->lo_device. There is no reason to ever pass
an additional block_device pointer.
Signed-off-by: Ayan George <ayan.george@canonical.com>
Cc: Phillip Susi <psusi@cfl.rr.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The loopback driver failed to emit the change uevent when auto releasing
the device. Fixed lo_release() to pass the bdev to loop_clr_fd() so it
can emit the event.
Signed-off-by: Phillip Susi <psusi@cfl.rr.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ayan George <ayan@ayan.net>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.
Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Automatic partition scanning can be requested individually per loop
device during its setup by setting LO_FLAGS_PARTSCAN. By default, no
partition tables are scanned.
Userspace can now always add and remove partitions from all loop
devices, regardless if the in-kernel partition scanner is enabled or
not.
The needed partition minor numbers are allocated from the extended
minors space, the main loop device numbers will continue to match the
loop minors, regardless of the number of partitions used.
# grep . /sys/class/block/loop1/loop/*
/sys/block/loop1/loop/autoclear:0
/sys/block/loop1/loop/backing_file:/home/kay/data/stuff/part.img
/sys/block/loop1/loop/offset:0
/sys/block/loop1/loop/partscan:1
/sys/block/loop1/loop/sizelimit:0
# ls -l /dev/loop*
brw-rw---- 1 root disk 7, 0 Aug 14 20:22 /dev/loop0
brw-rw---- 1 root disk 7, 1 Aug 14 20:23 /dev/loop1
brw-rw---- 1 root disk 259, 0 Aug 14 20:23 /dev/loop1p1
brw-rw---- 1 root disk 259, 1 Aug 14 20:23 /dev/loop1p2
brw-rw---- 1 root disk 7, 99 Aug 14 20:23 /dev/loop99
brw-rw---- 1 root disk 259, 2 Aug 14 20:23 /dev/loop99p1
brw-rw---- 1 root disk 259, 3 Aug 14 20:23 /dev/loop99p2
crw------T 1 root root 10, 237 Aug 14 20:22 /dev/loop-control
Cc: Karel Zak <kzak@redhat.com>
Cc: Davidlohr Bueso <dave@gnu.org>
Acked-By: Tejun Heo <tj@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This commit adds discard support for loop devices. Discard is usually
supported by SSD and thinly provisioned devices as a method for
reclaiming unused space. This is no different than trying to reclaim
back space which is not used by the file system on the image, but it
still occupies space on the host file system.
We can do the reclamation on file system which does support hole
punching. So when discard request gets to the loop driver we can
translate that to punch a hole to the underlying file, hence reclaim
the free space.
This is very useful for trimming down the size of the image to only what
is really used by the file system on that image. Fstrim may be used for
that purpose.
It has been tested on ext4, xfs and btrfs with the image file systems
ext4, ext3, xfs and btrfs. ext4, or ext6 image on ext4 file system has
some problems but it seems that ext4 punch hole implementation is
somewhat flawed and it is unrelated to this commit.
Also this is a very good method of validating file systems punch hole
implementation.
Note that when encryption is used, discard support is disabled, because
using it might leak some information useful for possible attacker.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
LOOP_CLR_FD takes lo->lo_ctl_mutex and tries to remove the loop sysfs
files. Sysfs calls show() and waits for lo->lo_ctl_mutex. LOOP_CLR_FD
waits for show() to finish to remove the sysfs file.
cat /sys/class/block/loop0/loop/backing_file
mutex_lock_nested+0x176/0x350
? loop_attr_do_show_backing_file+0x2f/0xd0 [loop]
? loop_attr_do_show_backing_file+0x2f/0xd0 [loop]
loop_attr_do_show_backing_file+0x2f/0xd0 [loop]
dev_attr_show+0x1b/0x60
? sysfs_read_file+0x86/0x1a0
? __get_free_pages+0x12/0x50
sysfs_read_file+0xaf/0x1a0
ioctl(LOOP_CLR_FD):
wait_for_common+0x12c/0x180
? try_to_wake_up+0x2a0/0x2a0
wait_for_completion+0x18/0x20
sysfs_deactivate+0x178/0x180
? sysfs_addrm_finish+0x43/0x70
? sysfs_addrm_start+0x1d/0x20
sysfs_addrm_finish+0x43/0x70
sysfs_hash_and_remove+0x85/0xa0
sysfs_remove_group+0x59/0x100
loop_clr_fd+0x1dc/0x3f0 [loop]
lo_ioctl+0x223/0x7a0 [loop]
Instead of taking the lo_ctl_mutex from sysfs code, take the inner
lo->lo_lock, to protect the access to the backing_file data.
Thanks to Tejun for help debugging and finding a solution.
Cc: Milan Broz <mbroz@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Instead of unconditionally creating a fixed number of dead loop
devices which need to be investigated by storage handling services,
even when they are never used, we allow distros start with 0
loop devices and have losetup(8) and similar switch to the dynamic
/dev/loop-control interface instead of searching /dev/loop%i for free
devices.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Loop devices today have a fixed pre-allocated number of usually 8.
The number can only be changed at module init time. To find a free
device to use, /dev/loop%i needs to be scanned, and all devices need
to be opened until a free one is possibly found.
This adds a new /dev/loop-control device node, that allows to
dynamically find or allocate a free device, and to add and remove loop
devices from the running system:
LOOP_CTL_ADD adds a specific device. Arg is the number
of the device. It returns the device i or a negative
error code.
LOOP_CTL_REMOVE removes a specific device, Arg is the
number the device. It returns the device i or a negative
error code.
LOOP_CTL_GET_FREE finds the next unbound device or allocates
a new one. No arg is given. It returns the device i or a
negative error code.
The loop kernel module gets automatically loaded when
/dev/loop-control is accessed the first time. The alias
specified in the module, instructs udev to create this
'dead' device node, even when the module is not loaded.
Example:
cfd = open("/dev/loop-control", O_RDWR);
# add a new specific loop device
err = ioctl(cfd, LOOP_CTL_ADD, devnr);
# remove a specific loop device
err = ioctl(cfd, LOOP_CTL_REMOVE, devnr);
# find or allocate a free loop device to use
devnr = ioctl(cfd, LOOP_CTL_GET_FREE);
sprintf(loopname, "/dev/loop%i", devnr);
ffd = open("backing-file", O_RDWR);
lfd = open(loopname, O_RDWR);
err = ioctl(lfd, LOOP_SET_FD, ffd);
Cc: Tejun Heo <tj@kernel.org>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Replace the linked list, that keeps track of allocated devices, with an
idr index to allow a more efficient lookup of devices.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>