Raise the default max request size for nbd to 128KB (from 127KB) to get it
4KB aligned. This patch also allows the max request size to be increased
(via /sys/block/nbd<x>/queue/max_sectors_kb) to 32MB.
The patch makes nbd network traffic more efficient by:
- reducing request fragmentation (4KB alignment)
- reducing the number of requests (fewer round trips, less network overhead)
Especially in high latency networks, larger request size can make a dramatic
Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Michal Belczyk <belczyk@bsd.krakow.pl>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph fix from Sage Weil:
"It's a simple fix for a hard to hit race, but low-risk and clearly
correct"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: do a safe list traversal in rbd_img_request_submit()
It's possible that the reference to the object request dropped
inside the loop in rbd_img_request_submit() will be the last
one, in which case the content of the object pointer can't be
trusted.
Use a safe form of the object request list traversal to avoid
problems.
This resolves:
http://tracker.ceph.com/issues/4705
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull block fixes from Jens Axboe:
"I've got a few smaller fixes queued up for 3.9 that should go in. The
major one is the loop regression, the others are nice fixes on their
own though. It contains:
- Fix for unitialized var in the block sysfs code, courtesy of Arnd
and gcc-4.8.
- Two fixes for mtip32xx, fixing probe and command timeout. Also a
debug measure that could have waited for 3.10, but it's driver
only, so I let it slip in.
- Revert the loop partition cleanup fix, it could cause a deadlock on
auto-teardown as part of umount. The fix is clear, but at this
point we just want to revert it and get a real fix in for 3.10."
* tag 'for-linus-20130409' of git://git.kernel.dk/linux-block:
Revert "loop: cleanup partitions when detaching loop device"
mtip32xx: fix two smatch warnings
mtip32xx: Add debugfs entry device_status
mtip32xx: return 0 from pci probe in case of rebuild
mtip32xx: recovery from command timeout
block: avoid using uninitialized value in from queue_var_store
This reverts commit 8761a3dc1f.
There are situations where the destruction path is called
with the bdev->bd_mutex already held, which then deadlocks in
loop_clr_fd(). The normal partition cleanup does a trylock()
on the mutex, but it'd be nice to have a more bullet proof
method in loop. So punt this more involved fix to the next
merge window, and just back out this buggy fix for now.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan reports:
New smatch warnings:
drivers/block/mtip32xx/mtip32xx.c:2728 show_device_status() warn: variable dereferenced before check 'dd' (see line 2727)
drivers/block/mtip32xx/mtip32xx.c:2758 show_device_status() warn: variable dereferenced before check 'dd' (see line 2757)
which are checking if dd == NULL, in a list_for_each_entry() type loop.
Get rid of the check, dd can never be NULL here.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds a new debugfs entry 'device_status' in
/sys/kernel/debug/rssd. The value of this entry shows
devices online and those in the process of removing.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix to return 0 from pci probe in case of rebuild. If not, pci consider
probe has failed, and crash during rmmod.
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To recover from command timeouts, reset the device. In addition
to that improved timeout handling of PIO commands.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct block_device lifecycle is defined by its inode (see fs/block_dev.c) -
block_device allocated first time we access /dev/loopXX and deallocated on
bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile"
we want that block_device stay alive until we destroy the loop device
with "losetup -d".
But because we do not hold /dev/loopXX inode its counter goes 0, and
inode/bdev can be destroyed at any moment. Usually it happens at memory
pressure or when user drops inode cache (like in the test below). When later in
loop_clr_fd() we want to use bdev we have use-after-free error with following
stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
bd_set_size+0x10/0xa0
loop_clr_fd+0x1f8/0x420 [loop]
lo_ioctl+0x200/0x7e0 [loop]
lo_compat_ioctl+0x47/0xe0 [loop]
compat_blkdev_ioctl+0x341/0x1290
do_filp_open+0x42/0xa0
compat_sys_ioctl+0xc1/0xf20
do_sys_open+0x16e/0x1d0
sysenter_dispatch+0x7/0x1a
To prevent use-after-free we need to grab the device in loop_set_fd()
and put it later in loop_clr_fd().
The issue is reprodusible on current Linus head and v3.3. Here is the test:
dd if=/dev/zero of=loop.file bs=1M count=1
while [ true ]; do
losetup /dev/loop0 loop.file
echo 2 > /proc/sys/vm/drop_caches
losetup -d /dev/loop0
done
[ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every
time we call loop_set_fd() we check that loop_device->lo_state is
Lo_unbound and set it to Lo_bound If somebody will try to set_fd again
it will get EBUSY. And if we try to loop_clr_fd() on unbound loop
device we'll get ENXIO.
loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
loop_device->lo_ctl_mutex. ]
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) sadb_msg prepared for IPSEC userspace forgets to initialize the
satype field, fix from Nicolas Dichtel.
2) Fix mac80211 synchronization during station removal, from Johannes
Berg.
3) Fix IPSEC sequence number notifications when they wrap, from Steffen
Klassert.
4) Fix cfg80211 wdev tracing crashes when add_virtual_intf() returns an
error pointer, from Johannes Berg.
5) In mac80211, don't call into the channel context code with the
interface list mutex held. From Johannes Berg.
6) In mac80211, if we don't actually associate, do not restart the STA
timer, otherwise we can crash. From Ben Greear.
7) Missing dma_mapping_error() check in e1000, ixgb, and e1000e. From
Christoph Paasch.
8) Fix sja1000 driver defines to not conflict with SH port, from Marc
Kleine-Budde.
9) Don't call il4965_rs_use_green with a NULL station, from Colin Ian
King.
10) Suspend/Resume in the FEC driver fail because the buffer descriptors
are not initialized at all the moments in which they should. Fix
from Frank Li.
11) cpsw and davinci_emac drivers both use the wrong interface to
restart a stopped TX queue. Use netif_wake_queue not
netif_start_queue, the latter is for initialization/bringup not
active management of the queue. From Mugunthan V N.
12) Fix regression in rate calculations done by
psched_ratecfg_precompute(), missing u64 type promotion. From
Sergey Popovich.
13) Fix length overflow in tg3 VPD parsing, from Kees Cook.
14) AOE driver fails to allocate enough headroom, resulting in crashes.
Fix from Eric Dumazet.
15) RX overflow happens too quickly in sky2 driver because pause packet
thresholds are not programmed correctly. From Mirko Lindner.
16) Bonding driver manages arp_interval and miimon settings incorrectly,
disabling one unintentionally disables both. Fix from Nikolay
Aleksandrov.
17) smsc75xx drivers don't program the RX mac properly for jumbo frames.
Fix from Steve Glendinning.
18) Fix off-by-one in Codel packet scheduler. From Vijay Subramanian.
19) Fix packet corruption in atl1c by disabling MSI support, from Hannes
Frederic Sowa.
20) netdev_rx_handler_unregister() needs a synchronize_net() to fix
crashes in bonding driver unload stress tests. From Eric Dumazet.
21) rxlen field of ks8851 RX packet descriptors not interpreted
correctly (it is 12 bits not 16 bits, so needs to be masked after
shifting the 32-bit value down 16 bits). Fix from Max Nekludov.
22) Fix missed RX/TX enable in sh_eth driver due to mishandling of link
change indications. From Sergei Shtylyov.
23) Fix crashes during spurious ECI interrupts in sh_eth driver, also
from Sergei Shtylyov.
24) dm9000 driver initialization is done wrong for revision B devices
with DSP PHY, from Joseph CHANG.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits)
DM9000B: driver initialization upgrade
sh_eth: make 'link' field of 'struct sh_eth_private' *int*
sh_eth: workaround for spurious ECI interrupt
sh_eth: fix handling of no LINK signal
ks8851: Fix interpretation of rxlen field.
net: add a synchronize_net() in netdev_rx_handler_unregister()
MAINTAINERS: Update netxen_nic maintainers list
atl1e: drop pci-msi support because of packet corruption
net: fq_codel: Fix off-by-one error
net: calxedaxgmac: Wake-on-LAN fixes
net: calxedaxgmac: fix rx ring handling when OOM
net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb'
smsc75xx: fix jumbo frame support
net: fix the use of this_cpu_ptr
bonding: fix disabling of arp_interval and miimon
ipv6: don't accept node local multicast traffic from the wire
sky2: Threshold for Pause Packet is set wrong
sky2: Receive Overflows not counted
aoe: reserve enough headroom on skbs
line up comment for ndo_bridge_getlink
...
Pull block fixes from Jens Axboe:
"Alright, this time from 10K up in the air.
Collection of fixes that have been queued up since the merge window
opened, hence postponed until later in the cycle. The pull request
contains:
- A bunch of fixes for the xen blk front/back driver.
- A round of fixes for the new IBM RamSan driver, fixing various
nasty issues.
- Fixes for multiple drives from Wei Yongjun, bad handling of return
values and wrong pointer math.
- A fix for loop properly killing partitions when being detached."
* tag 'for-linus-20130331' of git://git.kernel.dk/linux-block: (25 commits)
mg_disk: fix error return code in mg_probe()
rsxx: remove unused variable
rsxx: enable error return of rsxx_eeh_save_issued_dmas()
block: removes dynamic allocation on stack
Block: blk-flush: Fixed indent code style
cciss: fix invalid use of sizeof in cciss_find_cfgtables()
loop: cleanup partitions when detaching loop device
loop: fix error return code in loop_add()
mtip32xx: fix error return code in mtip_pci_probe()
xen-blkfront: remove frame list from blk_shadow
xen-blkfront: pre-allocate pages for requests
xen-blkback: don't store dev_bus_addr
xen-blkfront: switch from llist to list
xen-blkback: fix foreach_grant_safe to handle empty lists
xen-blkfront: replace kmalloc and then memcpy with kmemdup
xen-blkback: fix dispatch_rw_block_io() error path
rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas()
Adding in EEH support to the IBM FlashSystem 70/80 device driver
block: IBM RamSan 70/80 error message bug fix.
block: IBM RamSan 70/80 branding changes.
...
Pull ceph fix from Sage Weil:
"This fixes a regression introduced during the last merge window when
mapping non-existent images."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: don't zero-fill non-image object requests
A result of ENOENT from a read request for an object that's part of
an rbd image indicates that there is a hole in that portion of the
image. Similarly, a short read for such an object indicates that
the remainder of the read should be interpreted a full read with
zeros filling out the end of the request.
This behavior is not correct for objects that are not backing rbd
image data. Currently rbd_img_obj_request_callback() assumes it
should be done for all objects.
Change rbd_img_obj_request_callback() so it only does this zeroing
for image objects. Encapsulate that special handling in its own
function. Add an assertion that the image object request is a bio
request, since we assume that (and we currently don't support any
other types).
This resolves a problem identified here:
http://tracker.ceph.com/issues/4559
The regression was introduced by bf0d5f503d.
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
Some network drivers use a non default hard_header_len
Transmitted skb should take into account dev->hard_header_len, or risk
crashes or expensive reallocations.
In the case of aoe, lets reserve MAX_HEADER bytes.
David reported a crash in defxx driver, solved by this patch.
Reported-by: David Oostdyk <daveo@ll.mit.edu>
Tested-by: David Oostdyk <daveo@ll.mit.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit d8d595df introduced a bug where we did not check for a NULL
return from kmalloc(). Make rsxx_eeh_save_issued_dmas() return an
error for that case, and make the callers handle that.
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull NVMe driver update from Matthew Wilcox:
"These patches have mostly been baking for a few months; sorry I didn't
get them in during the merge window. They're all bug fixes, except
for the addition of the SMART log and the addition to MAINTAINERS."
* git://git.infradead.org/users/willy/linux-nvme:
NVMe: Add namespaces with no LBA range feature
MAINTAINERS: Add entry for the NVMe driver
NVMe: Initialize iod nents to 0
NVMe: Define SMART log
NVMe: Add result to nvme_get_features
NVMe: Set result from user admin command
NVMe: End queued bio requests when freeing queue
NVMe: Free cmdid on nvme_submit_bio error
The LBA Range Type feature is optional in the NVMe specification,
so we should continue with adding namespaces for controllers that do
not implement this feature.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Any partitions added by user space to the loop device were being
left in place after detaching the loop device. This was because
the detach path issued a BLKRRPART to clean up partitions if
LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
scanned on attach. Replace this BLKRRPART with code that
unconditionally cleans up partitions on detach instead.
Signed-off-by: Phillip Susi <psusi@ubuntu.com>
Modified by Jens to export delete_partition().
Signed-off-by: Jens Axboe <axboe@kernel.dk>