Pull block driver updates from Jens Axboe:
"On top of the core pull request, this is the drivers pull request for
this merge window. This contains:
- Switch drivers to the new write back cache API, and kill off the
flush flags. From me.
- Kill the discard support for the STEC pci-e flash driver. It's
trivially broken, and apparently unmaintained, so it's safer to
just remove it. From Jeff Moyer.
- A set of lightnvm updates from the usual suspects (Matias/Javier,
and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
Tao.
- A set of updates for NVMe:
- Turn the controller state management into a proper state
machine. From Christoph.
- Shuffling of code in preparation for NVMe-over-fabrics, also
from Christoph.
- Cleanup of the command prep part from Ming Lin.
- Rewrite of the discard support from Ming Lin.
- Deadlock fix for namespace removal from Ming Lin.
- Use the now exported blk-mq tag helper for IO termination.
From Sagi.
- Various little fixes from Christoph, Guilherme, Keith, Ming
Lin, Wang Sheng-Hui.
- Convert mtip32xx to use the now exported blk-mq tag iter function,
from Keith"
* 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
lightnvm: reserved space calculation incorrect
lightnvm: rename nr_pages to nr_ppas on nvm_rq
lightnvm: add is_cached entry to struct ppa_addr
lightnvm: expose gennvm_mark_blk to targets
lightnvm: remove mgt targets on mgt removal
lightnvm: pass dma address to hardware rather than pointer
lightnvm: do not assume sequential lun alloc.
nvme/lightnvm: Log using the ctrl named device
lightnvm: rename dma helper functions
lightnvm: enable metadata to be sent to device
lightnvm: do not free unused metadata on rrpc
lightnvm: fix out of bound ppa lun id on bb tbl
lightnvm: refactor set_bb_tbl for accepting ppa list
lightnvm: move responsibility for bad blk mgmt to target
lightnvm: make nvm_set_rqd_ppalist() aware of vblks
lightnvm: remove struct factory_blks
lightnvm: refactor device ops->get_bb_tbl()
lightnvm: introduce nvm_for_each_lun_ppa() macro
lightnvm: refactor dev->online_target to global nvm_targets
lightnvm: rename nvm_targets to nvm_tgt_type
...
This is a version of blkdev_issue_discard which doesn't wait for
the I/O to complete, but instead allows the caller to submit
the final bio and/or chain it to others.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagig@grimberg.me>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't have any drivers left using it, so kill it off. Update
documentation to use the newer blk_queue_write_cache().
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
This will replace the (awkward) blk_queue_flush() interface that
drivers currently use to inform the block layer of write cache state
and capabilities.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We could kmalloc() the payload, so need the offset in page.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs updates from Al Viro:
- Preparations of parallel lookups (the remaining main obstacle is the
need to move security_d_instantiate(); once that becomes safe, the
rest will be a matter of rather short series local to fs/*.c
- preadv2/pwritev2 series from Christoph
- assorted fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
splice: handle zero nr_pages in splice_to_pipe()
vfs: show_vfsstat: do not ignore errors from show_devname method
dcache.c: new helper: __d_add()
don't bother with __d_instantiate(dentry, NULL)
untangle fsnotify_d_instantiate() a bit
uninline d_add()
replace d_add_unique() with saner primitive
quota: use lookup_one_len_unlocked()
cifs_get_root(): use lookup_one_len_unlocked()
nfs_lookup: don't bother with d_instantiate(dentry, NULL)
kill dentry_unhash()
ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
configfs: move d_rehash() into configfs_create() for regular files
ceph: don't bother with d_rehash() in splice_dentry()
namei: teach lookup_slow() to skip revalidate
namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
lookup_one_len_unlocked(): use lookup_dcache()
namei: simplify invalidation logics in lookup_dcache()
namei: change calling conventions for lookup_{fast,slow} and follow_managed()
...
Pull libata updates from Tejun Heo:
- ahci grew runtime power management support so that the controller can
be turned off if no devices are attached.
- sata_via isn't dead yet. It got hotplug support and more refined
workaround for certain WD drives.
- Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
conflicts in ahci PCI ID table.
* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ata: ahci_xgene: dereferencing uninitialized pointer in probe
AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
ata: sata_rcar: Use ARCH_RENESAS
sata_via: Implement hotplug for VT6421
sata_via: Apply WD workaround only when needed on VT6421
ahci: Add runtime PM support for the host controller
ahci: Add functions to manage runtime PM of AHCI ports
ahci: Convert driver to use modern PM hooks
ahci: Cache host controller version
scsi: Drop runtime PM usage count after host is added
scsi: Set request queue runtime PM status back to active on resume
block: Add blk_set_runtime_active()
ata: ahci_mvebu: add support for Armada 3700 variant
libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
libata: support AHCI on OCTEON platform
Now that applications need to explicitly ask for polling we can enable it
by default in blk-mq drivers. Note that this will only have an affect
on driver that supply a poll function, which currently only includes nvme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Driver private request types should not get the artifical cap for the
FS requests. This is important to use the full device capabilities
for internal command or NVMe pass through commands.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jeff Lien <Jeff.Lien@hgst.com>
Tested-by: Jeff Lien <Jeff.Lien@hgst.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Updated by me to use an explicit check for the one command type that
does support extended checking, instead of relying on the ordering
of the enum command values - as suggested by Keith.
Signed-off-by: Jens Axboe <axboe@fb.com>
In the following patch, the way for figuring out
the last bvec will be changed with a bit cost introduced,
so return immediately if the queue doesn't have virt
boundary limit. Actually most of devices have not
this limit.
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If block device is left runtime suspended during system suspend, resume
hook of the driver typically corrects runtime PM status of the device back
to "active" after it is resumed. However, this is not enough as queue's
runtime PM status is still "suspended". As long as it is in this state
blk_pm_peek_request() returns NULL and thus prevents new requests to be
processed.
Add new function blk_set_runtime_active() that can be used to force the
queue status back to "active" as needed.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a storage device rejects a WRITE SAME command we will disable write
same functionality for the device and return -EREMOTEIO to the block
layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
failing the path.
Yiwen Jiang discovered a small race where WRITE SAME requests issued
simultaneously would cause -EIO to be returned. This happened because
any requests being prepared after WRITE SAME had been disabled for the
device caused us to return BLKPREP_KILL. The latter caused the block
layer to return -EIO upon completion.
To overcome this we introduce BLKPREP_INVALID which indicates that this
is an invalid request for the device. blk_peek_request() is modified to
return -EREMOTEIO in that case.
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Suggested-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Reviewed-by: Ewan Milne <emilne@redhat.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull NVMe updates from Jens Axboe:
"Last branch for this series is the nvme changes. It's in a separate
branch to avoid splitting too much between core and NVMe changes,
since NVMe is still helping drive some blk-mq changes. That said, not
a huge amount of core changes in here. The grunt of the work is the
continued split of the code"
* 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits)
uapi: update install list after nvme.h rename
NVMe: Export NVMe attributes to sysfs group
NVMe: Shutdown controller only for power-off
NVMe: IO queue deletion re-write
NVMe: Remove queue freezing on resets
NVMe: Use a retryable error code on reset
NVMe: Fix admin queue ring wrap
nvme: make SG_IO support optional
nvme: fixes for NVME_IOCTL_IO_CMD on the char device
nvme: synchronize access to ctrl->namespaces
nvme: Move nvme_freeze/unfreeze_queues to nvme core
PCI/AER: include header file
NVMe: Export namespace attributes to sysfs
NVMe: Add pci error handlers
block: remove REQ_NO_TIMEOUT flag
nvme: merge iod and cmd_info
nvme: meta_sg doesn't have to be an array
nvme: properly free resources for cancelled command
nvme: simplify completion handling
nvme: special case AEN requests
...
Pull core block updates from Jens Axboe:
"We don't have a lot of core changes this time around, it's mostly in
drivers, which will come in a subsequent pull.
The cores changes include:
- blk-mq
- Prep patch from Christoph, changing blk_mq_alloc_request() to
take flags instead of just using gfp_t for sleep/nosleep.
- Doc patch from me, clarifying the difference between legacy
and blk-mq for timer usage.
- Fixes from Raghavendra for memory-less numa nodes, and a reuse
of CPU masks.
- Cleanup from Geliang Tang, using offset_in_page() instead of open
coding it.
- From Ilya, rename request_queue slab to it reflects what it holds,
and a fix for proper use of bdgrab/put.
- A real fix for the split across stripe boundaries from Keith. We
yanked a broken version of this from 4.4-rc final, this one works.
- From Mike Krinkin, emit a trace message when we split.
- From Wei Tang, two small cleanups, not explicitly clearing memory
that is already cleared"
* 'for-4.5/core' of git://git.kernel.dk/linux-block:
block: use bd{grab,put}() instead of open-coding
block: split bios to max possible length
block: add call to split trace point
blk-mq: Avoid memoryless numa node encoded in hctx numa_node
blk-mq: Reuse hardware context cpumask for tags
blk-mq: add a flags parameter to blk_mq_alloc_request
Revert "blk-flush: Queue through IO scheduler when flush not required"
block: clarify blk_add_timer() use case for blk-mq
bio: use offset_in_page macro
block: do not initialise statics to 0 or NULL
block: do not initialise globals to 0 or NULL
block: rename request_queue slab cache
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags. These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory". Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.
The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target). However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled. Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.
This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().
[willy@linux.intel.com: fix read() of a hole]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently only have an inline/sync helper to restart a stopped
queue. If drivers need an async version, they have to roll their
own. Add a generic helper instead.
Signed-off-by: Jens Axboe <axboe@fb.com>
Timer context is not very useful for drivers to perform any meaningful abort
action from. So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.
Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals. But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)
Contains a major update from Keith Bush:
"This patch removes synchronizing the timeout work so that the timer can
start a freeze on its own queue. The timer enters the queue, so timer
context can only start a freeze, but not wait for frozen."
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull SCSI fixes from James Bottomley:
"This is quite a bumper crop of fixes: three from Arnd correcting
various build issues in some configurations, a lock recursion in
qla2xxx. Two potentially exploitable issues in hpsa and mvsas, a
potential null deref in st, a revert of a bdi registration fix that
turned out to cause even more problems, a set of fixes to allow people
who only defined MPT2SAS to still work after the mpt2/mpt3sas merger
and a couple of fixes for issues turned up by the hyper-v storvsc
driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility
Revert "scsi: Fix a bdi reregistration race"
mpt3sas: Add dummy Kconfig option for backwards compatibility
Fix a memory leak in scsi_host_dev_release()
block/sd: Fix device-imposed transfer length limits
scsi_debug: fix prevent_allow+verify regressions
MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.
sd: Make discard granularity match logical block size when LBPRZ=1
scsi: hpsa: select CONFIG_SCSI_SAS_ATTR
scsi: advansys needs ISA dma api for ISA support
scsi_sysfs: protect against double execution of __scsi_remove_device()
st: fix potential null pointer dereference.
scsi: report 'INQUIRY result too short' once per host
advansys: fix big-endian builds
qla2xxx: Fix rwlock recursion
hpsa: logical vs bitwise AND typo
mvsas: don't allow negative timeouts
mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag