Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
Pull NFS client updates from Trond Myklebust:
"Highlights incluse:
Features:
- Removing the forced serialisation of open()/close() calls in
NFSv4.x (x>0) makes for a significant performance improvement in
metadata intensive workloads.
- Full support for the pNFS "flexible files" layout type
- Further RPC/RDMA client improvements from Chuck
Bugfixes:
- Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
- Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
- Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called
as part of the namespace cleanup,
- Stable fix: Ensure we reference the inode for return-on-close in
delegreturn
- Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to
the same source address/port combination during a disconnect/
reconnect event. This is a requirement imposed by most NFSv3
server duplicate reply cache implementations.
Optimisations:
- Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT
Other:
- Add Anna Schumaker as co-maintainer for the NFS client"
* tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
SUNRPC: Cleanup to remove xs_tcp_close()
pnfs: delete an unintended goto
pnfs/flexfiles: Do not dprintk after the free
SUNRPC: Fix stupid typo in xs_sock_set_reuseport
SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
SUNRPC: Handle connection reset more efficiently.
SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
SUNRPC: Remove TCP socket linger code
SUNRPC: Remove TCP client connection reset hack
SUNRPC: TCP/UDP always close the old socket before reconnecting
SUNRPC: Add helpers to prevent socket create from racing
SUNRPC: Ensure xs_reset_transport() resets the close connection flags
SUNRPC: Do not clear the source port in xs_reset_transport
SUNRPC: Handle EADDRINUSE on connect
SUNRPC: Set SO_REUSEPORT socket option for TCP connections
NFSv4.1: Fix pnfs_put_lseg races
NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
...
This patch adds mirrored write support to the pgio layer. The default
is to use one mirror, but pgio callers may define callbacks to change
this to any value up to the (arbitrarily selected) limit of 16.
The basic idea is to break out members of nfs_pageio_descriptor that cannot
be shared between mirrored DSes and put them in a new structure.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Pass ds_commit_idx through the nfs commit path. It's used to select
the commit bucket when using pnfs and is ignored when not using pnfs.
Several functions had to be changed: nfs_retry_commit,
nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout
driver .mark_request_commit functions.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case. Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can now add a dedicated spinlock without expanding struct inode.
Change to using that to protect the various i_flctx lists.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.
This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This WARN_ON_ONCE was supposed to catch reference counting bugs, but can
trigger in inappropriate situations.
This was reproducible using NFSv2 on an architecture with 64K pages -- we
verified that it was not a reference counting bug and the warning was
safe to ignore.
Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive. I have
seen delays of as much as 208 seconds.
To keep the delay to one second, mark the bdi as write-congested
if the commit didn't finished. Once it does finish, the
write-congested flag will be cleared by nfs_commit_release_pages().
With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds. With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.
If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.
nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().
This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second. If the commit doesn't complete
in this time, nfs_release_page() will fail. This means it might now
fail in some cases where it wouldn't before. These cases are only
when 'gfp' includes '__GFP_WAIT'.
nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
- page_cache_pipe_buf_steal() in splice.c
- shrink_page_list() in vmscan.c
- invalidate_inode_pages2_range() in truncate.c
The first two handle failure quite safely. The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.
So aborting if the commit takes longer than 1 second is perfectly safe.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
commit b31268ac79
FS: Use stable writes when not doing a bulk flush
was a bit heavy handed.
The particular problem that lead to this patch was that
small writes to an O_SYNC file we being written as UNSTABLE writes
followed by a commit.
This is appropriate for large writes (which require multiple NFS
requests) but for small writes (single NFS request), using
NFS_FILE_SYNC is more efficient.
So that patch causes the code to select between the two methods
depending on how many nfs requests get generated.
Unfortunately this ends up applying to non O_SYNC writes as well.
In particular if you memory-map a file and update random pages, then
when they are eventually written out by writeback they will go as
NFS_FILE_SYNC. This is inefficient and slows down the application.
So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL.
With this patch:
O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE
followed by COMMIT for multiple requests
Writing immediately before close of fsync follow the same pattern.
Non-O_SYNC writes without an fsync of close eventually get flushed
out as UNSTABLE and a commit follows eventually as appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The goal is to create a generic NFS module with code that does not
depend on what versions of NFS are enabled.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
gcc reports:
linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’:
linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized]
list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) {
^
linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here
struct nfs_commit_info cinfo;
Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This handles the 'nonblock=false' case in nfs_lock_and_join_requests.
If the group is already locked and blocking is allowed, drop the inode lock
and wait for the group lock to be cleared before trying it all again.
This should fix warnings found in peterz's tree (sched/wait branch), where
might_sleep() checks are added to wait.[ch].
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Flip the meaning of the second argument from 'wait' to 'nonblock' to
match related functions. Update all five calls to reflect this change.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>