Commit Graph

3580 Commits

Author SHA1 Message Date
Christoph Hellwig 08a899d5d9 nfs: setattr can only change regular file sizes
The VFS never calls setattr with ATTR_SIZE on anything but regular
files.  Remove the if check and turn it into an assert similar to
what some other file systems do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 20d655d619 pnfs/blocklayout: use the device id cache
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 30ff0603ca pnfs: add a nfs4_get_deviceid helper
This will be used by the block layout driver when splitting extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 9dd2fcd32f pnfs: add a common GETDEVICELIST implementation
At a simple helper to issue a GETDEVICELIST operation and pre-load
the device id cache based on the result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 661373b13d pnfs: factor GETDEVICEINFO implementations
Add support to the common pNFS core to issue GETDEVICEINFO calls on
a device ID cache miss.  The code is taken from the well debugged
file layout implementation and calls out to the layoutdriver through
a new alloc_deviceid_node method.  The calling conventions for
nfs4_find_get_deviceid are changed so that all information needed to
send a GETDEVICEINFO request is passed to the common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 848746bd24 pnfs/blocklayout: return layouts on setattr
This speads up truncate-heavy workloads like fsx by multiple orders of
magnitude.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 71d5b76302 pnfs/blocklayout: implement the return_range method
This allows removing extents from the extent tree especially on truncate
operations, and thus fixing reads from truncated and re-extended that
previously returned stale data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 8067253c8c pnfs/blocklayout: rewrite extent tracking
Currently the block layout driver tracks extents in three separate
data structures:

 - the two list of pnfs_block_extent structures returned by the server
 - the list of sectors that were in invalid state but have been written to
 - a list of pnfs_block_short_extent structures for LAYOUTCOMMIT

All of these share the property that they are not only highly inefficient
data structures, but also that operations on them are even more inefficient
than nessecary.

In addition there are various implementation defects like:

 - using an int to track sectors, causing corruption for large offsets
 - incorrect normalization of page or block granularity ranges
 - insufficient error handling
 - incorrect synchronization as extents can be modified while they are in
   use

This patch replace all three data with a single unified rbtree structure
tracking all extents, as well as their in-memory state, although we still
need to instance for read-only and read-write extent due to the arcane
client side COW feature in the block layouts spec.

To fix the problem of extent possibly being modified while in use we make
sure to return a copy of the extent for use in the write path - the
extent can only be invalidated by a layout recall or return which has
to wait until the I/O operations finished due to refcounts on the layout
segment.

The new extent tree work similar to the schemes used by block based
filesystems like XFS or ext4.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 8c792ea940 pnfs/blocklayout: don't set pages uptodate
The core nfs code handles setting pages uptodate on reads, no need to mess
with the pageflags outselves.  Also remove a debug function to dump page
flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 3a6fd1f004 pnfs/blocklayout: remove read-modify-write handling in bl_write_pagelist
Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write
handling to core nfs code, and remove a huge chunk of deadlock prone
mess from the block layout writeback path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig c88953d87f pnfs: add return_range method
If a layout driver keeps per-inode state outside of the layout segments it
needs to be notified of any layout returns or recalls on an inode, and not
just about the freeing of layout segments.  Add a method to acomplish this,
which will allow the block layout driver to handle the case of truncated
and re-expanded files properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 612aa983a0 pnfs: add flag to force read-modify-write in ->write_begin
Like all block based filesystems, the pNFS block layout driver can't read
or write at a byte granularity and thus has to perform read-modify-write
cycles on writes smaller than this granularity.

Add a flag so that the core NFS code always reads a whole page when
starting a smaller write, so that we can do it in the place where the VFS
expects it instead of doing in very deadlock prone way in the writeback
handler.

Note that in theory we could do less than page size reads here for disks
that have a smaller sector size which are served by a server with a smaller
pnfs block size.  But so far that doesn't seem like a worthwhile
optimization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 7c5d187581 pnfs: force a layout commit when encountering busy segments during recall
Expedite layout recall processing by forcing a layout commit when
we see busy segments.  Without it the layout recall might have to wait
until the VM decided to start writeback for the file, which can introduce
long delays.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Trond Myklebust 3a3908c8b0 NFS: Fix a compile warning when !(CONFIG_NFS_V3 || CONFIG_NFS_V4)
gcc reports:

linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’:
linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) {
                                                                  ^
linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here
  struct nfs_commit_info cinfo;

Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 921b81a8cd pnfs/blocklayout: correctly decrement extent length
When we do non-page sized reads we can underflow the extent_length variable
and read incorrect data.  Fix the extent_length calculation and change to
defensive <= checks for the extent length in the read and write path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig be98fd0ac3 pnfs/blocklayout: plug block queues
Make sure the block queue is plugged when performing pNFS blocklayout I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 72c5e59f63 pnfs/blocklayout: improve GETDEVICEINFO error reporting
Tell userspace what stage of GETDEVICEINFO failed so that there is a chance
to debug it, especially with the userspace daemon clusterf***k in the block
layout driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig e3aaf7f2b8 pnfs/blocklayout: reject pnfs blocksize larger than page size
The Linux VM subsystem can't support block sizes larger than page size
for block based filesystems very well.  While this can be hacked around
to some extent for simple filesystems the read-modify-write cycles
required for pnfs block invalid extents are extremly deadlock prone
when operating on multiple pages.  Reject this case early on instead
of pretending to support it (badly).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 5f919c9f10 pnfs: allow splicing pre-encoded pages into the layoutcommit args
Currently there is no XDR buffer space allocated for the per-layout driver
layoutcommit payload, which leads to server buffer overflows in the
blocklayout driver even under simple workloads.  As we can't do per-layout
sizes for XDR operations we'll have to splice a previously encoded list
of pages into the XDR stream, similar to how we handle ACL buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 47abadefad pnfs: avoid using stale stateids after layoutreturn
After we issued a layoutreturn operations the may free the layout stateid
and will thus cause bad stateid error when the client uses it again.

We currently try to avoid this case by chosing the open stateid if not
lsegs are present for this inode.  But various places can hold refererence
on lsegs and thus cause the list not to be empty shortly after a layout
return.  Add an explicit flag to mark the current layout stateid invalid
and force usage of the openstateid after we did a full file layoutreturn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig defb846088 pnfs: retry after a bad stateid error from layoutget
Currently we fall through to nfs4_async_handle_error when we get
a bad stateid error back from layoutget.  nfs4_async_handle_error
with a NULL state argument will never retry the operations but return
the error to higher layer, causing an avoiable fallback to MDS I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 362f74745c pnfs: don't check sequence on new stateids in layoutget
When layoutget returns an entirely new layout stateid it should not
check the generation counter as the new stateid will start with a new
counter entirely unrelated to old one.

The current behavior causes constant layoutget failures against a block
server which allocates a new stateid after an recall that removed all
outstanding layouts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 1013df6115 pnfs: do not pass uninitialized lsegs to ->free_lseg
Ensure the lsegs are initialized early so that we don't pass an unitialized
one back to ->free_lseg during error processing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 2e11f8296d nfs: cap request size to fit a kmalloced page array
pNFS servers may return arbitrarily large layouts.  Trim back the I/O size
to one that we can at least allocate the page array for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Peng Tao bc7d4b8fd0 nfs/filelayout: set layoutcommit depending on write verifier
Following http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751
Don't set layoutcommit for commit_through_mds case.
For FILE_SYNC writes, don't set layoutcommit.
For DATA_SYNC wirtes, set layout commit right after wirtes done.
For UNSTABLE writes, set layout commit when commit done.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00