Commit Graph

58 Commits

Author SHA1 Message Date
Jeff Layton 3e80dbcda7 nfsd: remove recurring workqueue job to clean DRC
We have a shrinker, we clean out the cache when nfsd is shut down, and
prune the chains on each request. A recurring workqueue job seems like
unnecessary overhead. Just remove it.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-11-10 09:25:51 -05:00
Julia Lawall e79017ddce nfsd: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL) {
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
  x = NULL;
-}
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-12 17:31:04 -04:00
Kinglong Mee a68465c9cb NFSD: Error out when register_shrinker() fail
If register_shrinker() failed, nfsd will cause a NULL pointer access as,

[ 9250.875465] nfsd: last server has exited, flushing export cache
[ 9251.427270] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 9251.427393] IP: [<ffffffff8136fc29>] __list_del_entry+0x29/0xd0
[ 9251.427579] PGD 13e4d067 PUD 13e4c067 PMD 0
[ 9251.427633] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 9251.427706] Modules linked in: ip6t_rpfilter ip6t_REJECT bnep bluetooth xt_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw btrfs xfs microcode ppdev serio_raw pcspkr xor libcrc32c raid6_pq e1000 parport_pc parport i2c_piix4 i2c_core nfsd(OE-) auth_rpcgss nfs_acl lockd sunrpc(E) ata_generic pata_acpi
[ 9251.428240] CPU: 0 PID: 1557 Comm: rmmod Tainted: G           OE 3.16.0-rc2+ #22
[ 9251.428366] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 9251.428496] task: ffff880000849540 ti: ffff8800136f4000 task.ti: ffff8800136f4000
[ 9251.428593] RIP: 0010:[<ffffffff8136fc29>]  [<ffffffff8136fc29>] __list_del_entry+0x29/0xd0
[ 9251.428696] RSP: 0018:ffff8800136f7ea0  EFLAGS: 00010207
[ 9251.428751] RAX: 0000000000000000 RBX: ffffffffa0116d48 RCX: dead000000200200
[ 9251.428814] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0116d48
[ 9251.428876] RBP: ffff8800136f7ea0 R08: ffff8800136f4000 R09: 0000000000000001
[ 9251.428939] R10: 8080808080808080 R11: 0000000000000000 R12: ffffffffa011a5a0
[ 9251.429002] R13: 0000000000000800 R14: 0000000000000000 R15: 00000000018ac090
[ 9251.429064] FS:  00007fb9acef0740(0000) GS:ffff88003fa00000(0000) knlGS:0000000000000000
[ 9251.429164] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9251.429221] CR2: 0000000000000000 CR3: 0000000031a17000 CR4: 00000000001407f0
[ 9251.429306] Stack:
[ 9251.429410]  ffff8800136f7eb8 ffffffff8136fcdd ffffffffa0116d20 ffff8800136f7ed0
[ 9251.429511]  ffffffff8118a0f2 0000000000000000 ffff8800136f7ee0 ffffffffa00eb765
[ 9251.429610]  ffff8800136f7ef0 ffffffffa010e93c ffff8800136f7f78 ffffffff81104ac2
[ 9251.429709] Call Trace:
[ 9251.429755]  [<ffffffff8136fcdd>] list_del+0xd/0x30
[ 9251.429896]  [<ffffffff8118a0f2>] unregister_shrinker+0x22/0x40
[ 9251.430037]  [<ffffffffa00eb765>] nfsd_reply_cache_shutdown+0x15/0x90 [nfsd]
[ 9251.430106]  [<ffffffffa010e93c>] exit_nfsd+0x9/0x6cd [nfsd]
[ 9251.430192]  [<ffffffff81104ac2>] SyS_delete_module+0x162/0x200
[ 9251.430280]  [<ffffffff81013b69>] ? do_notify_resume+0x59/0x90
[ 9251.430395]  [<ffffffff816f2369>] system_call_fastpath+0x16/0x1b
[ 9251.430457] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
[ 9251.430691] RIP  [<ffffffff8136fc29>] __list_del_entry+0x29/0xd0
[ 9251.430755]  RSP <ffff8800136f7ea0>
[ 9251.430805] CR2: 0000000000000000
[ 9251.431033] ---[ end trace 080f3050d082b4ea ]---

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-03-20 12:44:00 -04:00
Jeff Layton 4d152e2c9a sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
In a later patch, we're going to need some atomic bit flags. Since that
field will need to be an unsigned long, we mitigate that space
consumption by migrating some other bitflags to the new field. Start
with the rq_secure flag.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:21:20 -05:00
Trond Myklebust ef9b16dc6d nfsd: Reorder nfsd_cache_match to check more powerful discriminators first
We would normally expect the xid and the checksum to be the best
discriminators. Check them before looking at the procedure number,
etc.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust 89a26b3d29 nfsd: split DRC global spinlock into per-bucket locks
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust 31e60f5222 nfsd: convert num_drc_entries to an atomic_t
...so we can remove the spinlocking around it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust 11acf6ef3b nfsd: Remove the cache_hash list
Now that the lru list is per-bucket, we don't need a second list for
searches.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust bedd4b61a4 nfsd: convert the lru list into a per-bucket thing
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust 7142b98d9f nfsd: Clean up drc cache in preparation for global spinlock elimination
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Jeff Layton b3d8d1284a nfsd: clean up sparse endianness warnings in nfscache.c
We currently hash the XID to determine a hash bucket to use for the
reply cache entry, which is fed into hash_32 without byte-swapping it.
Add __force to make sparse happy, and add some comments to explain
why.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton 1b19453d1c nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry
Currently, the DRC cache pruner will stop scanning the list when it
hits an entry that is RC_INPROG. It's possible however for a call to
take a *very* long time. In that case, we don't want it to block other
entries from being pruned if they are expired or we need to trim the
cache to get back under the limit.

Fix the DRC cache pruner to just ignore RC_INPROG entries.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-06 19:22:49 -04:00
Jeff Layton a0ef5e1968 nfsd: don't try to reuse an expired DRC entry off the list
Currently when we are processing a request, we try to scrape an expired
or over-limit entry off the list in preference to allocating a new one
from the slab.

This is unnecessarily complicated. Just use the slab layer.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-12-11 11:27:04 -05:00
Jeff Layton 781c2a5a5f nfsd: when reusing an existing repcache entry, unhash it first
The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.

The cache code doesn't unhash the entry that it's going to reuse
however, so it's possible for it end up designating an entry for reuse
and then subsequently freeing the same entry after it finds it.  This
leads it to a later use-after-free situation and usually some list
corruption warnings or an oops.

Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.

Cc: stable@vger.kernel.org # v3.10+
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: g. artim <gartim@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-12-10 20:34:44 -05:00
Dave Chinner 1ab6c4997e fs: convert fs shrinkers to new scan/count API
Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time.  For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.

I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e.  all the time under
memory pressure).

[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
[assorted fixes folded in]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Wei Yongjun c8c797f9fd nfsd: make symbol nfsd_reply_cache_shrinker static
symbol 'nfsd_reply_cache_shrinker' only used within this file. It should
be static.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-30 18:19:34 -04:00
Jeff Layton 0733c7ba1e nfsd: scale up the number of DRC hash buckets with cache size
We've now increased the size of the duplicate reply cache by quite a
bit, but the number of hash buckets has not changed. So, we've gone from
an average hash chain length of 16 in the old code to 4096 when the
cache is its largest. Change the code to scale out the number of buckets
with the max size of the cache.

At the same time, we also need to fix the hash function since the
existing one isn't really suitable when there are more than 256 buckets.
Move instead to use the stock hash_32 function for this. Testing on a
machine that had 2048 buckets showed that this gave a smaller
longest:average ratio than the existing hash function:

The formula here is longest hash bucket searched divided by average
number of entries per bucket at the time that we saw that longest
bucket:

    old hash: 68/(39258/2048) == 3.547404
    hash_32:  45/(33773/2048) == 2.728807

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:25 -04:00
Jeff Layton 98d821bda1 nfsd: keep stats on worst hash balancing seen so far
The typical case with the DRC is a cache miss, so if we keep track of
the max number of entries that we've ever walked over in a search, then
we should have a reasonable estimate of the longest hash chain that
we've ever seen.

With that, we'll also keep track of the total size of the cache when we
see the longest chain. In the case of a tie, we prefer to track the
smallest total cache size in order to properly gauge the worst-case
ratio of max vs. avg chain length.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:25 -04:00
Jeff Layton a2f999a37e nfsd: add new reply_cache_stats file in nfsdfs
For presenting statistics relating to duplicate reply cache.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:24 -04:00
Jeff Layton 6c6910cd4d nfsd: track memory utilization by the DRC
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:23 -04:00
Jeff Layton 9dc56143c2 nfsd: break out comparator into separate function
Break out the function that compares the rqstp and checksum against a
reply cache entry. While we're at it, track the efficacy of the checksum
over the NFS data by tracking the cases where we would have incorrectly
matched a DRC entry if we had not tracked it or the length.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:22 -04:00
Jeff Layton 0b9ea37f24 nfsd: eliminate one of the DRC cache searches
The most common case is to do a search of the cache, followed by an
insert. In the case where we have to allocate an entry off the slab,
then we end up having to redo the search, which is wasteful.

Better optimize the code for the common case by eliminating the initial
search of the cache and always preallocating an entry. In the case of a
cache hit, we'll end up just freeing that entry but that's preferable to
an extra search.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:22 -04:00
Jeff Layton ac534ff2d5 nfsd: fix startup order in nfsd_reply_cache_init
If we end up doing "goto out_nomem" in this function, we'll call
nfsd_reply_cache_shutdown. That will attempt to walk the LRU list and
free entries, but that list may not be initialized yet if the server is
starting up for the first time. It's also possible for the shrinker to
kick in before we've initialized the LRU list.

Rearrange the initialization so that the LRU list_head and cache size
are initialized before doing any of the allocations that might fail.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-03-18 17:21:30 -04:00
Jeff Layton a517b608fa nfsd: only unhash DRC entries that are in the hashtable
It's not safe to call hlist_del() on a newly initialized hlist_node.
That leads to a NULL pointer dereference. Only do that if the entry
is hashed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-03-18 14:58:32 -04:00
Linus Torvalds b6669737d3 Merge branch 'for-3.9' of git://linux-nfs.org/~bfields/linux
Pull nfsd changes from J Bruce Fields:
 "Miscellaneous bugfixes, plus:

   - An overhaul of the DRC cache by Jeff Layton.  The main effect is
     just to make it larger.  This decreases the chances of intermittent
     errors especially in the UDP case.  But we'll need to watch for any
     reports of performance regressions.

   - Containerized nfsd: with some limitations, we now support
     per-container nfs-service, thanks to extensive work from Stanislav
     Kinsbursky over the last year."

Some notes about conflicts, since there were *two* non-data semantic
conflicts here:

 - idr_remove_all() had been added by a memory leak fix, but has since
   become deprecated since idr_destroy() does it for us now.

 - xs_local_connect() had been added by this branch to make AF_LOCAL
   connections be synchronous, but in the meantime Trond had changed the
   calling convention in order to avoid a RCU dereference.

There were a couple of more obvious actual source-level conflicts due to
the hlist traversal changes and one just due to code changes next to
each other, but those were trivial.

* 'for-3.9' of git://linux-nfs.org/~bfields/linux: (49 commits)
  SUNRPC: make AF_LOCAL connect synchronous
  nfsd: fix compiler warning about ambiguous types in nfsd_cache_csum
  svcrpc: fix rpc server shutdown races
  svcrpc: make svc_age_temp_xprts enqueue under sv_lock
  lockd: nlmclnt_reclaim(): avoid stack overflow
  nfsd: enable NFSv4 state in containers
  nfsd: disable usermode helper client tracker in container
  nfsd: use proper net while reading "exports" file
  nfsd: containerize NFSd filesystem
  nfsd: fix comments on nfsd_cache_lookup
  SUNRPC: move cache_detail->cache_request callback call to cache_read()
  SUNRPC: remove "cache_request" argument in sunrpc_cache_pipe_upcall() function
  SUNRPC: rework cache upcall logic
  SUNRPC: introduce cache_detail->cache_request callback
  NFS: simplify and clean cache library
  NFS: use SUNRPC cache creation and destruction helper for DNS cache
  nfsd4: free_stid can be static
  nfsd: keep a checksum of the first 256 bytes of request
  sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer
  sunrpc: fix comment in struct xdr_buf definition
  ...
2013-02-28 18:02:55 -08:00