alloc_bucket_locks allocation pattern is quite unusual. We are
preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
vmalloc will respect the memory policy of the current process and so the
backing memory will get distributed over multiple nodes if the requester
is configured properly. At least that is the intention, in reality
rhastable is shrunk and expanded from a kernel worker so no mempolicy
can be assumed.
Let's just simplify the code and use kvmalloc helper, which is a
transparent way to use kmalloc with vmalloc fallback, if the caller is
allowed to block and use the flag otherwise.
Link: http://lkml.kernel.org/r/20170306103032.2540-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By using smaller datatypes this (rather large) struct shrinks considerably
(80 -> 48 bytes on x86_64).
As this is embedded in other structs, this also rerduces size of several
others, e.g. cls_fl_head or nft_hash.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 6d684e5469 ("rhashtable: Cap total number of entries
to 2^31") breaks rhashtable users that do not set max_size. This
is because when max_size is zero max_elems is also incorrectly set
to zero instead of 2^31.
This patch fixes it by only lowering max_elems when max_size is not
zero.
Fixes: 6d684e5469 ("rhashtable: Cap total number of entries to 2^31")
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When max_size is not set or if it set to a sufficiently large
value, the nelems counter can overflow. This would cause havoc
with the automatic shrinking as it would then attempt to fit a
huge number of entries into a tiny hash table.
This patch fixes this by adding max_elems to struct rhashtable
to cap the number of elements. This is set to 2^31 as nelems is
not a precise count. This is sufficiently smaller than UINT_MAX
that it should be safe.
When max_size is set max_elems will be lowered to at most twice
max_size as is the status quo.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
no users in the tree, insecure_max_entries is always set to
ht->p.max_size * 2 in rhtashtable_init().
Replace only spot that uses it with a ht->p.max_size check.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 83e7e4ce9e ("mac80211: Use rhltable instead of rhashtable")
removed the last user that made use of 'insecure_elasticity' parameter,
i.e. the default of 16 is used everywhere.
Replace it with a constant.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current annotation is wrong as it says that we're only called
under spinlock. In fact it should be marked as under either
spinlock or RCU read lock.
Fixes: da20420f83 ("rhashtable: Add nested tables")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter reported a use before NULL check bug in the function
bucket_table_free. In fact we don't need the NULL check at all as
no caller can provide a NULL argument. So this patch fixes this by
simply removing it.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code that handles GFP_ATOMIC kmalloc failure on
insertion. As we cannot use vmalloc, we solve it by making our
hash table nested. That is, we allocate single pages at each level
and reach our desired table size by nesting them.
When a nested table is created, only a single page is allocated
at the top-level. Lower levels are allocated on demand during
insertion. Therefore for each insertion to succeed, only two
(non-consecutive) pages are needed.
After a nested table is created, a rehash will be scheduled in
order to switch to a vmalloced table as soon as possible. Also,
the rehash code will never rehash into a nested table. If we
detect a nested table during a rehash, the rehash will be aborted
and a new rehash will be scheduled.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The insecure_elasticity setting is an ugly wart brought out by
users who need to insert duplicate objects (that is, distinct
objects with identical keys) into the same table.
In fact, those users have a much bigger problem. Once those
duplicate objects are inserted, they don't have an interface to
find them (unless you count the walker interface which walks
over the entire table).
Some users have resorted to doing a manual walk over the hash
table which is of course broken because they don't handle the
potential existence of multiple hash tables. The result is that
they will break sporadically when they encounter a hash table
resize/rehash.
This patch provides a way out for those users, at the expense
of an extra pointer per object. Essentially each object is now
a list of objects carrying the same key. The hash table will
only see the lists so nothing changes as far as rhashtable is
concerned.
To use this new interface, you need to insert a struct rhlist_head
into your objects instead of struct rhash_head. While the hash
table is unchanged, for type-safety you'll need to use struct
rhltable instead of struct rhashtable. All the existing interfaces
have been duplicated for rhlist, including the hash table walker.
One missing feature is nulls marking because AFAIK the only potential
user of it does not need duplicate objects. Should anyone need
this it shouldn't be too hard to add.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.
More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has been already
delivered, instead a state variable to track event re-delivery
states, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies __rhashtable_insert_fast() so it returns the
existing object that clashes with the one that you want to insert.
In case the object is successfully inserted, NULL is returned.
Otherwise, you get an error via ERR_PTR().
This patch adapts the existing callers of __rhashtable_insert_fast()
so they handle this new logic, and it adds a new
rhashtable_lookup_get_insert_key() interface to fetch this existing
object.
nf_tables needs this change to improve handling of EEXIST cases via
honoring the NLM_F_EXCL flag and by checking if the data part of the
mapping matches what we have.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
The commit 8f6fd83c6c ("rhashtable:
accept GFP flags in rhashtable_walk_init") added a GFP flag argument
to rhashtable_walk_init because some users wish to use the walker
in an unsleepable context.
In fact we don't need to allocate memory in rhashtable_walk_init
at all. The walker is always paired with an iterator so we could
just stash ourselves there.
This patch does that by introducing a new enter function to replace
the existing init function. This way we don't have to churn all
the existing users again.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sander reports following splat after netfilter nat bysrc table got
converted to rhashtable:
swapper/0: page allocation failure: order:3, mode:0x2084020(GFP_ATOMIC|__GFP_COMP)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc1 [..]
[<ffffffff811633ed>] warn_alloc_failed+0xdd/0x140
[<ffffffff811638b1>] __alloc_pages_nodemask+0x3e1/0xcf0
[<ffffffff811a72ed>] alloc_pages_current+0x8d/0x110
[<ffffffff8117cb7f>] kmalloc_order+0x1f/0x70
[<ffffffff811aec19>] __kmalloc+0x129/0x140
[<ffffffff8146d561>] bucket_table_alloc+0xc1/0x1d0
[<ffffffff8146da1d>] rhashtable_insert_rehash+0x5d/0xe0
[<ffffffff819fcfff>] nf_nat_setup_info+0x2ef/0x400
The failure happens when allocating the spinlock array.
Even with GFP_KERNEL its unlikely for such a large allocation
to succeed.
Thomas Graf pointed me at inet_ehash_locks_alloc(), so in addition
to adding NOWARN for atomic allocations this also makes the bucket-array
sizing more conservative.
In commit 095dc8e0c3 ("tcp: fix/cleanup inet_ehash_locks_alloc()"),
Eric Dumazet says: "Budget 2 cache lines per cpu worth of 'spinlocks'".
IOW, consider size needed by a single spinlock when determining
number of locks per cpu. So with 64 byte per cacheline and 4 byte per
spinlock this gives 32 locks per cpu.
Resulting size of the lock-array (sizeof(spinlock) == 4):
cpus: 1 2 4 8 16 32 64
old: 1k 1k 4k 8k 16k 16k 16k
new: 128 256 512 1k 2k 4k 8k
8k allocation should have decent chance of success even
with GFP_ATOMIC, and should not fail with GFP_KERNEL.
With 72-byte spinlock (LOCKDEP):
cpus : 1 2
old: 9k 18k
new: ~2k ~4k
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In certain cases, the 802.11 mesh pathtable code wants to
iterate over all of the entries in the forwarding table from
the receive path, which is inside an RCU read-side critical
section. Enable walks inside atomic sections by allowing
GFP_ATOMIC allocations for the walker state.
Change all existing callsites to pass in GFP_KERNEL.
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
[also adjust gfs2/glock.c and rhashtable tests]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The commit c6ff526829 ("rhashtable:
Fix walker list corruption") causes a suspicious RCU usage warning
because we no longer hold ht->mutex when we dereference ht->tbl.
However, this is a false positive because we now hold ht->lock
which also guarantees that ht->tbl won't disppear from under us.
This patch kills the warning by using rcu_dereference_protected.
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit ba7c95ea38 ("rhashtable:
Fix sleeping inside RCU critical section in walk_stop") introduced
a new spinlock for the walker list. However, it did not convert
all existing users of the list over to the new spin lock. Some
continued to use the old mutext for this purpose. This obviously
led to corruption of the list.
The fix is to use the spin lock everywhere where we touch the list.
This also allows us to do rcu_rad_lock before we take the lock in
rhashtable_walk_start. With the old mutex this would've deadlocked
but it's safe with the new spin lock.
Fixes: ba7c95ea38 ("rhashtable: Fix sleeping inside RCU...")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Hua <william.hua@canonical.com> wrote:
>
> I wasn't aware there was an enforced minimum size. I simply set the
> nelem_hint in the rhastable_params struct to 1, expecting it to grow as
> needed. This caused a segfault afterwards when trying to insert an
> element.
OK we're doing the size computation before we enforce the limit
on min_size.
---8<---
We need to do the initial hash table size computation after we
have obtained the correct min_size/max_size parameters. Otherwise
we may end up with a hash table whose size is outside the allowed
envelope.
Fixes: a998f712f7 ("rhashtable: Round up/down min/max_size to...")
Reported-by: William Hua <william.hua@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch 9497df88ab ("rhashtable:
Fix reader/rehash race") added a pair of barriers. In fact the
wmb is superfluous because every subsequent write to the old or
new hash table uses rcu_assign_pointer, which itself carriers a
full barrier prior to the assignment.
Therefore we may remove the explicit wmb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>