[ Upstream commit 6aafeef03b ]
Pushing original fragments through causes several problems. For example
for matching, frags may not be matched correctly. Take following
example:
<example>
On HOSTA do:
ip6tables -I INPUT -p icmpv6 -j DROP
ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
and on HOSTB you do:
ping6 HOSTA -s2000 (MTU is 1500)
Incoming echo requests will be filtered out on HOSTA. This issue does
not occur with smaller packets than MTU (where fragmentation does not happen)
</example>
As was discussed previously, the only correct solution seems to be to use
reassembled skb instead of separete frags. Doing this has positive side
effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams
dances in ipvs and conntrack can be removed.
Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
entirely and use code in net/ipv6/reassembly.c instead.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 703133de33 ]
If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.
For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make sure that SCTP ports are writable when embedded in ICMP
from client, so that ip_vs_nat_icmp can translate them safely.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The entry struct has a 2 byte hole after ->port and another 4 byte
hole after ->stats.outpkts. You must have CAP_NET_ADMIN in your
namespace to hit this information leak.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
kfree_rcu() requires offsetof(..., rcu_head) < 4096, which can
get violated with a sufficiently high CONFIG_IP_VS_SH_TAB_BITS.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Expire cached connection for new TCP/SCTP connection if real
server is down. Otherwise, IPVS uses the dead server for the
reused connection, instead of a new working one.
Signed-off-by: Grzegorz Lyczba <grzegorz.lyczba@gmail.com>
Acked-by: Hans Schillstrom <hans@schillstrom.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
drivers/net/ethernet/emulex/benet/be.h
include/net/tcp.h
net/mac802154/mac802154.h
Most conflicts were minor overlapping stuff.
The be2net driver brought in some fixes that added __vlan_put_tag
calls, which in net-next take an additional argument.
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ip_vs_sync_mesg and ip_vs_sync_mesg_v0 are both sent across the wire
and used internally to store IPVS synchronisation messages.
Up until now the scheme used has been to convert the size field
to network byte order before sending a message on the wire and
convert it to host byte order when sending a message.
This patch changes that scheme to always treat the field
as being network byte order. This seems appropriate as
the structure is sent across the wire. And by consistently
treating the field has network byte order it is now possible
to take advantage of sparse to flag any future miss-use.
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
The sctp_events[] come from sch->type in set_sctp_state(). They are
between 0-255 so that means we need 256 elements in the array.
I believe that because of how the code is aligned there is normally a
hole after sctp_events[] so this patch doesn't actually change anything.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
There are two motivations for this:
1. It improves readability to my eyes
2. Using nested min() calls results in a shadowed _min1 variable,
which is a bit untidy. Sparse complained about this.
I have also replaced (size_t)64 with a variable of type size_t and value 64.
This also improves readability to my eyes.
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Some service fields are in network order:
- netmask: used once in network order and also as prefix len for IPv6
- port
Other parameters are in host order:
- struct ip_vs_flags: flags and mask moved between user and kernel only
- sync state: moved between user and kernel only
- syncid: sent over network as single octet
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
kbuild test robot reports for sparse warnings in
commits c2a4ffb70e ("ipvs: convert lblc scheduler to rcu")
and c5549571f9 ("ipvs: convert lblcr scheduler to rcu").
Fix it by removing extra __rcu annotation.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
- RCU annotations for ip_vs_info_seq_start and _stop
- __percpu for cpustats
- properly dereference svc->pe in ip_vs_genl_fill_service
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
kbuild test robot reports for sparse warnings
in commit 088339a57d ("ipvs: convert connection locking"):
net/netfilter/ipvs/ip_vs_conn.c:962:13: warning: context imbalance
in 'ip_vs_conn_array' - wrong count at exit
include/linux/rcupdate.h:326:30: warning: context imbalance in
'ip_vs_conn_seq_next' - unexpected unlock
include/linux/rcupdate.h:326:30: warning: context imbalance in
'ip_vs_conn_seq_stop' - unexpected unlock
Fix it by running ip_vs_conn_array under RCU lock
to avoid conditional locking and by adding proper RCU
annotations.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Use rcu_dereference_protected to resolve
sparse warning, found by kbuild test robot:
net/netfilter/ipvs/ip_vs_ctl.c:1464:35: warning: dereference of
noderef expression
Problem from commit 026ace060d
("ipvs: optimize dst usage for real server")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
We used a global BH disable in LOCAL_OUT hook.
Add _bh suffix to all places that need it and remove
the disabling from LOCAL_OUT and sync code.
Functions like ip_defrag need protection from
BH, so add it. As for nf_nat_mangle_tcp_packet, it needs
RCU lock.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This is the final step in RCU conversion.
Things that are removed:
- svc->usecnt: now svc is accessed under RCU read lock
- svc->inc: and some unused code
- ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE
- __ip_vs_svc_lock: replaced with RCU
- IP_VS_WAIT_WHILE: now readers lookup svcs and dests under
RCU and work in parallel with configuration
Other changes:
- before now, a RCU read-side critical section included the
calling of the schedule method, now it is extended to include
service lookup
- ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist
- svc->pe and svc->scheduler remain to the end (of grace period),
the schedulers are prepared for such RCU readers
even after done_service is called but they need
to use synchronize_rcu because last ip_vs_scheduler_put
can happen while RCU read-side critical sections
use an outdated svc->scheduler pointer
- as planned, update_service is removed
- empty services can be freed immediately after grace period.
If dests were present, the services are freed from
the dest trash code
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In previous commits the schedulers started to access
svc->destinations with _rcu list traversal primitives
because the IP_VS_WAIT_WHILE macro still plays the role of
grace period. Now it is time to finish the updating part,
i.e. adding and deleting of dests with _rcu suffix before
removing the IP_VS_WAIT_WHILE in next commit.
We use the same rule for conns as for the
schedulers: dests can be searched in RCU read-side critical
section where ip_vs_dest_hold can be called by ip_vs_bind_dest.
Some things are not perfect, for example, calling
functions like ip_vs_lookup_dest from updating code under
RCU, just because we use some function both from reader
and from updater.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This method releases the scheduler state,
it can not fail. Such change will help to properly
replace the scheduler in following patch.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
All dests will go to trash, no exceptions.
But we have to use new list node t_list for this, due
to RCU changes in following patches. Dests will wait there
initial grace period and later all conns and schedulers to
put their reference. The dests don't get reference for
staying in dest trash as before.
As result, we do not load ip_vs_dest_put with
extra checks for last refcnt and the schedulers do not
need to play games with atomic_inc_not_zero while
selecting best destination.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The schedule method now needs _rcu list-traversal
primitive for svc->destinations. As the weight for some
dest can be reduced during dest selection, change the
algorithm to check weights by using minimum weights in the
1 .. max_weight-(di-1) range, with the same step (di). By this
way we ensure that there will be always a weight >= 1 check
before claiming that all destinations are overloaded.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The schedule method now needs _rcu list-traversal
primitive for svc->destinations.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>