commit aea9d711f3 upstream.
The code that hashes and unhashes connections from the connection table
is missing locking of the connection being modified, which opens up a
race condition and results in memory corruption when this race condition
is hit.
Here is what happens in pretty verbose form:
CPU 0 CPU 1
------------ ------------
An active connection is terminated and
we schedule ip_vs_conn_expire() on this
CPU to expire this connection.
IRQ assignment is changed to this CPU,
but the expire timer stays scheduled on
the other CPU.
New connection from same ip:port comes
in right before the timer expires, we
find the inactive connection in our
connection table and get a reference to
it. We proper lock the connection in
tcp_state_transition() and read the
connection flags in set_tcp_state().
ip_vs_conn_expire() gets called, we
unhash the connection from our
connection table and remove the hashed
flag in ip_vs_conn_unhash(), without
proper locking!
While still holding proper locks we
write the connection flags in
set_tcp_state() and this sets the hashed
flag again.
ip_vs_conn_expire() fails to expire the
connection, because the other CPU has
incremented the reference count. We try
to re-insert the connection into our
connection table, but this fails in
ip_vs_conn_hash(), because the hashed
flag has been set by the other CPU. We
re-schedule execution of
ip_vs_conn_expire(). Now this connection
has the hashed flag set, but isn't
actually hashed in our connection table
and has a dangling list_head.
We drop the reference we held on the
connection and schedule the expire timer
for timeouting the connection on this
CPU. Further packets won't be able to
find this connection in our connection
table.
ip_vs_conn_expire() gets called again,
we think it's already hashed, but the
list_head is dangling and while removing
the connection from our connection table
we write to the memory location where
this list_head points to.
The result will probably be a kernel oops at some other point in time.
This race condition is pretty subtle, but it can be triggered remotely.
It needs the IRQ assignment change or another circumstance where packets
coming from the same ip:port for the same service are being processed on
different CPUs. And it involves hitting the exact time at which
ip_vs_conn_expire() gets called. It can be avoided by making sure that
all packets from one connection are always processed on the same CPU and
can be made harder to exploit by changing the connection timeouts to
some custom values.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit ef1691504c upstream.
Commit 8ccb92ad (netfilter: xt_recent: fix false match) fixed supposedly
false matches in rules using a zero hit_count. As it turns out there is
nothing false about these matches and people are actually using entries
with a hit_count of zero to make rules dependant on addresses inserted
manually through /proc.
Since this slipped past the eyes of three reviewers, instead of
reverting the commit in question, this patch explicitly checks
for a hit_count of zero to make the intentions more clear.
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 2c08522e5d upstream.
e->index overflows e->stamps[] every ip_pkt_list_tot packets.
Consider the case when ip_pkt_list_tot==1; the first packet received is stored
in e->stamps[0] and e->index is initialized to 1. The next received packet
timestamp is then stored at e->stamps[1] in recent_entry_update(),
a buffer overflow because the maximum e->stamps[] index is 0.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit d696c7bdaa upstream.
As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.
Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instanciating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 13ccdfc2af upstream.
Expectation hashtable size was simply glued to a variable with no code
to rehash expectations, so it was a bug to allow writing to it.
Make "expect_hashsize" readonly.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 5b3501faa8 upstream.
nf_conntrack_cachep is currently shared by all netns instances, but
because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.
If we use a shared slab cache, one object can instantly flight between
one hash table (netns ONE) to another one (netns TWO), and concurrent
reader (doing a lookup in netns ONE, 'finding' an object of netns TWO)
can be fooled without notice, because no RCU grace period has to be
observed between object freeing and its reuse.
We dont have this problem with UDP/TCP slab caches because TCP/UDP
hashtables are global to the machine (and each object has a pointer to
its netns).
If we use per netns conntrack hash tables, we also *must* use per netns
conntrack slab caches, to guarantee an object can not escape from one
namespace to another one.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[Patrick: added unique slab name allocation]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 9edd7ca0a3 upstream.
As discovered by Jon Masters <jonathan@jonmasters.org>, the "untracked"
conntrack, which is located in the data section, might be accidentally
freed when a new namespace is instantiated while the untracked conntrack
is attached to a skb because the reference count it re-initialized.
The best fix would be to use a seperate untracked conntrack per
namespace since it includes a namespace pointer. Unfortunately this is
not possible without larger changes since the namespace is not easily
available everywhere we need it. For now move the untracked conntrack
initialization to the init_net setup function to make sure the reference
count is not re-initialized and handle cleanup in the init_net cleanup
function to make sure namespaces can exit properly while the untracked
conntrack is in use in other namespaces.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 3a0429292d upstream.
commit d6d3f08b0f
(netfilter: xtables: conntrack match revision 2) does break the
v1 conntrack match iptables-save output in a subtle way.
Problem is as follows:
up = kmalloc(sizeof(*up), GFP_KERNEL);
[..]
/*
* The strategy here is to minimize the overhead of v1 matching,
* by prebuilding a v2 struct and putting the pointer into the
* v1 dataspace.
*/
memcpy(up, info, offsetof(typeof(*info), state_mask));
[..]
*(void **)info = up;
As the v2 struct pointer is saved in the match data space,
it clobbers the first structure member (->origsrc_addr).
Because the _v1 match function grabs this pointer and does not actually
look at the v1 origsrc, run time functionality does not break.
But iptables -nvL (or iptables-save) cannot know that v1 origsrc_addr
has been overloaded in this way:
$ iptables -p tcp -A OUTPUT -m conntrack --ctorigsrc 10.0.0.1 -j ACCEPT
$ iptables-save
-A OUTPUT -p tcp -m conntrack --ctorigsrc 128.173.134.206 -j ACCEPT
(128.173... is the address to the v2 match structure).
To fix this, we take advantage of the fact that the v1 and v2 structures
are identical with exception of the last two structure members (u8 in v1,
u16 in v2).
We extract them as early as possible and prevent the v2 matching function
from looking at those two members directly.
Previously reported by Michel Messerschmidt via Ben Hutchings, also
see Debian Bug tracker #556587.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit aaff23a95a upstream.
As noticed by Dan Carpenter <error27@gmail.com>, update_nl_seq()
currently contains an out of bounds read of the seq_aft_nl array
when looking for the oldest sequence number position.
Fix it to only compare valid positions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 258c889362 upstream.
Make sure that any otherwise uninitialised fields of usvc are zero.
This has been obvserved to cause a problem whereby the port of
fwmark services may end up as a non-zero value which causes
scheduling of a destination server to fail for persisitent services.
As observed by Deon van der Merwe <dvdm@truteq.co.za>.
This fix suggested by Julian Anastasov <ja@ssi.bg>.
For good measure also zero udest.
Cc: Deon van der Merwe <dvdm@truteq.co.za>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit acc738fe (netfilter: xtables: avoid pointer to self) introduced
an invalid return value in limit_mt_check().
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
net/fsl_pq_mdio: add module license GPL
can: fix WARN_ON dump in net/core/rtnetlink.c:rtmsg_ifinfo()
can: should not use __dev_get_by_index() without locks
hisax: remove bad udelay call to fix build error on ARM
ipip: Fix handling of DF packets when pmtudisc is OFF
qlge: Set PCIe reset type for EEH to fundamental.
qlge: Fix early exit from mbox cmd complete wait.
ixgbe: fix traffic hangs on Tx with ioatdma loaded
ixgbe: Fix checking TFCS register for TXOFF status when DCB is enabled
ixgbe: Fix gso_max_size for 82599 when DCB is enabled
macsonic: fix crash on PowerBook 520
NET: cassini, fix lock imbalance
ems_usb: Fix byte order issues on big endian machines
be2net: Bug fix to send config commands to hardware after netdev_register
be2net: fix to set proper flow control on resume
netfilter: xt_connlimit: fix regression caused by zero family value
rt2x00: Don't queue ieee80211 work after USB removal
Revert "ipw2200: fix oops on missing firmware"
decnet: netdevice refcount leak
netfilter: nf_nat: fix NAT issue in 2.6.30.4+
...
Vitezslav Samel discovered that since 2.6.30.4+ active FTP can not work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e36).
However, actually, that fix uncovered a long standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end), by the offset NAT produced with
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value and thus resulted breaking active FTP.
The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be more harder and expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.
Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>