The request_sock_queue's listen_opt is either vmalloc-ed or
kmalloc-ed depending on the number of table entries. Thus it
is expected to be handled properly on free, which is done in
the reqsk_queue_destroy().
However the error path in inet_csk_listen_start() calls
the lite version of reqsk_queue_destroy, called
__reqsk_queue_destroy, which calls the kfree unconditionally.
Fix this and move the __reqsk_queue_destroy into a .c file as
it looks too big to be inline.
As David also noticed, this is an error recovery path only,
so no locking is required and the lopt is known to be not NULL.
reqsk_queue_yank_listen_sk is also now only used in
net/core/request_sock.c so we should move it there too.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We round up the header size in sk_stream_alloc_pskb so that
TSO packets get zero tail room. Unfortunately this rounding
up is not coordinated with the select_size() function used by
TCP to calculate the second parameter of sk_stream_alloc_pskb.
As a result, we may allocate more than a page of data in the
non-TSO case when exactly one page is desired.
In fact, rounding up the head room is detrimental in the non-TSO
case because it makes memory that would otherwise be available to
the payload head room. TSO doesn't need this either, all it wants
is the guarantee that there is no tail room.
So this patch fixes this by adjusting the skb_reserve call so that
exactly the requested amount (which all callers have calculated in
a precise way) is made available as tail room.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverts Eric's commit 2b008b0a8e
It diets .text & .data section of the kernel if CONFIG_NET_NS is not set.
This is safe after list operations cleanup.
Signed-of-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The inetpeer.c tracks the LRU list of inet_perr-s, but makes
it by hands. Use the list_head-s for this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a small memory leak. Default fib rules can be deleted by
the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by
ip rule flush
Such a rule will not be freed as the ref-counter has 2 on start and becomes
clearly unreachable after removal.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter is _always_ modified under the unix_gc_lock spinlock,
so its atomicity can be provided w/o additional efforts.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver operations set_ieee8021x(), set_port_auth() and
set_privacy_invoked() are not used by any drivers, except
set_privacy_invoked() they aren't even used by mac80211.
Remove them at least until we need to support drivers with
mac80211 that require getting this information.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Michael Wu <flamingice@sourmilk.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This allows a driver to ask for a specific rate control algorithm.
The rate control algorithm asked for must be registered and be
available as a module or built-in.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There are many places that get the dst entry, increase the
__use counter and set the "lastuse" time stamp.
Make a helper for this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP-AUTH and future ADD-IP updates have a requirement to
do additional verification of parameters and an ability to
ABORT the association if verification fails. So, introduce
additional return code so that we can clear signal a required
action.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
This patch adds a tunable that will allow ADD_IP to work without
AUTH for backward compatibility. The default value is off since
the default value for ADD_IP is off as well. People who need
to use ADD-IP with older implementations take risks of connection
hijacking and should consider upgrading or turning this tunable on.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
After learning more about rcu, it looks like the ADD-IP hadling
doesn't need to call call_rcu_bh. All the rcu critical sections
use rcu_read_lock, so using call_rcu_bh is wrong here.
Now, restore the local_bh_disable() code blocks and use normal
call_rcu() calls. Also restore the missing return statement.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Commit d0ce92910b broke several retransmit
cases including fast retransmit. The reason is that we should
only delay by rto while doing retranmists as a result of a timeout.
Retransmit as a result of path mtu discover, fast retransmit, or
other evernts that should trigger immidiate retransmissions got broken.
Also, since rto is doubled prior to marking of packets elegable for
retransmission, we never marked correct chunks anyway.
The fix is provide a reason for a given retransmission so that we
can mark chunks appropriately and to save the old rto value to do
comparisons against.
All regressions tests passed with this code.
Spotted by Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.
On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.
Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.
This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the master daemon to sync the connection when it is about
to close. This makes the connections on the backup to close or timeout
according their state. Before the sync was performed only if the
connection is in ESTABLISHED state which always made the connections to
timeout in the hard coded 3 minutes. However the Andy Gospodarek's patch
([IPVS]: use proper timeout instead of fixed value) effectively did nothing
more than increasing this to 15 minutes (Established state timeout). So
this patch makes use of proper timeout since it syncs the connections on
status changes to FIN_WAIT (2min timeout) and CLOSE (10sec timeout).
However if the backup misses CLOSE hopefully it did not miss FIN_WAIT.
Otherwise we will just have to wait for the ESTABLISHED state timeout. As
it is without this patch. This way the number of the hanging connections
on the backup is kept to minimum. And very few of them will be left to
timeout with a long timeout.
This is important if we want to make use of the fix for the real server
overcommit on master/backup fail-over.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the problem with node overload on director fail-over.
Given the scenario: 2 nodes each accepting 3 connections at a time and 2
directors, director failover occurs when the nodes are fully loaded (6
connections to the cluster) in this case the new director will assign
another 6 connections to the cluster, If the same real servers exist
there.
The problem turned to be in not binding the inherited connections to
the real servers (destinations) on the backup director. Therefore:
"ipvsadm -l" reports 0 connections:
root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP test2.local:5999 wlc
-> node473.local:5999 Route 1000 0 0
-> node484.local:5999 Route 1000 0 0
while "ipvs -lnc" is right
root@test2:~# ipvsadm -lnc
IPVS connection entries
pro expire state source virtual destination
TCP 14:56 ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999
192.168.0.51:5999
TCP 14:59 ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999
192.168.0.52:5999
So the patch I am sending fixes the problem by binding the received
connections to the appropriate service on the backup director, if it
exists, else the connection will be handled the old way. So if the
master and the backup directors are synchronized in terms of real
services there will be no problem with server over-committing since
new connections will not be created on the nonexistent real services
on the backup. However if the service is created later on the backup,
the binding will be performed when the next connection update is
received. With this patch the inherited connections will show as
inactive on the backup:
root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP test2.local:5999 wlc
-> node473.local:5999 Route 1000 0 1
-> node484.local:5999 Route 1000 0 1
rumen@test2:~$ cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP C0A800DE:176F wlc
-> C0A80033:176F Route 1000 0 1
-> C0A80032:176F Route 1000 0 1
Regards,
Rumen Bogdanovski
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
There are places that check for CONFIG_IP_MULTIPLE_TABLES
twice in the same file, but the internals of these #ifdefs
can be merged.
As a side effect - remove one ifdef from inside a function.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.
If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.
In this patch, I tried to :
- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation
Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP
If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.
When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not architecture specific code should not #include <asm/scatterlist.h>.
This patch therefore either replaces them with
#include <linux/scatterlist.h> or simply removes them if they were
unused.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>