Commit Graph

128 Commits

Author SHA1 Message Date
Eric Dumazet
efe4208f47 ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 00:01:25 -04:00
Shawn Bohrer
fbf8866d65 net: ipv4 only populate IP_PKTINFO when needed
The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received.  This has introduced a performance regression for some
UDP workloads.

This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.

Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After  90587.62 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After  12.48us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 16:27:33 -04:00
Eric W. Biederman
0bbf87d852 net ipv4: Convert ipv4.ip_local_port_range to be per netns v3
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
  of the callers.
- Move the initialization of sysctl_local_ports into
   sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c

v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next

v3:
- Compile fixes of strange callers of inet_get_local_port_range.
  This patch now successfully passes an allmodconfig build.
  Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range

Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 21:59:38 -07:00
Francesco Fusco
aa66158145 ipv4: processing ancillary IP_TOS or IP_TTL
If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:21:52 -07:00
Francesco Fusco
f02db315b8 ipv4: IP_TOS and IP_TTL can be specified as ancillary data
This patch enables the IP_TTL and IP_TOS values passed from userspace to
be stored in the ipcm_cookie struct. Three fields are added to the struct:

- the TTL, expressed as __u8.
  The allowed values are in the [1-255].
  A value of 0 means that the TTL is not specified.

- the TOS, expressed as __s16.
  The allowed values are in the range [0,255].
  A value of -1 means that the TOS is not specified.

- the priority, expressed as a char and computed when
  handling the ancillary data.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:21:51 -07:00
Joe Perches
5c3a0fd7d0 ip*.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-21 14:01:38 -04:00
Ansis Atteka
703133de33 ip: generate unique IP identificator if local fragmentation is allowed
If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.

For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-19 14:11:15 -04:00
Cong Wang
5a17a390de net: make snmp_mib_free static inline
Fengguang reported:

   net/built-in.o: In function `in6_dev_finish_destroy':
   (.text+0x4ca7d): undefined reference to `snmp_mib_free'

this is due to snmp_mib_free() is defined when CONFIG_INET is enabled,
but in6_dev_finish_destroy() is now moved to core kernel.

I think snmp_mib_free() is small enough to be inlined, so just make it
static inline.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-02 21:00:50 -07:00
Steffen Klassert
8141ed9fce ipv4: Add a socket release callback for datagram sockets
This implements a socket release callback function to check
if the socket cached route got invalid during the time
we owned the socket. The function is used from udp, raw
and ping sockets.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 14:17:05 -05:00
Patrick McHardy
5f2d04f1f9 ipv4: fix path MTU discovery with connection tracking
IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and
(in case of forwarded packets) refragments them at POST_ROUTING
independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
the local route without caring about the original fragment sizes,
thereby breaking PMTUD.

This patch fixes this by keeping track of the largest received fragment
with IP_DF set and generates an ICMP fragmentation required error during
refragmentation if that size exceeds the MTU.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
2012-08-26 19:13:55 +02:00
Eric Dumazet
b5ec8eeac4 ipv4: fix ip_send_skb()
ip_send_skb() can send orphaned skb, so we must pass the net pointer to
avoid possible NULL dereference in error path.

Bug added by commit 3a7c384ffd (ipv4: tcp: unicast_sock should not
land outside of TCP stack)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-10 14:08:57 -07:00
Eric Dumazet
be9f4a44e7 ipv4: tcp: remove per net tcp_sock
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.

This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.

To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)

Additional features :

1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.

2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()

3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, sysctl_wmem_default value was
copied at boot time, so any further change would not affect tcp_sock
limit)

[1] These packets are only generated when no socket was matched for
the incoming packet.

Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:35:30 -07:00
David S. Miller
70e7341673 ipv4: Show that ip_send_reply() is purely unicast routine.
Rename it to ip_send_unicast_reply() and add explicit 'saddr'
argument.

This removed one of the few users of rt->rt_spec_dst.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 03:21:41 -07:00
Alexander Duyck
6648bd7e0e ipv4: Add sysctl knob to control early socket demux
This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%.  For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added.  It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.

By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.

[ Move local variables in ip_rcv_finish() down into the basic
  block in which they are actually used.  -DaveM ]

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-22 17:11:13 -07:00
Paul Gortmaker
211ed86510 net: delete all instances of special processing for token ring
We are going to delete the Token ring support.  This removes any
special processing in the core networking for token ring, (aside
from net/tr.c itself), leaving the drivers and remaining tokenring
support present but inert.

The mass removal of the drivers and net/tr.c will be in a separate
commit, so that the history of these files that we still care
about won't have the giant deletion tied into their history.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-05-15 20:14:35 -04:00
Eric W. Biederman
a5347fe36b net: Delete all remaining instances of ctl_path
We don't use struct ctl_path anymore so delete the exported constants.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:22:30 -04:00
David S. Miller
ba57b4db26 ipv4: Make ip_call_ra_chain() return bool.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-09 14:34:50 -08:00
Eric Dumazet
dfd56b8b38 net: use IS_ENABLED(CONFIG_IPV6)
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-11 18:25:16 -05:00
Eric Dumazet
d826eb14ec ipv4: PKTINFO doesnt need dst reference
Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :

> At least, in recent kernels we dont change dst->refcnt in forwarding
> patch (usinf NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>

OK I found it, I did some extra tests and believe its ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09 16:36:27 -05:00
Eric Dumazet
66b13d99d9 ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.

In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.

Example of things that were broken :
  - Routing using TOS as a selector
  - Firewalls
  - Trafic classification / shaping

We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.

Notes :
 - We still reflect incoming TOS in RST messages.
 - We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
 - A patch is needed for IPv6

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 03:06:21 -04:00
Eric Dumazet
bc416d9768 macvlan: handle fragmented multicast frames
Fragmented multicast frames are delivered to a single macvlan port,
because ip defrag logic considers other samples are redundant.

Implement a defrag step before trying to send the multicast frame.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-18 23:22:07 -04:00
David S. Miller
595fc71baa ipv4: Add ip_defrag() agent IP_DEFRAG_AF_PACKET.
Elide the ICMP on frag queue timeouts unconditionally for
this user.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-05 22:34:52 -07:00
David S. Miller
d18cd551d1 net: Fix build failures due to ip_is_fragment()
It needs to be available even when CONFIG_INET is not set.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-23 21:28:52 -07:00
Paul Gortmaker
56f8a75c17 ip: introduce ip_is_fragment helper inline function
There are enough instances of this:

    iph->frag_off & htons(IP_MF | IP_OFFSET)

that a helper function is probably warranted.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 20:33:34 -07:00
Eric Dumazet
4b9d9be839 inetpeer: remove unused list
Andi Kleen and Tim Chen reported huge contention on inetpeer
unused_peers.lock, on memcached workload on a 40 core machine, with
disabled route cache.

It appears we constantly flip peers refcnt between 0 and 1 values, and
we must insert/remove peers from unused_peers.list, holding a contended
spinlock.

Remove this list completely and perform a garbage collection on-the-fly,
at lookup time, using the expired nodes we met during the tree
traversal.

This removes a lot of code, makes locking more standard, and obsoletes
two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
removes two pointers in inet_peer structure.

There is still a false sharing effect because refcnt is in first cache
line of object [were the links and keys used by lookups are located], we
might move it at the end of inet_peer structure to let this first cache
line mostly read by cpus.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 17:05:30 -07:00