Use the macros defined for the members of flowi to clean the code up.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each net_device contains address family specific data such as
per device settings and statistics. We already expose this data
via procfs/sysfs and partially netlink.
The netlink method requires the requester to send one RTM_GETLINK
request for each address family it wishes to receive data of
and then merge this data itself.
This patch implements a new API which combines all address family
specific link data in a new netlink attribute IFLA_AF_SPEC.
IFLA_AF_SPEC contains a sequence of nested attributes, one for each
address family which in turn defines the structure of its own
attribute. Example:
[IFLA_AF_SPEC] = {
[AF_INET] = {
[IFLA_INET_CONF] = ...,
},
[AF_INET6] = {
[IFLA_INET6_FLAGS] = ...,
[IFLA_INET6_CONF] = ...,
}
}
The API also allows for address families to implement a function
which parses the IFLA_AF_SPEC attribute sent by userspace to
implement address family specific link options.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, fields in struct sock are not optimally ordered, because each
path (RX softirq, TX completion, RX user, TX user) has to touch fields
that are contained in many different cache lines.
The really critical thing is to shrink number of cache lines that are
used at RX softirq time : CPU handling softirqs for a device can receive
many frames per second for many sockets. If load is too big, we can drop
frames at NIC level. RPS or multiqueue cards can help, but better reduce
latency if possible.
This patch starts with UDP protocol, then additional patches will try to
reduce latencies of other ones as well.
At RX softirq time, fields of interest for UDP protocol are :
(not counting ones in inet struct for the lookup)
Read/Written:
sk_refcnt (atomic increment/decrement)
sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
sk_receive_queue
sk_backlog (if socket locked by user program)
sk_rxhash
sk_forward_alloc
sk_drops
Read only:
sk_rcvbuf (sk_rcvqueues_full())
sk_filter
sk_wq
sk_policy[0]
sk_flags
Additional notes :
- sk_backlog has one hole on 64bit arches. We can fill it to save 8
bytes.
- sk_backlog is used only if RX sofirq handler finds the socket while
locked by user.
- sk_rxhash is written only once per flow.
- sk_drops is written only if queues are full
Final layout :
[1] One section grouping all read/write fields, but placing rxhash and
sk_backlog at the end of this section.
[2] One section grouping all read fields in RX handler
(sk_filter, sk_rcv_buf, sk_wq)
[3] Section used by other paths
I'll post a patch on its own to put sk_refcnt at the end of struct
sock_common so that it shares same cache line than section [1]
New offsets on 64bit arch :
sizeof(struct sock)=0x268
offsetof(struct sock, sk_refcnt) =0x10
offsetof(struct sock, sk_lock) =0x48
offsetof(struct sock, sk_receive_queue)=0x68
offsetof(struct sock, sk_backlog)=0x80
offsetof(struct sock, sk_rmem_alloc)=0x80
offsetof(struct sock, sk_forward_alloc)=0x98
offsetof(struct sock, sk_rxhash)=0x9c
offsetof(struct sock, sk_rcvbuf)=0xa4
offsetof(struct sock, sk_drops) =0xa0
offsetof(struct sock, sk_filter)=0xa8
offsetof(struct sock, sk_wq)=0xb0
offsetof(struct sock, sk_policy)=0xd0
offsetof(struct sock, sk_flags) =0xe0
Instead of :
sizeof(struct sock)=0x270
offsetof(struct sock, sk_refcnt) =0x10
offsetof(struct sock, sk_lock) =0x50
offsetof(struct sock, sk_receive_queue)=0xc0
offsetof(struct sock, sk_backlog)=0x70
offsetof(struct sock, sk_rmem_alloc)=0xac
offsetof(struct sock, sk_forward_alloc)=0x10c
offsetof(struct sock, sk_rxhash)=0x128
offsetof(struct sock, sk_rcvbuf)=0x4c
offsetof(struct sock, sk_drops) =0x16c
offsetof(struct sock, sk_filter)=0x198
offsetof(struct sock, sk_wq)=0x88
offsetof(struct sock, sk_policy)=0x98
offsetof(struct sock, sk_flags) =0x130
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sockets refcount is usually 2, unless an incoming frame is going to
be queued in receive or backlog queue.
Using atomic_inc_not_zero_hint() permits to reduce latency, because
processor issues less memory transactions.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The changed functions do not modify the NL messages and/or attributes
at all. They should use const (similar to strchr), so that callers
which have a const nlmsg/nlattr around can make use of them without
casting.
While at it, constify a data array.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE Key field is intended to be used for identifying an individual
traffic flow within a tunnel. It is useful to be able to have XFRM
policy selector matches to have different policies for different
GRE tunnels.
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should be enabling country IE hints for WIPHY_FLAG_STRICT_REGULATORY
even if we haven't yet recieved regulatory domain hint for the driver
if it needed one. Without this Country IEs are not passed on to drivers
that have set WIPHY_FLAG_STRICT_REGULATORY, today this is just all
Atheros chipset drivers: ath5k, ath9k, ar9170, carl9170.
This was part of the original design, however it was completely
overlooked...
Cc: Easwar Krishnan <easwar.krishnan@atheros.com>
Cc: stable@kernel.org
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (66 commits)
can-bcm: fix minor heap overflow
gianfar: Do not call device_set_wakeup_enable() under a spinlock
ipv6: Warn users if maximum number of routes is reached.
docs: Add neigh/gc_thresh3 and route/max_size documentation.
axnet_cs: fix resume problem for some Ax88790 chip
ipv6: addrconf: don't remove address state on ifdown if the address is being kept
tcp: Don't change unlocked socket state in tcp_v4_err().
x25: Prevent crashing when parsing bad X.25 facilities
cxgb4vf: add call to Firmware to reset VF State.
cxgb4vf: Fail open if link_start() fails.
cxgb4vf: flesh out PCI Device ID Table ...
cxgb4vf: fix some errors in Gather List to skb conversion
cxgb4vf: fix bug in Generic Receive Offload
cxgb4vf: don't implement trivial (and incorrect) ndo_select_queue()
ixgbe: Look inside vlan when determining offload protocol.
bnx2x: Look inside vlan when determining checksum proto.
vlan: Add function to retrieve EtherType from vlan packets.
virtio-net: init link state correctly
ucc_geth: Fix deadlock
ucc_geth: Do not bring the whole IF down when TX failure.
...
in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
This can easily be converted to a RCU protection.
Writers hold RTNL, so mc_list_lock is removed, not replaced by a
spinlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cypher Wu <cypher.w@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we test rt->fl.iif against zero, we're seeing if it's
an output or an input route.
Make that explicit with some helper functions.
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems idev field in struct rtable has no special purpose, but adding
extra atomic ops.
We hold refcounts on the device itself (using percpu data, so pretty
cheap in current kernel).
infiniband case is solved using dst.dev instead of idev->dev
Removal of this field means routing without route cache is now using
shared data, percpu data, and only potential contention is a pair of
atomic ops on struct neighbour per forwarded packet.
About 5% speedup on routing test.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is important to move nud_state outside of the often modified cache
line (because of refcnt), to reduce false sharing in neigh_event_send()
This is a followup of commit 0ed8ddf404 (neigh: Protect neigh->ha[]
with a seqlock)
This gives a 7% speedup on routing test with IP route cache disabled.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robin Holt tried to boot a 16TB machine and found some limits were
reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]
We can switch infrastructure to use long "instead" of "int", now
atomic_long_t primitives are available for free.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Robin Holt <holt@sgi.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While tracking dev_base_lock users, I found decnet used it in
dnet_select_source(), but for a wrong purpose:
Writers only hold RTNL, not dev_base_lock, so readers must use RCU if
they cannot use RTNL.
Adds an rcu_head in struct dn_ifaddr and handle proper RCU management.
Adds __rcu annotation in dn_route as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Presently the b43legacy build fails on an sh randconfig:
In file included from include/net/dst.h:12,
from drivers/net/wireless/b43legacy/xmit.c:32:
include/net/dst_ops.h:28: error: expected ':', ',', ';', '}' or '__attribute__' before '____cacheline_aligned_in_smp'
include/net/dst_ops.h: In function 'dst_entries_get_fast':
include/net/dst_ops.h:33: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_get_slow':
include/net/dst_ops.h:41: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_add':
include/net/dst_ops.h:49: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_init':
include/net/dst_ops.h:55: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_destroy':
include/net/dst_ops.h:60: error: 'struct dst_ops' has no member named 'pcpuc_entries'
make[5]: *** [drivers/net/wireless/b43legacy/xmit.o] Error 1
make[5]: *** Waiting for unfinished jobs....
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits)
inet_diag: Make sure we actually run the same bytecode we audited.
netlink: Make nlmsg_find_attr take a const nlmsghdr*.
fib: fib_result_assign() should not change fib refcounts
netfilter: ip6_tables: fix information leak to userspace
cls_cgroup: Fix crash on module unload
memory corruption in X.25 facilities parsing
net dst: fix percpu_counter list corruption and poison overwritten
rds: Remove kfreed tcp conn from list
rds: Lost locking in loop connection freeing
de2104x: fix panic on load
atl1 : fix panic on load
netxen: remove unused firmware exports
caif: Remove noisy printout when disconnecting caif socket
caif: SPI-driver bugfix - incorrect padding.
caif: Bugfix for socket priority, bindtodev and dbg channel.
smsc911x: Set Ethernet EEPROM size to supported device's size
ipv4: netfilter: ip_tables: fix information leak to userland
ipv4: netfilter: arp_tables: fix information leak to userland
cxgb4vf: remove call to stop TX queues at load time.
cxgb4: remove call to stop TX queues at load time.
...
This will let us use it on a nlmsghdr stored inside a netlink_callback.
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes:
o Bugfix: SO_PRIORITY for SOL_SOCKET could not be handled
in caif's setsockopt, using the struct sock attribute priority instead.
o Bugfix: SO_BINDTODEVICE for SOL_SOCKET could not be handled
in caif's setsockopt, using the struct sock attribute ifindex instead.
o Wrong assert statement for RFM layer segmentation.
o CAIF Debug channels was not working over SPI, caif_payload_info
containing padding info must be initialized.
o Check on pointer before dereferencing when unregister dev in caif_dev.c
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
b43: Fix warning at drivers/mmc/core/core.c:237 in mmc_wait_for_cmd
mac80211: fix failure to check kmalloc return value in key_key_read
libertas: Fix sd8686 firmware reload
ath9k: Fix incorrect access of rate flags in RC
netfilter: xt_socket: Make tproto signed in socket_mt6_v1().
stmmac: enable/disable rx/tx in the core with a single write.
net: atarilance - flags should be unsigned long
netxen: fix kdump
pktgen: Limit how much data we copy onto the stack.
net: Limit socket I/O iovec total length to INT_MAX.
USB: gadget: fix ethernet gadget crash in gether_setup
fib: Fix fib zone and its hash leak on namespace stop
cxgb3: Fix panic in free_tx_desc()
cxgb3: fix crash due to manipulating queues before registration
8390: Don't oops on starting dev queue
dccp ccid-2: Stop polling
dccp: Refine the wait-for-ccid mechanism
dccp: Extend CCID packet dequeueing interface
dccp: Return-value convention of hc_tx_send_packet()
igbvf: fix panic on load
...
When we stop a namespace we flush the table and free one, but the
added fn_zone-s (and their hashes if grown) are leaked. Need to free.
Tries releases all its stuff in the flushing code.
Shame on us - this bug exists since the very first make-fib-per-net
patches in 2.6.27 :(
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>