With the recent changes for how we compute the skb truesize it occurs to me
we are probably going to have a lot of calls to skb_end_pointer -
skb->head. Instead of running all over the place doing that it would make
more sense to just make it a separate inline skb_end_offset(skb) that way
we can return the correct value without having gcc having to do all the
optimization to cancel out skb->head - skb->head.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With multiple cards is hard to figure out which port caused trouble
int the layer2 routines (e.g. got a timeout).
Now we have the informations in the log output.
Signed-off-by: Karsten Keil <kkeil@linux-pingi.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
For certification test it is very useful to change the layer1
timer3 value on runtime.
Signed-off-by: Karsten Keil <kkeil@linux-pingi.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be full preemptiv safe, we cannot handle a L2 timeout in the timer
context itself, we should do all actions via the D-channel thread.
Signed-off-by: Karsten Keil <kkeil@linux-pingi.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for a skb_head_is_locked helper function. It is
meant to be used any time we are considering transferring the head from
skb->head to a paged frag. If the head is locked it means we cannot remove
the head from the skb so it must be copied or we must take the skb as a
whole.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed. When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are
less than 4 to recover losses by fast recovery instead of timeout.
While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enter
false recovery and degrade performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. Currently
ER is conservative and is disabled for the rest of the connection
after the first reordering event. A large scale web server
experiment on the performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP”,
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
Note that Linux has a similar feature called THIN_DUPACK. The
differences are THIN_DUPACK do not mitigate reorderings and is only
used after slow start. Currently ER is disabled if THIN_DUPACK is
enabled. I would be happy to merge THIN_DUPACK feature with ER if
people think it's a good idea.
ER is enabled by sysctl_tcp_early_retrans:
0: Disables ER
1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
2: (Default) reduce dupthresh like mode 1. In addition, delay
entering fast recovery by RTT/4.
Note: mode 2 is implemented in the third part of this patch series.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove useless casts and rename variables for less confusion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L2TPv3 defines an IP encapsulation packet format where data is carried
directly over IP (no UDP). The kernel already has support for L2TP IP
encapsulation over IPv4 (l2tp_ip). This patch introduces support for
L2TP IP encapsulation over IPv6.
The implementation is derived from ipv6/raw and ipv4/l2tp_ip.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for unmanaged L2TPv3 tunnels over IPv6 using
the netlink API. We already support unmanaged L2TPv3 tunnels over
IPv4. A patch to iproute2 to make use of this feature will be
submitted separately.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Checkpatch warns about the use of __attribute__((packed)). So use the
recommended __packed syntax instead.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback the data cannot be converted to a page fragment if
needed.
We have three spots were it hurts :
1) GRO aggregation
When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, very inefficient since we keep all struct sk_buff
around. So drivers enabling GRO but delivering linear skbs to network
stack aren't enabling full GRO power.
2) splice(socket -> pipe).
We must copy the linear part to a page fragment.
This kind of defeats splice() purpose (zero copy claim)
3) TCP coalescing.
Recently introduced, this permits to group several contiguous segments
into a single skb. This shortens queue lengths and save kernel memory,
and greatly reduce probabilities of TCP collapses. This coalescing
doesnt work on linear skbs (or we would need to copy data, this would be
too slow)
Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.
build_skb() is changed to accept a frag_size argument. Drivers willing
to provide a page fragment instead of kmalloc() data will set a non zero
value, set to the fragment size.
Then, on situations we need to convert the skb head to a frag in itself,
we can check if skb->head_frag is set and avoid the copies or various
fallbacks we have.
This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize). (thats 512 or 1024 bytes saved per skb). This also makes
bpf/netfilter faster since the 'first frag' will be part of skb linear
part, no need to copy data.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that encap_rcv() works on IPv6 UDP sockets, wire L2TP up to IPv6.
Support has been tested with and without hardware offloading. This
version fixes the L2TP over localhost issue with incorrect checksums
being reported.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't pick __u8/__u16 values directly from raw pointers, but instead use
an array of structures of code:value pairs. This is OK, since the buffer
we take options from is not an skb memory, but a user-to-kernel one.
For those options which don't require any value now, require this to be
zero (for potential future extension of this API).
v2: Changed tcp_repair_opt to use two __u32-s as spotted by David Laight.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")
The former moved around the sysctl register/unregister calls, the
later simply removed them.
With help from Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull powerpc fixes from Benjamin Herrenschmidt:
"Here are a few fixes for powerpc. Note the addition to the generic
irq.h. This is part of a 3-patches regression fix for mpic due to
changes in how IRQ_TYPE_NONE is being handled. Thomas agreed to the
addition of the new IRQ_TYPE_DEFAULT contant, however he hasn't
replied with an Ack to the actual patch yet. I don't to wait much
longer with these patches tho."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/mpic: Properly set default triggers
irq: Add IRQ_TYPE_DEFAULT for use by PIC drivers
powerpc/mpic: Fix confusion between hw_irq and virq
powerpc/pmac: Don't add_timer() twice
powerpc/eeh: Fix crash caused by null eeh_dev
powerpc/mpc85xx: add MPIC message dts node
powerpc/mpic_msgr: fix offset error when setting mer register
powerpc/mpic_msgr: add lock for MPIC message global variable
powerpc/mpic_msgr: fix compile error when SMP disabled
powerpc: fix build when CONFIG_BOOKE_WDT is enabled
powerpc/85xx: don't call of_platform_bus_probe() twice
Pull networking fixes from David Miller:
1) Fix namespace init and cleanup in phonet to fix some oopses, from
Eric W. Biederman.
2) Missing kfree_skb() in AF_KEY, from Julia Lawall.
3) Refcount leak and source address handling fix in l2tp from James
Chapman.
4) Memory leak fix in CAIF from Tomasz Gregorek.
5) When routes are cloned from ipv6 addrconf routes, we don't process
expirations properly. Fix from Gao Feng.
6) Fix panic on DMA errors in atl1 driver, from Tony Zelenoff.
7) Only enable interrupts in 8139cp driver after we've registered the
IRQ handler. From Jason Wang.
8) Fix too many reads of KS_CIDER register in ks8851 during probe,
fixing crashes on spurious interrupts. From Matt Renzelmann.
9) Missing include in ath5k driver and missing iounmap on probe
failure, from Jonathan Bither.
10) Fix RX packet handling in smsc911x driver, from Will Deacon.
11) Fix ixgbe WoL on fiber by leaving the laser on during shutdown.
12) ks8851 needs MAX_RECV_FRAMES increased otherwise the internal MAC
buffers are easily overflown. Fix from Davide Cimingahi.
13) Fix memory leaks in peak_usb CAN driver, from Jesper Juhl.
14) gred packet scheduler can dump in WRED more when doing a netlink
dump. Fix from David Ward.
15) Fix MTU in USB smsc75xx driver, from Stephane Fillod.
16) Dummy device needs ->ndo_uninit handler to properly handle
->ndo_init failures. From Hiroaki SHIMODA.
17) Fix TX fragmentation in ath9k driver, from Sujith Manoharan.
18) Missing RTNL lock in ixgbe PM resume, from Benjamin Poirier.
19) Missing iounmap in farsync WAN driver, from Julia Lawall.
20) With LRO/GRO, tcp_grow_window() is easily tricked into not growing
the receive window properly, and this hurts performance. Fix from
Eric Dumazet.
21) Network namespace init failure can leak net_generic data, fix from
Julian Anastasov.
22) Fix skb_over_panic due to mis-accounting in TCP for partially ACK'd
SKBs. From Eric Dumazet.
23) New IDs for qmi_wwan driver, from Bjørn Mork.
24) Fix races in ax25_exit(), from Eric W. Biederman.
25) IPV6 TCP doesn't handle TCP_MAXSEG socket option properly, copy over
logic from the IPV4 side. From Neal Cardwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
tcp: fix TCP_MAXSEG for established IPv6 passive sockets
drivers/net: Do not free an IRQ if its request failed
drop_monitor: allow more events per second
ks8851: Fix request_irq/free_irq mismatch
net/hyperv: Adding cancellation to ensure rndis filter is closed
ks8851: Fix mutex deadlock in ks8851_net_stop()
net ax25: Reorder ax25_exit to remove races.
icplus: fix interrupt for IC+ 101A/G and 1001LF
net: qmi_wwan: support Sierra Wireless MC77xx devices in QMI mode
bnx2x: off by one in bnx2x_ets_e3b0_sp_pri_to_cos_set()
ksz884x: don't copy too much in netdev_set_mac_address()
tcp: fix retransmit of partially acked frames
netns: do not leak net_generic data on failed init
net/sock.h: fix sk_peek_off kernel-doc warning
tcp: fix tcp_grow_window() for large incoming frames
drivers/net/wan/farsync.c: add missing iounmap
davinci_mdio: Fix MDIO timeout check
ipv6: clean up rt6_clean_expires
ipv6: fix rt6_update_expires
arcnet: rimi: Fix device name in debug output
...
This is meant typically to allow a PIC driver's irq domain map() callback
to establish sane defaults for the interrupt (and make sure that the HW
and the irq_desc are in sync as far as the trigger is concerned).
The irq core may not call the set_trigger callback if it thinks the
trigger is already set to the right setting, so we need to ensure new
descriptors are properly synchronized with the hardware.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch changes content of hashlist (used to get port struct by
computed index (0...en_port_count-1)). Now the hash list contains only
enabled ports so userspace will be able to say what ports can be used
for tx/rx. This becomes handy when userspace will need to disable ports
which does not belong to active aggregator. By default, newly added port
is enabled.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are options, which are set up on a socket while performing
TCP handshake. Need to resurrect them on a socket while repairing.
A new sockoption accepts a buffer and parses it. The buffer should
be CODE:VALUE sequence of bytes, where CODE is standard option
code and VALUE is the respective value.
Only 4 options should be handled on repaired socket.
To read 3 out of 4 of these options the TCP_INFO sockoption can be
used. An ability to get the last one (the mss_clamp) was added by
the previous patch.
Now the restore. Three of these options -- timestamp_ok, mss_clamp
and snd_wscale -- are just restored on a coket.
The sack_ok flags has 2 issues. First, whether or not to do sacks
at all. This flag is just read and set back. No other sack info is
saved or restored, since according to the standart and the code
dropping all sack-ed segments is OK, the sender will resubmit them
again, so after the repair we will probably experience a pause in
connection. Next, the fack bit. It's just set back on a socket if
the respective sysctl is set. No collected stats about packets flow
is preserved. As far as I see (plz, correct me if I'm wrong) the
fack-based congestion algorithm survives dropping all of the stats
and repairs itself eventually, probably losing the performance for
that period.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes (according the the previous description):
* TCP_REPAIR sockoption
This one just puts the socket in/out of the repair mode.
Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
When repair mode is turned off and the socket happens to be in
the established state the window probe is sent to the peer to
'unlock' the connection.
* TCP_REPAIR_QUEUE sockoption
This one sets the queue which we're about to repair. The
'no-queue' is set by default.
* TCP_QUEUE_SEQ socoption
Sets the write_seq/rcv_nxt of a selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state the other seq-s are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).
* Ability to forcibly bind a socket to a port
The sk->sk_reuse is set to SK_FORCE_REUSE.
* Immediate connect modification
The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.
* Silent close modification
The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to peer.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>