Commit Graph

115 Commits

Author SHA1 Message Date
Alexei Starovoitov
38b3629adb net: bpf: fix bpf syscall dependence on anon_inodes
minimal configurations where EPOLL, PERF_EVENTS, etc are disabled,
but NET is enabled, are failing to build with link error:
kernel/built-in.o: In function `bpf_prog_load':
syscall.c:(.text+0x3b728): undefined reference to `anon_inode_getfd'

fix it by selecting ANON_INODES when NET is enabled

Reported-by: Michal Sojka <sojkam1@fel.cvut.cz>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10 15:02:23 -04:00
Florian Westphal
57f5877c11 netfilter: bridge: build br_nf_core only if required
Eric reports build failure with
CONFIG_BRIDGE_NETFILTER=n

We insist to build br_nf_core.o unconditionally, but we must only do so
if br_netfilter was enabled, else it fails to build due to
functions being defined to empty stubs (and some structure members
being defined out).

Also, BRIDGE_NETFILTER=y|m makes no sense when BRIDGE=n.

Fixes: 34666d467 (netfilter: bridge: move br_netfilter out of the core)
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 14:07:51 -04:00
Pablo Neira Ayuso
34666d467c netfilter: bridge: move br_netfilter out of the core
Jesper reported that br_netfilter always registers the hooks since
this is part of the bridge core. This harms performance for people that
don't need this.

This patch modularizes br_netfilter so it can be rmmod'ed, thus,
the hooks can be unregistered. I think the bridge netfilter should have
been a separated module since the beginning, Patrick agreed on that.

Note that this is breaking compatibility for users that expect that
bridge netfilter is going to be available after explicitly 'modprobe
bridge' or via automatic load through brctl.

However, the damage can be easily undone by modprobing br_netfilter.
The bridge core also spots a message to provide a clue to people that
didn't notice that this has been deprecated.

On top of that, the plan is that nftables will not rely on this software
layer, but integrate the connection tracking into the bridge layer to
enable stateful filtering and NAT, which is was bridge netfilter users
seem to require.

This patch still keeps the fake_dst_ops in the bridge core, since this
is required by when the bridge port is initialized. So we can safely
modprobe/rmmod br_netfilter anytime.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2014-09-26 18:42:31 +02:00
Alexander Aring
2c6bed7cfc 6lowpan: introduce new net/6lowpan directory
This patch moves generic code which is used by bluetooth and ieee802154
6lowpan to a new net/6lowpan directory. This directory contains generic
6LoWPAN code which is shared between bluetooth and ieee802154 MAC-Layer.

This is the IPHC - "IPv6 Header Compression" format at the moment. Which
is described by RFC 6282 [0]. The BLTE 6LoWPAN draft describes that the
IPHC is the same format like IEEE 802.15.4, see [1].

Futuremore we can put more code into this directory which is shared
between BLTE and IEEE 802.15.4 6LoWPAN like RFC 6775 or the routing
protocol RPL RFC 6550.

To avoid naming conflicts I renamed 6lowpan-y to ieee802154_6lowpan-y
in net/ieee802154/Makefile.

[0] http://tools.ietf.org/html/rfc6282
[1] http://tools.ietf.org/html/draft-ietf-6lowpan-btle-12#section-3.2
[2] http://tools.ietf.org/html/rfc6775
[3] http://tools.ietf.org/html/rfc6550

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-07-12 01:53:30 +02:00
Linus Torvalds
32d01dc7be Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot updates for cgroup:

   - The biggest one is cgroup's conversion to kernfs.  cgroup took
     after the long abandoned vfs-entangled sysfs implementation and
     made it even more convoluted over time.  cgroup's internal objects
     were fused with vfs objects which also brought in vfs locking and
     object lifetime rules.  Naturally, there are places where vfs rules
     don't fit and nasty hacks, such as credential switching or lock
     dance interleaving inode mutex and cgroup_mutex with object serial
     number comparison thrown in to decide whether the operation is
     actually necessary, needed to be employed.

     After conversion to kernfs, internal object lifetime and locking
     rules are mostly isolated from vfs interactions allowing shedding
     of several nasty hacks and overall simplification.  This will also
     allow implmentation of operations which may affect multiple cgroups
     which weren't possible before as it would have required nesting
     i_mutexes.

   - Various simplifications including dropping of module support,
     easier cgroup name/path handling, simplified cgroup file type
     handling and task_cg_lists optimization.

   - Prepatory changes for the planned unified hierarchy, which is still
     a patchset away from being actually operational.  The dummy
     hierarchy is updated to serve as the default unified hierarchy.
     Controllers which aren't claimed by other hierarchies are
     associated with it, which BTW was what the dummy hierarchy was for
     anyway.

   - Various fixes from Li and others.  This pull request includes some
     patches to add missing slab.h to various subsystems.  This was
     triggered xattr.h include removal from cgroup.h.  cgroup.h
     indirectly got included a lot of files which brought in xattr.h
     which brought in slab.h.

  There are several merge commits - one to pull in kernfs updates
  necessary for converting cgroup (already in upstream through
  driver-core), others for interfering changes in the fixes branch"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
  cgroup: remove useless argument from cgroup_exit()
  cgroup: fix spurious lockdep warning in cgroup_exit()
  cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
  cgroup: break kernfs active_ref protection in cgroup directory operations
  cgroup: fix cgroup_taskset walking order
  cgroup: implement CFTYPE_ONLY_ON_DFL
  cgroup: make cgrp_dfl_root mountable
  cgroup: drop const from @buffer of cftype->write_string()
  cgroup: rename cgroup_dummy_root and related names
  cgroup: move ->subsys_mask from cgroupfs_root to cgroup
  cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
  cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
  cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
  cgroup: reorganize cgroup bootstrapping
  cgroup: relocate setting of CGRP_DEAD
  cpuset: use rcu_read_lock() to protect task_cs()
  cgroup_freezer: document freezer_fork() subtleties
  cgroup: update cgroup_transfer_tasks() to either succeed or fail
  cgroup: drop task_lock() protection around task->cgroups
  cgroup: update how a newly forked task gets associated with css_set
  ...
2014-04-03 13:05:42 -07:00
Daniel Borkmann
408eccce32 net: ptp: move PTP classifier in its own file
This commit fixes a build error reported by Fengguang, that is
triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

  ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

The fix is to introduce its own file for the PTP BPF classifier,
so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
it independently from each other. IXP4xx driver on ARM needs to
select it as well since it does not seem to select PTP_1588_CLOCK
or similar that would pull it in automatically.

This also allows for hiding all of the internals of the BPF PTP
program inside that file, and only exporting relevant API bits
to drivers.

This patch also adds a kdoc documentation of ptp_classify_raw()
API to make it clear that it can return PTP_CLASS_* defines. Also,
the BPF program has been translated into bpf_asm code, so that it
can be more easily read and altered (extensively documented in [1]).

In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
tools, so the commented program can simply be translated via
`./bpf_asm -c prog` where prog is a file that contains the
commented code. This makes it easily readable/verifiable and when
there's a need to change something, jump offsets etc do not need
to be replaced manually which can be very error prone. Instead,
a newly translated version via bpf_asm can simply replace the old
code. I have checked opcode diffs before/after and it's the very
same filter.

  [1] Documentation/networking/filter.txt

Fixes: 164d8c6665 ("net: ptp: do not reimplement PTP/BPF classifier")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-01 16:43:18 -04:00
Tejun Heo
af6363374c cgroup: make CONFIG_CGROUP_NET_PRIO bool and drop unnecessary init_netclassid_cgroup()
net_prio is the only cgroup which is allowed to be built as a module.
The savings from allowing one controller to be built as a module are
tiny especially given that cgroup module support itself adds quite a
bit of complexity.

Given that none of other controllers has much chance of being made a
module and that we're unlikely to add new modular controllers, the
added complexity is simply not justifiable.

As a first step to drop cgroup module support, this patch changes the
config option to bool from tristate and drops module related code from
it.

Also, while an earlier commit fe1217c4f3 ("net: net_cls: move
cgroupfs classid handling into core") dropped module support from
net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
noop for built-in controllers.  Drop it along with
init_netclassid_cgroup().

v2: Removed modular version of task_netprioidx() in
    include/net/netprio_cgroup.h as suggested by Li Zefan.

v3: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
    classid handling into core").  net_cls cgroup part is mostly
    dropped except for removal of init_netclassid_cgroup().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
2014-02-08 10:36:58 -05:00
Daniel Borkmann
86f8515f97 net: netprio: rename config to be more consistent with cgroup configs
While we're at it and introduced CGROUP_NET_CLASSID, lets also make
NETPRIO_CGROUP more consistent with the rest of cgroups and rename it
into CONFIG_CGROUP_NET_PRIO so that for networking, we now have
CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG
option consistent among networking cgroups, but also among cgroups
CONFIG conventions in general as the vast majority has a prefix of
CONFIG_CGROUP_<SUBSYS>.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:42 +01:00
Daniel Borkmann
fe1217c4f3 net: net_cls: move cgroupfs classid handling into core
Zefan Li requested [1] to perform the following cleanup/refactoring:

- Split cgroupfs classid handling into net core to better express a
  possible more generic use.

- Disable module support for cgroupfs bits as the majority of other
  cgroupfs subsystems do not have that, and seems to be not wished
  from cgroup side. Zefan probably might want to follow-up for netprio
  later on.

- By this, code can be further reduced which previously took care of
  functionality built when compiled as module.

cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
that we are consistent with {netclassid,netprio}_cgroup naming that is
under net/core/ as suggested by Zefan.

No change in functionality, but only code refactoring that is being
done here.

 [1] http://patchwork.ozlabs.org/patch/304825/

Suggested-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:41 +01:00
Yuanhan Liu
044c8d4b15 kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
Remove CONFIG_USE_GENERIC_SMP_HELPERS left by commit 0a06ff068f
("kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS").

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Arvid Brodin
f421436a59 net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)
High-availability Seamless Redundancy ("HSR") provides instant failover
redundancy for Ethernet networks. It requires a special network topology where
all nodes are connected in a ring (each node having two physical network
interfaces). It is suited for applications that demand high availability and
very short reaction time.

HSR acts on the Ethernet layer, using a registered Ethernet protocol type to
send special HSR frames in both directions over the ring. The driver creates
virtual network interfaces that can be used just like any ordinary Linux
network interface, for IP/TCP/UDP traffic etc. All nodes in the network ring
must be HSR capable.

This code is a "best effort" to comply with the HSR standard as described in
IEC 62439-3:2010 (HSRv0).

Signed-off-by: Arvid Brodin <arvid.brodin@xdin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-03 23:20:14 -05:00
Martin Schwidefsky
0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
David S. Miller
0e76a3a587 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 21:36:46 -07:00
Cong Wang
e0d1095ae3 net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.

Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 15:11:17 -07:00
Thomas Graf
ffd756b317 pktgen: Require CONFIG_INET due to use of IPv4 checksum function
Unlike for IPv6, the IPv4 checksum functions are only available
if CONFIG_INET is set.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 16:45:09 -07:00
Linus Torvalds
496322bc91 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "This is a re-do of the net-next pull request for the current merge
  window.  The only difference from the one I made the other day is that
  this has Eliezer's interface renames and the timeout handling changes
  made based upon your feedback, as well as a few bug fixes that have
  trickeled in.

  Highlights:

   1) Low latency device polling, eliminating the cost of interrupt
      handling and context switches.  Allows direct polling of a network
      device from socket operations, such as recvmsg() and poll().

      Currently ixgbe, mlx4, and bnx2x support this feature.

      Full high level description, performance numbers, and design in
      commit 0a4db187a9 ("Merge branch 'll_poll'")

      From Eliezer Tamir.

   2) With the routing cache removed, ip_check_mc_rcu() gets exercised
      more than ever before in the case where we have lots of multicast
      addresses.  Use a hash table instead of a simple linked list, from
      Eric Dumazet.

   3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from
      Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
      Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

   4) Support reporting the TUN device persist flag to userspace, from
      Pavel Emelyanov.

   5) Allow controlling network device VF link state using netlink, from
      Rony Efraim.

   6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

   7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
      Daniel Borkmann and Eric Dumazet.

   8) Allow controlling of TCP quickack behavior on a per-route basis,
      from Cong Wang.

   9) Several bug fixes and improvements to vxlan from Stephen
      Hemminger, Pravin B Shelar, and Mike Rapoport.  In particular,
      support receiving on multiple UDP ports.

  10) Major cleanups, particular in the area of debugging and cookie
      lifetime handline, to the SCTP protocol code.  From Daniel
      Borkmann.

  11) Allow packets to cross network namespaces when traversing tunnel
      devices.  From Nicolas Dichtel.

  12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
      manner akin to how we monitor real network traffic via ptype_all.
      From Daniel Borkmann.

  13) Several bug fixes and improvements for the new alx device driver,
      from Johannes Berg.

  14) Fix scalability issues in the netem packet scheduler's time queue,
      by using an rbtree.  From Eric Dumazet.

  15) Several bug fixes in TCP loss recovery handling, from Yuchung
      Cheng.

  16) Add support for GSO segmentation of MPLS packets, from Simon
      Horman.

  17) Make network notifiers have a real data type for the opaque
      pointer that's passed into them.  Use this to properly handle
      network device flag changes in arp_netdev_event().  From Jiri
      Pirko and Timo Teräs.

  18) Convert several drivers over to module_pci_driver(), from Peter
      Huewe.

  19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
      O(1) calculation instead.  From Eric Dumazet.

  20) Support setting of explicit tunnel peer addresses in ipv6, just
      like ipv4.  From Nicolas Dichtel.

  21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

  22) Prevent a single high rate flow from overruning an individual cpu
      during RX packet processing via selective flow shedding.  From
      Willem de Bruijn.

  23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
      Dumazet.

  24) Don't just drop GSO packets which are above the TBF scheduler's
      burst limit, chop them up so they are in-bounds instead.  Also
      from Eric Dumazet.

  25) VLAN offloads are missed when configured on top of a bridge, fix
      from Vlad Yasevich.

  26) Support IPV6 in ping sockets.  From Lorenzo Colitti.

  27) Receive flow steering targets should be updated at poll() time
      too, from David Majnemer.

  28) Fix several corner case regressions in PMTU/redirect handling due
      to the routing cache removal, from Timo Teräs.

  29) We have to be mindful of ipv4 mapped ipv6 sockets in
      upd_v6_push_pending_frames().  From Hannes Frederic Sowa.

  30) Fix L2TP sequence number handling bugs, from James Chapman."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
  drivers/net: caif: fix wrong rtnl_is_locked() usage
  drivers/net: enic: release rtnl_lock on error-path
  vhost-net: fix use-after-free in vhost_net_flush
  net: mv643xx_eth: do not use port number as platform device id
  net: sctp: confirm route during forward progress
  virtio_net: fix race in RX VQ processing
  virtio: support unlocked queue poll
  net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
  Documentation: Fix references to defunct linux-net@vger.kernel.org
  net/fs: change busy poll time accounting
  net: rename low latency sockets functions to busy poll
  bridge: fix some kernel warning in multicast timer
  sfc: Fix memory leak when discarding scattered packets
  sit: fix tunnel update via netlink
  dt:net:stmmac: Add dt specific phy reset callback support.
  dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
  dt:net:stmmac: Allocate platform data only if its NULL.
  net:stmmac: fix memleak in the open method
  ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
  net: ipv6: fix wrong ping_v6_sendmsg return value
  ...
2013-07-09 18:24:39 -07:00
Eliezer Tamir
89bf1b5a68 net: remove NET_LL_RX_POLL config menue
Remove NET_LL_RX_POLL from the config menu.
Change default to y.
Busy polling still needs to be enabled at run time.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:14 -07:00
Eliezer Tamir
9a3c71aa80 net: convert low latency sockets to sched_clock()
Use sched_clock() instead of get_cycles().
We can use sched_clock() because we don't care much about accuracy.
Remove the dependency on X86_TSC

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:14 -07:00
Eliezer Tamir
0602129286 net: add low latency socket poll
Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:35 -07:00
Andy Shevchenko
4cd5773a2a net: core: move mac_pton() to lib/net_utils.c
Since we have at least one user of this function outside of CONFIG_NET
scope, we have to provide this function independently. The proposed
solution is to move it under lib/net_utils.c with corresponding
configuration variable and select wherever it is needed.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-05 12:00:27 -07:00
Simon Horman
0d89d2035f MPLS: Add limited GSO support
In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

SKB Usage:

When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:

* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
  skb->inner_protocol is added by this patch.

* Set skb->protocol to the new MPLS ethertype of the packet.

* Set skb->network_header to correspond to the
  end of the L3 header, including the MPLS label stack.

I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code.  The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.

Features:

I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols.  Jesse, I understand that
you have some ideas here.  I am more than happy to change my implementation.

This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.

A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload.  Currently no devices support this
and MPLS GSO always falls back to software.

Alternate Implementation:

One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().

I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.

MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.

Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27 22:50:59 -07:00
Willem de Bruijn
99bbc70741 rps: selective flow shedding during softnet overflow
A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu processing rate and will
degrade throughput of other flows hashed onto the same cpu.

This patch adds a more fine grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured
by counting the number of packets per flow over the last 256 packets
from the source cpu. Any flow that occupies a large fraction of this
(set at 50%) will see packet drop while above the threshold.

Tested:
Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0 and application threads on cpus 2--7
each handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, antagonist overload is
dropped and the server processes its complete load.

The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is a extreme, but the same is
reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20 13:48:04 -07:00
Daniel Borkmann
ee1bec9b3b netlink: kconfig: move mmap i/o into netlink kconfig
Currently, in menuconfig, Netlink's new mmaped IO is the very first
entry under the ``Networking support'' item and comes even before
``Networking options'':

  [ ]   Netlink: mmaped IO
  Networking options  --->
  ...

Lets move this into ``Networking options'' under netlink's Kconfig,
since this might be more appropriate. Introduced by commit ccdfcc398
(``netlink: mmaped netlink: ring setup'').

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-01 15:02:42 -04:00
Patrick McHardy
ccdfcc3985 netlink: mmaped netlink: ring setup
Add support for mmap'ed RX and TX ring setup and teardown based on the
af_packet.c code. The following patches will use this to add the real
mmap'ed receive and transmit functionality.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:57:57 -04:00
Andrey Vagin
eaaa313926 netlink: Diag core and basic socket info dumping (v2)
The netlink_diag can be built as a module, just like it's done in
unix sockets.

The core dumping message carries the basic info about netlink sockets:
family, type and protocol, portis, dst_group, dst_portid, state.

Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.

Netlink sockets cab be filtered by protocols.

The socket inode number and cookie is reserved for future per-socket info
retrieving. The per-protocol filtering is also reserved for future by
requiring the sdiag_protocol to be zero.

The file /proc/net/netlink doesn't provide enough information for
dumping netlink sockets. It doesn't provide dst_group, dst_portid,
groups above 32.

v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.

Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 12:38:03 -04:00