commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).
We miss needed barriers, even on x86, when y is not NULL.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1958 commits)
net: pack skb_shared_info more efficiently
net_sched: red: split red_parms into parms and vars
net_sched: sfq: extend limits
cnic: Improve error recovery on bnx2x devices
cnic: Re-init dev->stats_addr after chip reset
net_sched: Bug in netem reordering
bna: fix sparse warnings/errors
bna: make ethtool_ops and strings const
xgmac: cleanups
net: make ethtool_ops const
vmxnet3" make ethtool ops const
xen-netback: make ops structs const
virtio_net: Pass gfp flags when allocating rx buffers.
ixgbe: FCoE: Add support for ndo_get_fcoe_hbainfo() call
netdev: FCoE: Add new ndo_get_fcoe_hbainfo() call
igb: reset PHY after recovering from PHY power down
igb: add basic runtime PM support
igb: Add support for byte queue limits.
e1000: cleanup CE4100 MDIO registers access
e1000: unmap ce4100_gbe_mdio_base_virt in e1000_remove
...
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.
ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument"). It should be translated as ENOTTY
("Inappropriate ioctl for device").
That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad. We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually. In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define special location values for RX NFC that request the driver to
select the actual rule location. This allows for implementation on
devices that use hash-based filter lookup, whereas currently the API is
more suited to devices with TCAM lookup or linear search.
In ethtool_set_rxnfc() and the compat wrapper ethtool_ioctl(), copy
the structure back to user-space after insertion so that the actual
location is returned.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds in the infrastructure code to create the network priority
cgroup. The cgroup, in addition to the standard processes file creates two
control files:
1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map
2) priomap - This is a writeable file. On read it reports a table of 2-tuples
<name:priority> where name is the name of a network interface and priority is
indicates the priority assigned to frames egresessing on the named interface and
originating from a pid in this cgroup
This cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is benenficial for DCB enabled systems, in that it allows for any
application to use dcb configured priorities so without application modification
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 802.1X EAPOL handshake hostapd does requires
knowing whether the frame was ack'ed by the peer.
Currently, we fudge this pretty badly by not even
transmitting the frame as a normal data frame but
injecting it with radiotap and getting the status
out of radiotap monitor as well. This is rather
complex, confuses users (mon.wlan0 presence) and
doesn't work with all hardware.
To get rid of that hack, introduce a real wifi TX
status option for data frame transmissions.
This works similar to the existing TX timestamping
in that it reflects the SKB back to the socket's
error queue with a SCM_WIFI_STATUS cmsg that has
an int indicating ACK status (0/1).
Since it is possible that at some point we will
want to have TX timestamping and wifi status in a
single errqueue SKB (there's little point in not
doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
to SO_EE_ORIGIN_TXSTATUS which can collect more
than just the timestamp; keep the old constant
as an alias of course. Currently the internal APIs
don't make that possible, but it wouldn't be hard
to split them up in a way that makes it possible.
Thanks to Neil Horman for helping me figure out
the functions that add the control messages.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Dereferencing a user pointer directly from kernel-space without going
through the copy_from_user family of functions is a bad idea. Two of
such usages can be found in the sendmsg code path called from sendmmsg,
added by
commit c71d8ebe7a upstream.
commit 5b47b8038f in the 3.0-stable tree.
Usages are performed through memcmp() and memcpy() directly. Fix those
by using the already copied msg_sys structure instead of the __user *msg
structure. Note that msg_sys can be set to NULL by verify_compat_iovec()
or verify_iovec(), which requires additional NULL pointer checks.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@ev0ke.net>
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Anton Blanchard <anton@samba.org>
CC: David S. Miller <davem@davemloft.net>
CC: stable <stable@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sendmmsg() introduced by commit 228e548e "net: Add sendmmsg socket system
call" is capable of sending to multiple different destination addresses.
SMACK is using destination's address for checking sendmsg() permission.
However, security_socket_sendmsg() is called for only once even if multiple
different destination addresses are passed to sendmmsg().
Therefore, we need to call security_socket_sendmsg() for each destination
address rather than only the first destination address.
Since calling security_socket_sendmsg() every time when only single destination
address was passed to sendmmsg() is a waste of time, omit calling
security_socket_sendmsg() unless destination address of previous datagram and
that of current datagram differs.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org> [3.0+]
Signed-off-by: David S. Miller <davem@davemloft.net>
To limit the amount of time we can spend in sendmmsg, cap the
number of elements to UIO_MAXIOV (currently 1024).
For error handling an application using sendmmsg needs to retry at
the first unsent message, so capping is simpler and requires less
application logic than returning EINVAL.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org> [3.0+]
Signed-off-by: David S. Miller <davem@davemloft.net>
sendmmsg uses a similar error return strategy as recvmmsg but it
turns out to be a confusing way to communicate errors.
The current code stores the error code away and returns it on the next
sendmmsg call. This means a call with completely valid arguments could
get an error from a previous call.
Change things so we only return an error if no datagrams could be sent.
If less than the requested number of messages were sent, the application
must retry starting at the first failed one and if the problem is
persistent the error will be returned.
This matches the behaviour of other syscalls like read/write - it
is not an error if less than the requested number of elements are sent.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org> [3.0+]
Signed-off-by: David S. Miller <davem@davemloft.net>
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.
Convert all rcu_assign_pointer of NULL value.
//smpl
@@ expression P; @@
- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)
// </smpl>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
tg3: Remove 5719 jumbo frames and TSO blocks
tg3: Break larger frags into 4k chunks for 5719
tg3: Add tx BD budgeting code
tg3: Consolidate code that calls tg3_tx_set_bd()
tg3: Add partial fragment unmapping code
tg3: Generalize tg3_skb_error_unmap()
tg3: Remove short DMA check for 1st fragment
tg3: Simplify tx bd assignments
tg3: Reintroduce tg3_tx_ring_info
ASIX: Use only 11 bits of header for data size
ASIX: Simplify condition in rx_fixup()
Fix cdc-phonet build
bonding: reduce noise during init
bonding: fix string comparison errors
net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
net: add IFF_SKB_TX_SHARED flag to priv_flags
net: sock_sendmsg_nosec() is static
forcedeth: fix vlans
gianfar: fix bug caused by 87c288c6e9
gro: Only reset frag0 when skb can be pulled
...
Workloads using pipes and sockets hit inode_sb_list_lock contention.
superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...
Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
recvmmsg fails on a raw socket with EINVAL. The reason for this is
packet_recvmsg checks the incoming flags:
err = -EINVAL;
if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT|MSG_ERRQUEUE))
goto out;
This patch strips out MSG_WAITFORONE when calling recvmmsg which
fixes the issue.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org [2.6.34+]
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a multiple message send syscall and is the send
version of the existing recvmmsg syscall. This is heavily
based on the patch by Arnaldo that added recvmmsg.
I wrote a microbenchmark to test the performance gains of using
this new syscall:
http://ozlabs.org/~anton/junkcode/sendmmsg_test.c
The test was run on a ppc64 box with a 10 Gbit network card. The
benchmark can send both UDP and RAW ethernet packets.
64B UDP
batch pkts/sec
1 804570
2 872800 (+ 8 %)
4 916556 (+14 %)
8 939712 (+17 %)
16 952688 (+18 %)
32 956448 (+19 %)
64 964800 (+20 %)
64B raw socket
batch pkts/sec
1 1201449
2 1350028 (+12 %)
4 1461416 (+22 %)
8 1513080 (+26 %)
16 1541216 (+28 %)
32 1553440 (+29 %)
64 1557888 (+30 %)
We see a 20% improvement in throughput on UDP send and 30%
on raw socket send.
[ Add sparc syscall entries. -DaveM ]
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to add an ntuple data extensions to the rx network flow
classification specifiers. The idea is to allow ntuple to be displayed via
the network flow classification interface.
The first patch had some left over stuff from the original flow extension
flags I had added. That bit is removed in this patch.
The second had some left over comments that stated we ignored bits in the
masks when we actually match them.
This work is based on input from Ben Hutchings.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This structure was accidentally defined such that its layout can
differ between 32-bit and 64-bit processes. Add compat structure
definitions and an ioctl wrapper function.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: stable@kernel.org [2.6.30+]
Signed-off-by: David S. Miller <davem@davemloft.net>