281 Commits

Author SHA1 Message Date
Eric Dumazet
140c55d4b5 net-timestamp: sock_tx_timestamp() fix
sock_tx_timestamp() should not ignore the initial *tx_flags value, as the
TCP stack can store SKBTX_SHARED_FRAG in it.

Also first argument (struct sock *) can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 12:38:07 -07:00
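The fix described in 140c55d4b5 can be illustrated with a small userspace model. This is only a sketch: the DEMO_* constants and both functions are hypothetical stand-ins, not kernel code, and the "old" variant assumes the pre-fix helper overwrote the incoming value, which is inferred from the commit message rather than shown in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel flag bits; the values are illustrative. */
#define DEMO_TSFLAG_TX_SOFTWARE (1u << 1) /* socket asked for software tx tstamps */
#define DEMO_SKBTX_SW_TSTAMP    (1u << 1) /* per-packet: take a software tstamp */
#define DEMO_SKBTX_SHARED_FRAG  (1u << 5) /* per-packet: set earlier by the TCP stack */

/* Assumed pre-fix shape: the caller's initial flags are thrown away. */
static void demo_tx_timestamp_old(uint16_t sk_tsflags, uint8_t *tx_flags)
{
    assert(tx_flags != NULL);
    *tx_flags = 0;
    if (sk_tsflags & DEMO_TSFLAG_TX_SOFTWARE)
        *tx_flags |= DEMO_SKBTX_SW_TSTAMP;
}

/* Fixed shape: start from the caller's value and only add bits to it. */
static void demo_tx_timestamp_fixed(uint16_t sk_tsflags, uint8_t *tx_flags)
{
    uint8_t flags;

    assert(tx_flags != NULL);
    flags = *tx_flags;
    if (sk_tsflags & DEMO_TSFLAG_TX_SOFTWARE)
        flags |= DEMO_SKBTX_SW_TSTAMP;
    *tx_flags = flags;
}
```

With the fixed variant, a packet that enters with DEMO_SKBTX_SHARED_FRAG set still carries that bit afterwards.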
Willem de Bruijn
e1c8a607b2 net-timestamp: ACK timestamp for bytestreams
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.

The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
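The trigger condition in e1c8a607b2 (timestamp when the cumulative ACK first moves beyond the tracked sequence number) can be modelled in userspace as follows. This is a minimal sketch under stated assumptions: struct demo_ack_tracker and both functions are hypothetical names, and only cumulative ACK values are fed in, mirroring the commit's statement that SACK and FACK are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
static bool demo_seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

struct demo_ack_tracker {
    uint32_t tracked_seq; /* sequence number of the last byte of the send() call */
    bool     pending;     /* true until the ACK timestamp has been generated */
};

/*
 * snd_una is the cumulative ACK point (first unacknowledged byte).
 * Returns true exactly once: the first time it passes tracked_seq.
 */
static bool demo_on_cumulative_ack(struct demo_ack_tracker *t, uint32_t snd_una)
{
    assert(t != NULL);
    if (!t->pending || !demo_seq_after(snd_una, t->tracked_seq))
        return false;
    t->pending = false; /* later ACKs must not generate a second timestamp */
    return true;
}
```

The signed-difference comparison keeps the check correct when sequence numbers wrap past 2^32.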
Willem de Bruijn
e7fd288538 net-timestamp: SCHED timestamp on entering packet scheduler
Kernel transmit latency is often incurred in the packet scheduler.
Introduce a new timestamp on transmission just before entering the
scheduler. When data travels through multiple devices (bonding,
tunneling, ...) each device will export an individual timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn
b9f40e21ef net-timestamp: move timestamp flags out of sk_flags
sk_flags is reaching its limit. New timestamping options will not fit.
Move all of them into a new field sk->sk_tsflags.

Added benefit is that this removes boilerplate code to convert between
SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.

SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
timestamp logic (netstamp_needed). That can be simplified and this
last key removed, but will leave that for a separate patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
though that scatters tstamp fields throughout the struct.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn
f24b9be595 net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
Applications that request kernel tx timestamps with SO_TIMESTAMPING
read timestamps as recvmsg() ancillary data. The response is defined
implicitly as timespec[3].

1) define struct scm_timestamping explicitly and

2) add support for new tstamp types. On tx, scm_timestamping always
   accompanies a sock_extended_err. Define previously unused field
   ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
   define the existing behavior.

The reception path is not modified. On rx, no struct similar to
sock_extended_err is passed along with SCM_TIMESTAMPING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:53 -07:00
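How an application might interpret the data described in f24b9be595 can be sketched like this. The struct below is a local mirror of the "timespec[3]" layout named in the commit, and the DEMO_TSTAMP_* codes are hypothetical stand-ins for the type values carried in ee_info; a real program would take both from the kernel's UAPI headers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Local mirror of the explicitly defined ancillary data layout. */
struct demo_scm_timestamping {
    struct timespec ts[3];
};

/* Hypothetical type codes for the value found in ee_info. */
enum demo_tstamp_type {
    DEMO_TSTAMP_SND,   /* existing behavior: timestamp taken at send */
    DEMO_TSTAMP_SCHED, /* taken just before entering the packet scheduler */
    DEMO_TSTAMP_ACK    /* taken when the data was acknowledged */
};

/* Map the ee_info value to a printable label for ts[0]. */
static const char *demo_tstamp_name(unsigned int ee_info)
{
    switch (ee_info) {
    case DEMO_TSTAMP_SND:   return "SND";
    case DEMO_TSTAMP_SCHED: return "SCHED";
    case DEMO_TSTAMP_ACK:   return "ACK";
    default:                return "unknown";
    }
}

/* Convert ts[0] to nanoseconds since its clock's epoch. */
static long long demo_ts0_ns(const struct demo_scm_timestamping *t)
{
    assert(t != NULL);
    return (long long)t->ts[0].tv_sec * 1000000000LL + t->ts[0].tv_nsec;
}
```

On the receive path no such type code is available, since the commit notes that rx passes no sock_extended_err alongside SCM_TIMESTAMPING.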
Willem de Bruijn
4d276eb6a4 net: remove deprecated syststamp timestamp
The SO_TIMESTAMPING API defines three types of timestamps: software,
hardware in raw format (hwtstamp) and hardware converted to system
format (syststamp). The last has been deprecated in favor of combining
hwtstamp with a PTP clock driver. There are no active users in the
kernel.

The option was device driver dependent. If set, but without hardware
support, the correct behavior is to return zero in the relevant field
in the SCM_TIMESTAMPING ancillary message. Without device drivers
implementing the option, this field is effectively always zero.

Remove the internal plumbing to dissuade new drivers from implementing
the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
avoid breaking existing applications that request the timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29 11:39:50 -07:00
Jan Glauber
b7c0ddf5f2 net: use SYSCALL_DEFINEx for sys_recv
Make sys_recv a first class citizen by using the SYSCALL_DEFINEx
macro. Besides being cleaner this will also generate meta data
for the system call so tracing tools like ftrace or LTTng can
resolve this system call.

Signed-off-by: Jan Glauber <jan.glauber@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-16 15:15:05 -04:00
Daniel Borkmann
408eccce32 net: ptp: move PTP classifier in its own file
This commit fixes a build error reported by Fengguang, that is
triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

  ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

The fix is to introduce its own file for the PTP BPF classifier,
so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
it independently from each other. IXP4xx driver on ARM needs to
select it as well since it does not seem to select PTP_1588_CLOCK
or similar that would pull it in automatically.

This also allows for hiding all of the internals of the BPF PTP
program inside that file, and only exporting relevant API bits
to drivers.

This patch also adds kdoc documentation for the ptp_classify_raw()
API to make it clear that it can return PTP_CLASS_* defines. Also,
the BPF program has been translated into bpf_asm code, so that it
can be more easily read and altered (extensively documented in [1]).

In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
tools, so the commented program can simply be translated via
`./bpf_asm -c prog` where prog is a file that contains the
commented code. This makes it easily readable/verifiable and when
there's a need to change something, jump offsets etc do not need
to be replaced manually which can be very error prone. Instead,
a newly translated version via bpf_asm can simply replace the old
code. I have checked opcode diffs before/after and it's the very
same filter.

  [1] Documentation/networking/filter.txt

Fixes: 164d8c6665 ("net: ptp: do not reimplement PTP/BPF classifier")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-01 16:43:18 -04:00
David S. Miller
85dcce7a73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	drivers/net/xen-netback/netback.c

Both the r8152 and netback conflicts were simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14 22:31:55 -04:00
Linus Torvalds
53611c0ce9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "I know this is a bit more than you want to see, and I've told the
  wireless folks under no uncertain terms that they must severely scale
  back the extent of the fixes they are submitting this late in the
  game.

  Anyways:

   1) vmxnet3's netpoll should perform the equivalent of an ISR, which
      is the correct implementation.  Instead it does something like a
      NAPI poll operation.  This leads to crashes.

      From Neil Horman and Arnd Bergmann.

   2) Segmentation of SKBs requires proper socket orphaning of the
      fragments, otherwise we might access stale state released by the
      release callbacks.

      This is a 5 patch fix, but the initial patches are giving
      variables and such significantly clearer names such that the
      actual fix itself at the end looks trivial.

      From Michael S. Tsirkin.

   3) TCP control block release can deadlock if invoked from a timer on
      an already "owned" socket.  Fix from Eric Dumazet.

   4) In the bridge multicast code, we must validate that the
      destination address of general queries is the link local all-nodes
      multicast address.  From Linus Lüssing.

   5) The x86 BPF JIT support for negative offsets puts the parameter
      for the helper function call in the wrong register.  Fix from
      Alexei Starovoitov.

   6) The descriptor type used for RTL_GIGA_MAC_VER_17 chips in the
      r8169 driver is incorrect.  Fix from Hayes Wang.

   7) The xen-netback driver tests skb_shinfo(skb)->gso_type bits to see
      if a packet is a GSO frame, but that's not the correct test.  It
      should use skb_is_gso(skb) instead.  Fix from Wei Liu.

   8) Negative msg->msg_namelen values should generate an error, from
      Matthew Leach.

   9) at86rf230 can deadlock because it takes the same lock from its
      ISR and its hard_start_xmit method, without disabling interrupts
      in the latter.  Fix from Alexander Aring.

  10) The FEC driver's restart doesn't perform operations in the correct
      order, so promiscuous settings can get lost.  Fix from Stefan
      Wahren.

  11) Fix SKB leak in SCTP cookie handling, from Daniel Borkmann.

  12) Reference count and memory leak fixes in TIPC from Ying Xue and
      Erik Hugne.

  13) Forced eviction in inet_frag_evictor() must strictly make sure all
      frags are deleted, otherwise module unload (f.e.  6lowpan) can
      crash.  Fix from Florian Westphal.

  14) Remove assumptions in AF_UNIX's use of csum_partial() (which it
      uses as a hash function), which breaks on PowerPC.  From Anton
      Blanchard.

      The main gist of the issue is that csum_partial() is defined only
      as a value that, once folded (f.e.  via csum_fold()) produces a
      correct 16-bit checksum.  It is legitimate, therefore, for
      csum_partial() to produce two different 32-bit values over the
      same data if their respective alignments are different.

  15) Fix endianness bug in MAC address handling of ibmveth driver, also
      from Anton Blanchard.

  16) Error checks for ipv6 exthdrs offload registration are reversed,
      from Anton Nayshtut.

  17) Externally triggered ipv6 addrconf routes should count against the
      garbage collection threshold.  Fix from Sabrina Dubroca.

  18) The PCI shutdown handler added to the bnx2 driver can wedge the
      chip if it was not brought up earlier already, which in particular
      causes the firmware to shut down the PHY.  Fix from Michael Chan.

  19) Adjust the sanity WARN_ON_ONCE() in qdisc_list_add() because as
      currently coded it can and does trigger in legitimate situations.
      From Eric Dumazet.

  20) BNA driver fails to build on ARM because of a too large udelay()
      call, fix from Ben Hutchings.

  21) Fair-Queue qdisc holds locks during GFP_KERNEL allocations, fix
      from Eric Dumazet.

  22) The vlan passthrough ops added in the previous release causes a
      regression in source MAC address setting of outgoing headers in
      some circumstances.  Fix from Peter Boström"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (70 commits)
  ipv6: Avoid unnecessary temporary addresses being generated
  eth: fec: Fix lost promiscuous mode after reconnecting cable
  bonding: set correct vlan id for alb xmit path
  at86rf230: fix lockdep splats
  net/mlx4_en: Deregister multicast vxlan steering rules when going down
  vmxnet3: fix building without CONFIG_PCI_MSI
  MAINTAINERS: add networking selftests to NETWORKING
  net: socket: error on a negative msg_namelen
  MAINTAINERS: Add tools/net to NETWORKING [GENERAL]
  packet: doc: Spelling s/than/that/
  net/mlx4_core: Load the IB driver when the device supports IBoE
  net/mlx4_en: Handle vxlan steering rules for mac address changes
  net/mlx4_core: Fix wrong dump of the vxlan offloads device capability
  xen-netback: use skb_is_gso in xenvif_start_xmit
  r8169: fix the incorrect tx descriptor version
  tools/net/Makefile: Define PACKAGE to fix build problems
  x86: bpf_jit: support negative offsets
  bridge: multicast: enable snooping on general queries only
  bridge: multicast: add sanity check for general query destination
  tcp: tcp_release_cb() should release socket ownership
  ...
2014-03-13 20:38:36 -07:00
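Item 14 of the pull message above says that two different 32-bit csum_partial() results are legitimate as long as they fold to the same 16-bit checksum. The sketch below shows that property with a userspace fold; demo_csum_fold() is a hypothetical helper written for illustration, not the kernel's csum_fold().

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit ones'-complement accumulator to 16 bits and invert it. */
static uint16_t demo_csum_fold(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16); /* add the carries back in */
    sum = (sum & 0xffffu) + (sum >> 16); /* the first add may carry again */
    return (uint16_t)~sum;
}

/* Two distinct 32-bit partial sums that fold to the same checksum. */
static void demo_fold_example(void)
{
    uint32_t a = 0x0001fffeu;
    uint32_t b = 0x0000ffffu;

    assert(a != b);
    assert(demo_csum_fold(a) == demo_csum_fold(b));
}
```

Code that uses the unfolded 32-bit value as a hash key would treat a and b as different, which is the kind of assumption the AF_UNIX fix removes.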
Matthew Leach
dbb490b965 net: socket: error on a negative msg_namelen
When copying in a struct msghdr from the user, if the user has set the
msg_namelen parameter to a negative value it gets clamped to a valid
size due to a comparison between signed and unsigned values.

Ensure the syscall errors when the user passes in a negative value.

Signed-off-by: Matthew Leach <matthew.leach@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 16:29:24 -04:00
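The signed/unsigned comparison described in dbb490b965 can be reproduced in userspace. This is a sketch under assumptions: DEMO_STORAGE_SIZE stands in for sizeof(struct sockaddr_storage), and both functions are hypothetical; the explicit size_t casts spell out the conversion the compiler performs implicitly when an int is compared against a sizeof value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_STORAGE_SIZE 128 /* stand-in for sizeof(struct sockaddr_storage) */

/*
 * Buggy shape: the int is compared as size_t, so -1 becomes a huge
 * value and is silently clamped instead of being rejected.
 */
static int demo_namelen_unchecked(int namelen)
{
    if ((size_t)namelen > (size_t)DEMO_STORAGE_SIZE)
        namelen = DEMO_STORAGE_SIZE;
    return namelen;
}

/* Fixed shape: reject negative lengths first, then clamp oversized ones. */
static int demo_namelen_checked(int namelen)
{
    if (namelen < 0)
        return -EINVAL;
    if ((size_t)namelen > (size_t)DEMO_STORAGE_SIZE)
        namelen = DEMO_STORAGE_SIZE;
    return namelen;
}
```

The checked variant still clamps oversized positive values rather than failing on them.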
Al Viro
00e188ef6a sockfd_lookup_light(): switch to fdget^W^Waway from fget_light
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:41 -04:00
Yang Yingliang
3410f22ea9 socket: replace some printk with pr_*
Prefer pr_*(...) to printk(KERN_* ...).

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:15:10 -05:00
Yann Droneaud
d73aa2867f net: handle error more gracefully in socketpair()
This patch makes socketpair() use error paths which do not
rely on a heavy-weight call to sys_close(): it's better to try
to push the file descriptor to userspace before installing
the socket file to the file descriptor, so that errors are
caught earlier and are easier to handle.

Using sys_close() seems to be the exception, while writing the
file descriptor before installing it looks like it's more or less
the norm: e.g. except for code used in init/, error handling
involves fput() and put_unused_fd(), but not sys_close().

This makes socketpair() usage of sys_close() quite unusual.
So it deserves to be replaced by the common pattern relying on
fput() and put_unused_fd() just like, for example, the one used
in pipe(2) or recvmsg(2).

Three distinct error paths are still needed since calling
fput() on file structure returned by sock_alloc_file() will
implicitly call sock_release() on the associated socket
structure.

Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=1385979146-13825-1-git-send-email-ydroneaud@opteya.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:24:13 -05:00
David S. Miller
426e1fa31e Merge branch 'siocghwtstamp' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next
Ben Hutchings says:

====================
SIOCGHWTSTAMP ioctl

1. Add the SIOCGHWTSTAMP ioctl and update the timestamping
documentation.
2. Implement SIOCGHWTSTAMP in most drivers that support SIOCSHWTSTAMP.
3. Add a test program to exercise SIOC{G,S}HWTSTAMP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-05 19:45:14 -05:00
Dan Carpenter
db31c55a6f net: clamp ->msg_namelen instead of returning an error
If kmsg->msg_namelen > sizeof(struct sockaddr_storage) then in the
original code that would lead to memory corruption in the kernel if you
had audit configured.  If you didn't have audit configured it was
harmless.

There are some programs such as beta versions of Ruby which use too
large of a buffer and returning an error code breaks them.  We should
clamp the ->msg_namelen value instead.

Fixes: 1661bf364a ("net: heap overflow in __audit_sockaddr()")
Reported-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Eric Wong <normalperson@yhbt.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:12:52 -05:00
Hannes Frederic Sowa
68c6beb373 net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage)
In that case it is probable that kernel code overwrote part of the
stack. So we should bail out loudly here.

The BUG_ON may be removed in future if we are sure all protocols are
conformant.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-20 21:52:30 -05:00
Hannes Frederic Sowa
f3d3342602 net: rework recvmsg handler msg_name and msg_namelen logic
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.

This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.

Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.

Also document these changes in include/linux/net.h as suggested by David
Miller.

Changes since RFC:

Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.

With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
	msg->msg_name = NULL
".

This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.

Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.

Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-20 21:52:30 -05:00
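The contract introduced by f3d3342602 (the caller passes msg_namelen as 0, and a handler sets it only when it actually returns an address) can be modelled in a few lines. Everything here is a hypothetical userspace stand-in: struct demo_msg, demo_handler() and demo_recv() are illustrative names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_msg {
    void *name;    /* where to store the peer address, or NULL if unwanted */
    int   namelen; /* set by the handler; 0 means "no address returned" */
};

/* Hypothetical protocol handler: fills in the address only when asked. */
static void demo_handler(struct demo_msg *msg, const void *peer, int peer_len)
{
    assert(msg != NULL && peer != NULL && peer_len >= 0);
    if (msg->name) {
        memcpy(msg->name, peer, (size_t)peer_len);
        msg->namelen = peer_len;
    }
}

/*
 * Hypothetical caller: always starts from namelen == 0, so a handler that
 * never touches the field cannot cause stale bytes to be copied out.
 * Returns the number of address bytes that would be returned to the user.
 */
static int demo_recv(void *addr_buf, const void *peer, int peer_len)
{
    struct demo_msg msg = { .name = addr_buf, .namelen = 0 };

    demo_handler(&msg, peer, peer_len);
    return msg.namelen;
}
```

Passing NULL as addr_buf models recvfrom() with a NULL address or a plain read(): the handler skips the copy and the reported length stays 0.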
Ben Hutchings
fd468c74bd net_tstamp: Add SIOCGHWTSTAMP ioctl to match SIOCSHWTSTAMP
SIOCSHWTSTAMP returns the real configuration to the application
using it, but there is currently no way for any other
application to find out the configuration non-destructively.
Add a new ioctl for this, making it unprivileged.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-19 19:07:21 +00:00
Ben Hutchings
590d4693fb net/compat: Merge multiple implementations of ifreq::ifr_data conversion
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-18 23:50:13 +00:00
Ben Hutchings
417c3522b3 net/compat: Fix minor information leak in siocdevprivate_ioctl()
We don't need to check that ifr_data itself is a valid user pointer,
but we should check &ifr_data is.  Thankfully the copy of ifr_name is
checked, so this can only leak a few bytes from immediately above the
user address limit.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-18 23:50:12 +00:00
Dan Carpenter
1661bf364a net: heap overflow in __audit_sockaddr()
We need to cap ->msg_namelen or it leads to a buffer overflow when we
do the memcpy() in __audit_sockaddr().  It requires CAP_AUDIT_CONTROL to
exploit this bug.

The call tree is:
___sys_recvmsg()
  move_addr_to_user()
    audit_sockaddr()
      __audit_sockaddr()

Reported-by: Jüri Aedla <juri.aedla@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 16:05:14 -04:00
Linus Torvalds
9bf12df31f Merge git://git.kvack.org/~bcrl/aio-next
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb->ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx->tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -> reqs_available
  aio: fix build when migration is disabled
  ...
2013-09-13 10:55:58 -07:00
Mathieu Desnoyers
3ddc5b46a8 kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()
I found the following pattern that leads to interesting findings:

  grep -r "ret.*|=.*__put_user" *
  grep -r "ret.*|=.*__get_user" *
  grep -r "ret.*|=.*__copy" *

The __put_user() calls in compat_ioctl.c, ptrace compat, and signal compat
appear in compat code, so we could probably expect the kernel addresses
not to be reachable in the lower 32-bit range; I think they might not be
exploitable.

For the "__get_user" cases, I don't think those are exploitable: the worst
that can happen is that the kernel will copy kernel memory into in-kernel
buffers, and will fail immediately afterward.

The alpha csum_partial_copy_from_user() seems to be missing the
access_ok() check entirely.  The fix is inspired by x86.  This could
lead to an information leak on alpha.  I also noticed that many architectures
map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
wonder if the latter is performing the access checks on every
architecture.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:18 -07:00
Cong Wang
e0d1095ae3 net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL
Eliezer renamed several *ll_poll to *busy_poll, but forgot
CONFIG_NET_LL_RX_POLL, so to avoid confusion, rename it too.

Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 15:11:17 -07:00