It looks like the new socket options only work correctly
for native execution, but in case of compat mode fall back
to the old behavior as we ignore the 'old_timeval' flag.
Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the
same way in compat and native 32-bit mode.
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Fixes: a9beb86ae6 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The routine ptp_classifier_init() uses an initializer for an
automatic struct type variable which refers to an __initdata
symbol. This is perfectly legal, but may trigger a section
mismatch warning when running the compiler in -fpic mode, due
to the fact that the initializer may be emitted into an anonymous
.data section thats lack the __init annotation. So work around it
by using assignments instead.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub forgot to either use nlmsg_len() or nlmsg_msg_size(),
allowing KMSAN to detect a possible uninit-value in rtnl_stats_get
BUG: KMSAN: uninit-value in rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997
CPU: 0 PID: 10428 Comm: syz-executor034 Not tainted 5.1.0-rc2+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x173/0x1d0 lib/dump_stack.c:113
kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:619
__msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310
rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997
rtnetlink_rcv_msg+0x115b/0x1550 net/core/rtnetlink.c:5192
netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2485
rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5210
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1925
sock_sendmsg_nosec net/socket.c:622 [inline]
sock_sendmsg net/socket.c:632 [inline]
___sys_sendmsg+0xdb3/0x1220 net/socket.c:2137
__sys_sendmsg net/socket.c:2175 [inline]
__do_sys_sendmsg net/socket.c:2184 [inline]
__se_sys_sendmsg+0x305/0x460 net/socket.c:2182
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2182
do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
entry_SYSCALL_64_after_hwframe+0x63/0xe7
Fixes: 51bc860d4a ("rtnetlink: stats: validate attributes in get as well as dumps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to bpf_bind() is
shorter than sizeof("struct sockaddr"->sa_family) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.
As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.
Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.
It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to listen for the rename
events on failover slaves. Userspace component interacting with slaves
is expected to be changed to operate on failover master interface
instead, as the failover slave is dynamic in nature which may come and
go at any point. The goal is to make the role of failover slaves less
relevant, and userspace components should only deal with failover master
in the long run.
Fixes: 30c8bd5aa8 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-04-04
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Batch of fixes to the existing BPF flow dissector API to support
calling BPF programs from the eth_get_headlen context (support for
latter is planned to be added in bpf-next), from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we may merge incorrectly a received GSO packet
or a packet with frag_list into a packet sitting in the
gro_hash list. skb_segment() may crash case because
the assumptions on the skb layout are not met.
The correct behaviour would be to flush the packet in the
gro_hash list and send the received GSO packet directly
afterwards. Commit d61d072e87 ("net-gro: avoid reorders")
sets NAPI_GRO_CB(skb)->flush in this case, but this is not
checked before merging. This patch makes sure to check this
flag and to not merge in that case.
Fixes: d61d072e87 ("net-gro: avoid reorders")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use whitelist instead of a blacklist and allow only a small set of
fields that might be relevant in the context of flow dissector:
* data
* data_end
* flow_keys
This is required for the eth_get_headlen case where we have only a
chunk of data to dissect (i.e. trying to read the other skb fields
doesn't make sense).
Note, that it is a breaking API change! However, we've provided
flow_keys->n_proto as a substitute for skb->protocol; and there is
no need to manually handle skb->vlan_present. So even if we
break somebody, the migration is trivial. Unfortunately, we can't
support eth_get_headlen use-case without those breaking changes.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Don't allow BPF program to set flow_keys->nhoff to less than initial
value. We currently don't read the value afterwards in anything but
the tests, but it's still a good practice to return consistent
values to the test programs.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This is a preparation for the next commit that would prohibit access to
the most fields of __sk_buff from the BPF programs.
Instead of requiring BPF flow dissector programs to look into skb,
pass all input data in the flow_keys.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
NULL or ZERO_SIZE_PTR will be returned for zero sized memory
request, and derefencing them will lead to a segfault
so it is unnecessory to call vzalloc for zero sized memory
request and not call functions which maybe derefence the
NULL allocated memory
this also fixes a possible memory leak if phy_ethtool_get_stats
returns error, memory should be freed before exit
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Wang Li <wangli39@baidu.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_hash_mix() currently uses kernel address of a struct net,
and is used in many places that could be used to reveal this
address to a patient attacker, thus defeating KASLR, for
the typical case (initial net namespace, &init_net is
not dynamically allocated)
I believe the original implementation tried to avoid spending
too many cycles in this function, but security comes first.
Also provide entropy regardless of CONFIG_NET_NS.
Fixes: 0b4419162a ("netns: introduce the net_hash_mix "salt" for hashes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph reported a stall while peeking datagram with an offset when
busy polling is enabled. __skb_try_recv_datagram() uses as the loop
termination condition 'queue empty'. When peeking, the socket
queue can be not empty, even when no additional packets are received.
Address the issue explicitly checking for receive queue changes,
as currently done by __skb_wait_for_more_packets().
Fixes: 2b5cd0dfa3 ("net: Change return type of sk_busy_loop from bool to void")
Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dumpit, unlike doit, the check for info_get op being defined
is missing. Add it and avoid null pointer dereference in case driver
does not define this op.
Fixes: f9cf22882c ("devlink: add device information API")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netdev_queue_add_kobject and rx_queue_add_kobject,
if sysfs_create_group failed, kobject_put will call
netdev_queue_release to decrease dev refcont, however
dev_hold has not be called. So we will see this while
unregistering dev:
unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: d0d6683716 ("net: don't decrement kobj reference count on init failure")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-03-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a umem memory leak on cleanup in AF_XDP, from Björn.
2) Fix BTF to properly resolve forward-declared enums into their corresponding
full enum definition types during deduplication, from Andrii.
3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus.
4) Fix accessing invalid pointer returned from bpf_tcp_sock() and
bpf_sk_fullsock() after bpf_sk_release() was called, from Martin.
5) Fix generation of load/store DW instructions in PPC JIT, from Naveen.
6) Various fixes in BPF helper function documentation in bpf.h UAPI header
used to bpf-helpers(7) man page, from Quentin.
7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)"
which returns a bpf_sock in TCP_LISTEN state. It will trace back to
the listener sk from a request_sock if possible. It returns NULL
for all other cases.
No reference is taken because the helper ensures the sk is
in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in).
Hence, bpf_sk_release() is unnecessary and the verifier does not
allow bpf_sk_release(listen_sk) to be called either.
The following is also allowed because the bpf_prog is run under
rcu_read_lock():
sk = bpf_sk_lookup_tcp();
/* if (!sk) { ... } */
listen_sk = bpf_get_listener_sock(sk);
/* if (!listen_sk) { ... } */
bpf_sk_release(sk);
src_port = listen_sk->src_port; /* Allowed */
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
can still be accessed after bpf_sk_release(sk).
Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
This patch addresses them together.
A simple reproducer looks like this:
sk = bpf_sk_lookup_tcp();
/* if (!sk) ... */
tp = bpf_tcp_sock(sk);
/* if (!tp) ... */
bpf_sk_release(sk);
snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
The problem is the verifier did not scrub the register's states of
the tcp_sock ptr (tp) after bpf_sk_release(sk).
[ Note that when calling bpf_tcp_sock(sk), the sk is not always
refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
fine for this case. ]
Currently, the verifier does not track if a helper's return ptr (in REG_0)
is "carry"-ing one of its argument's refcount status. To carry this info,
the reg1->id needs to be stored in reg0.
One approach was tried, like "reg0->id = reg1->id", when calling
"bpf_tcp_sock()". The main idea was to avoid adding another "ref_obj_id"
for the same reg. However, overlapping the NULL marking and ref
tracking purpose in one "id" does not work well:
ref_sk = bpf_sk_lookup_tcp();
fullsock = bpf_sk_fullsock(ref_sk);
tp = bpf_tcp_sock(ref_sk);
if (!fullsock) {
bpf_sk_release(ref_sk);
return 0;
}
/* fullsock_reg->id is marked for NOT-NULL.
* Same for tp_reg->id because they have the same id.
*/
/* oops. verifier did not complain about the missing !tp check */
snd_cwnd = tp->snd_cwnd;
Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
scrub all reg states which has a ref_obj_id match. It is done with the
changes in release_reg_references() in this patch.
While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
bpf_sk_fullsock() to avoid these helpers from returning
another ptr. It will make bpf_sk_release(tp) possible:
sk = bpf_sk_lookup_tcp();
/* if (!sk) ... */
tp = bpf_tcp_sock(sk);
/* if (!tp) ... */
bpf_sk_release(tp);
A separate helper "bpf_get_listener_sock()" will be added in a later
patch to do sk_to_full_sk().
Misc change notes:
- To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON. ARG_PTR_TO_SOCKET
is removed from bpf.h since no helper is using it.
- arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
refcounted. All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
take ARG_PTR_TO_SOCK_COMMON.
- check_refcount_ok() ensures is_acquire_function() cannot take
arg_type_may_be_refcounted() as its argument.
- The check_func_arg() can only allow one refcount-ed arg. It is
guaranteed by check_refcount_ok() which ensures at most one arg can be
refcounted. Hence, it is a verifier internal error if >1 refcount arg
found in check_func_arg().
- In release_reference(), release_reference_state() is called
first to ensure a match on "reg->ref_obj_id" can be found before
scrubbing the reg states with release_reg_references().
- reg_is_refcounted() is no longer needed.
1. In mark_ptr_or_null_regs(), its usage is replaced by
"ref_obj_id && ref_obj_id == id" because,
when is_null == true, release_reference_state() should only be
called on the ref_obj_id obtained by a acquire helper (i.e.
is_acquire_function() == true). Otherwise, the following
would happen:
sk = bpf_sk_lookup_tcp();
/* if (!sk) { ... } */
fullsock = bpf_sk_fullsock(sk);
if (!fullsock) {
/*
* release_reference_state(fullsock_reg->ref_obj_id)
* where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
*
* Hence, the following bpf_sk_release(sk) will fail
* because the ref state has already been released in the
* earlier release_reference_state(fullsock_reg->ref_obj_id).
*/
bpf_sk_release(sk);
}
2. In release_reg_references(), the current reg_is_refcounted() call
is unnecessary because the id check is enough.
- The type_is_refcounted() and type_is_refcounted_or_null()
are no longer needed also because reg_is_refcounted() is removed.
Fixes: 655a51e536 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull networking fixes from David Miller:
"First batch of fixes in the new merge window:
1) Double dst_cache free in act_tunnel_key, from Wenxu.
2) Avoid NULL deref in IN_DEV_MFORWARD() by failing early in the
ip_route_input_rcu() path, from Paolo Abeni.
3) Fix appletalk compile regression, from Arnd Bergmann.
4) If SLAB objects reach the TCP sendpage method we are in serious
trouble, so put a debugging check there. From Vasily Averin.
5) Memory leak in hsr layer, from Mao Wenan.
6) Only test GSO type on GSO packets, from Willem de Bruijn.
7) Fix crash in xsk_diag_put_umem(), from Eric Dumazet.
8) Fix VNIC mailbox length in nfp, from Dirk van der Merwe.
9) Fix race in ipv4 route exception handling, from Xin Long.
10) Missing DMA memory barrier in hns3 driver, from Jian Shen.
11) Use after free in __tcf_chain_put(), from Vlad Buslov.
12) Handle inet_csk_reqsk_queue_add() failures, from Guillaume Nault.
13) Return value correction when ip_mc_may_pull() fails, from Eric
Dumazet.
14) Use after free in x25_device_event(), also from Eric"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
gro_cells: make sure device is up in gro_cells_receive()
vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()
net/x25: fix use-after-free in x25_device_event()
isdn: mISDNinfineon: fix potential NULL pointer dereference
net: hns3: fix to stop multiple HNS reset due to the AER changes
ip: fix ip_mc_may_pull() return value
net: keep refcount warning in reqsk_free()
net: stmmac: Avoid one more sometimes uninitialized Clang warning
net: dsa: mv88e6xxx: Set correct interface mode for CPU/DSA ports
rxrpc: Fix client call queueing, waiting for channel
tcp: handle inet_csk_reqsk_queue_add() failures
net: ethernet: sun: Zero initialize class in default case in niu_add_ethtool_tcam_entry
8139too : Add support for U.S. Robotics USR997901A 10/100 Cardbus NIC
fou, fou6: avoid uninit-value in gue_err() and gue6_err()
net: sched: fix potential use-after-free in __tcf_chain_put()
vhost: silence an unused-variable warning
vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock
connector: fix unsafe usage of ->real_parent
vxlan: do not need BH again in vxlan_cleanup()
net: hns3: add dma_rmb() for rx description
...
Daniel Borkmann says:
====================
pull-request: bpf 2019-03-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a crash in AF_XDP's xsk_diag_put_ring() which was passing
wrong queue argument, from Eric.
2) Fix a regression due to wrong test for TCP GSO packets used in
various BPF helpers like NAT64, from Willem.
3) Fix a sk_msg strparser warning which asserts that strparser must
be stopped first, from Jakub.
4) Fix rejection of invalid options/bind flags in AF_XDP, from Björn.
5) Fix GSO in bpf_lwt_push_ip_encap() which must properly set inner
headers and inner protocol, from Peter.
6) Fix a libbpf leak when kernel does not support BTF, from Nikita.
7) Various BPF selftest and libbpf build fixes to make out-of-tree
compilation work and to properly resolve dependencies via fixdep
target, from Stanislav.
8) Fix rejection of invalid ldimm64 imm field, from Daniel.
9) Fix bpf stats sysctl compile warning of unused helper function
proc_dointvec_minmax_bpf_stats() under some configs, from Arnd.
10) Fix couple of warnings about using plain integer as NULL, from Bo.
11) Fix some BPF sample spelling mistakes, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>