Commit Graph

32964 Commits

Author SHA1 Message Date
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Linus Torvalds
9e61d12bac Merge tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "A set of fixes for the scheduler:

   - Fix handling of throttled parents in enqueue_task_fair() completely.

     The recent fix overlooked a corner case where the first iteration
     terminates due to an entity already being on the runqueue which
     makes the list management incomplete and later triggers the
     assertion which checks for completeness.

   - Fix a similar problem in unthrottle_cfs_rq().

   - Show the correct uclamp values in procfs which prints the effective
     value twice instead of requested and effective"

* tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
  sched/debug: Fix requested task uclamp values shown in procfs
  sched/fair: Fix enqueue_task_fair() warning some more
2020-05-24 10:14:58 -07:00
John Fastabend
cac616db39 bpf: Verifier track null pointer branch_taken with JNE and JEQ
Currently, when considering the branches that may be taken for a jump
instruction if the register being compared is a pointer the verifier
assumes both branches may be taken. But, if the jump instruction
is comparing if a pointer is NULL we have this information in the
verifier encoded in the reg->type so we can do better in these cases.
Specifically, these two common cases can be handled.

 * If the instruction is BPF_JEQ and we are comparing against a
   zero value. This test is 'if ptr == 0 goto +X' then using the
   type information in reg->type we can decide if the ptr is not
   null. This allows us to avoid pushing both branches onto the
   stack and instead only use the != 0 case. For example
   PTR_TO_SOCK and PTR_TO_SOCK_OR_NULL encode the null pointer.
   Note if the type is PTR_TO_SOCK_OR_NULL we can not learn anything.
   And also if the value is non-zero we learn nothing because it
   could be any arbitrary value a different pointer for example

 * If the instruction is BPF_JNE and ware comparing against a zero
   value then a similar analysis as above can be done. The test in
   asm looks like 'if ptr != 0 goto +X'. Again using the type
   information if the non null type is set (from above PTR_TO_SOCK)
   we know the jump is taken.

In this patch we extend is_branch_taken() to consider this extra
information and to return only the branch that will be taken. This
resolves a verifier issue reported with C code like the following.
See progs/test_sk_lookup_kern.c in selftests.

 sk = bpf_sk_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
 bpf_printk("sk=%d\n", sk ? 1 : 0);
 if (sk)
   bpf_sk_release(sk);
 return sk ? TC_ACT_OK : TC_ACT_UNSPEC;

In the above the bpf_printk() will resolve the pointer from
PTR_TO_SOCK_OR_NULL to PTR_TO_SOCK. Then the second test guarding
the release will cause the verifier to walk both paths resulting
in the an unreleased sock reference. See verifier/ref_tracking.c
in selftests for an assembly version of the above.

After the above additional logic is added the C code above passes
as expected.

Reported-by: Andrey Ignatov <rdna@fb.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159009164651.6313.380418298578070501.stgit@john-Precision-5820-Tower
2020-05-21 17:44:25 -07:00
Björn Töpel
d20a1676df xsk: Move xskmap.c to net/xdp/
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Andrii Nakryiko
dfeb376dd4 bpf: Prevent mmap()'ing read-only maps as writable
As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to
be frozen and is read-only on BPF program side, because that allows user-space
to actually store a writable view to the page even after it is frozen. This is
exacerbated by BPF verifier making a strong assumption that contents of such
frozen map will remain unchanged. To prevent this, disallow mapping
BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever.

  [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/

Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
2020-05-20 20:21:53 -07:00
Vincent Guittot
39f23ce07b sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter and
add an optimization in the last for_each_sched_entity loop.

Fixes: fe61468b2c (sched/fair: Fix enqueue_task_fair warning)
Reported-by Tao Zhou <zohooouoto@zoho.com.cn>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
2020-05-19 20:34:10 +02:00
Pavankumar Kondeti
ad32bb41fc sched/debug: Fix requested task uclamp values shown in procfs
The intention of commit 96e74ebf8d ("sched/debug: Add task uclamp
values to SCHED_DEBUG procfs") was to print requested and effective
task uclamp values. The requested values printed are read from p->uclamp,
which holds the last effective values. Fix this by printing the values
from p->uclamp_req.

Fixes: 96e74ebf8d ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs")
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org
2020-05-19 20:34:10 +02:00
Phil Auld
b34cb07dde sched/fair: Fix enqueue_task_fair() warning some more
sched/fair: Fix enqueue_task_fair warning some more

The recent patch, fe61468b2c (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list.  In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop. The list is not fixed up because the throttled parent was
already added back to the list by a task enqueue in a parallel child hierarchy.

Address this by calling list_add_leaf_cfs_rq if there are throttled parents
while doing the second for_each_sched_entity loop.

Fixes: fe61468b2c ("sched/fair: Fix enqueue_task_fair warning")
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
2020-05-19 20:34:10 +02:00
Daniel Borkmann
1b66d25361 bpf: Add get{peer, sock}name attach types for sock_addr
As stated in 983695fa67 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.

This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.

The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.

Simple example:

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80

Before; curl's verbose output example, no getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

After; with getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  >  Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.

  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19 11:32:04 -07:00
Daniel Borkmann
2ec0616e87 bpf: Fix check_return_code to only allow [0,1] in trace_iter progs
As per 15d83c4d7c ("bpf: Allow loading of a bpf_iter program") we only
allow a range of [0,1] for return codes. Therefore BPF_TRACE_ITER relies
on the default tnum_range(0, 1) which is set in range var. On recent merge
of net into net-next commit e92888c72f ("bpf: Enforce returning 0 for
fentry/fexit progs") got pulled in and caused a merge conflict with the
changes from 15d83c4d7c. The resolution had a snall hiccup in that it
removed the [0,1] range restriction again so that BPF_TRACE_ITER would
have no enforcement. Fix it by adding it back.

Fixes: da07f52d3c ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2020-05-16 00:48:02 +02:00
David S. Miller
da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Linus Torvalds
f85c1598dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix sk_psock reference count leak on receive, from Xiyu Yang.

 2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.

 3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
    from Maciej Żenczykowski.

 4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
    Abeni.

 5) Fix netprio cgroup v2 leak, from Zefan Li.

 6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.

 7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.

 8) Various bpf probe handling fixes, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
  selftests: mptcp: pm: rm the right tmp file
  dpaa2-eth: properly handle buffer size restrictions
  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
  bpf: Restrict bpf_probe_read{, str}() only to archs where they work
  MAINTAINERS: Mark networking drivers as Maintained.
  ipmr: Add lockdep expression to ipmr_for_each_table macro
  ipmr: Fix RCU list debugging warning
  drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
  net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
  tcp: fix error recovery in tcp_zerocopy_receive()
  MAINTAINERS: Add Jakub to networking drivers.
  MAINTAINERS: another add of Karsten Graul for S390 networking
  drivers: ipa: fix typos for ipa_smp2p structure doc
  pppoe: only process PADT targeted at local interfaces
  selftests/bpf: Enforce returning 0 for fentry/fexit programs
  bpf: Enforce returning 0 for fentry/fexit progs
  net: stmmac: fix num_por initialization
  security: Fix the default value of secid_to_secctx hook
  libbpf: Fix register naming in PT_REGS s390 macros
  ...
2020-05-15 13:10:06 -07:00
Alexei Starovoitov
2c78ee898d bpf: Implement CAP_BPF
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.

That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2020-05-15 17:29:41 +02:00
Daniel Borkmann
b2a5212fb6 bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the
very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on
archs with overlapping address ranges.

While the helpers have been addressed through work in 6ae08ae3de ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need
an option for bpf_trace_printk() as well to fix it.

Similarly as with the helpers, force users to make an explicit choice by adding
%pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding
strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS.
The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended
for other objects aside strings that are probed and printed under tracing, and
reused out of other facilities like bpf_seq_printf() or BTF based type printing.

Existing behavior of %s for current users is still kept working for archs where it
is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
For archs not having this property we fall-back to pick probing under KERNEL_DS as
a sensible default.

Fixes: 8d3b7dce86 ("bpf: add support for %s specifier to bpf_trace_printk()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net
2020-05-15 08:10:36 -07:00
Daniel Borkmann
47cc0ed574 bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
Given bpf_probe_read{,str}() BPF helpers are now only available under
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in
replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range()
as well to avoid hitting the same issue as in 849fa50662 ("bpf/verifier:
refine retval R0 state for bpf_get_stack helper").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net
2020-05-15 08:10:36 -07:00
Daniel Borkmann
0ebeea8ca8 bpf: Restrict bpf_probe_read{, str}() only to archs where they work
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.

To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6ae08ae3de ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").

Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.

However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).

For the remaining archs, they can workaround easily by relying on the
feature probe from bpftool which spills out defines that can be used out
of BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
2020-05-15 08:10:36 -07:00
Jesper Dangaard Brouer
db612f749e xdp: Cpumap redirect use frame_sz and increase skb_tailroom
Knowing the memory size backing the packet/xdp_frame data area, and
knowing it already have reserved room for skb_shared_info, simplifies
using build_skb significantly.

With this change we no-longer lie about the SKB truesize, but more
importantly a significant larger skb_tailroom is now provided, e.g. when
drivers uses a full PAGE_SIZE. This extra tailroom (in linear area) can be
used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce,
see TCP cases where tcp_queue_rcv() can 'eat' skb).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337822.97035.13557959180460986059.stgit@firesoul
2020-05-14 21:21:54 -07:00
David S. Miller
d00f26b623 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional
   helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 20:31:21 -07:00
Andrii Nakryiko
c70f34a8ac bpf: Fix bpf_iter's task iterator logic
task_seq_get_next might stop prematurely if get_pid_task() fails to get
task_struct. Failure to do so doesn't mean that there are no more tasks with
higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c)
does a retry in such case. After this fix, instead of stopping prematurely
after about 300 tasks on my server, bpf_iter program now returns >4000, which
sounds much closer to reality.

Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200514055137.1564581-1-andriin@fb.com
2020-05-14 18:37:32 -07:00
Yonghong Song
e92888c72f bpf: Enforce returning 0 for fentry/fexit progs
Currently, tracing/fentry and tracing/fexit prog
return values are not enforced. In trampoline codes,
the fentry/fexit prog return values are ignored.
Let us enforce it to be 0 to avoid confusion and
allows potential future extension.

This patch also explicitly added return value
checking for tracing/raw_tp, tracing/fmod_ret,
and freplace programs such that these program
return values can be anything. The purpose are
two folds:
 1. to make it explicit about return value expectations
    for these programs in verifier.
 2. for tracing prog_type, if a future attach type
    is added, the default is -ENOTSUPP which will
    enforce to specify return value ranges explicitly.

Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200514053206.1298415-1-yhs@fb.com
2020-05-14 12:50:10 -07:00
Andrii Nakryiko
333291ce50 bpf: Fix bug in mmap() implementation for BPF array map
mmap() subsystem allows user-space application to memory-map region with
initial page offset. This wasn't taken into account in initial implementation
of BPF array memory-mapping. This would result in wrong pages, not taking into
account requested page shift, being memory-mmaped into user-space. This patch
fixes this gap and adds a test for such scenario.

Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com
2020-05-14 12:40:04 -07:00
Linus Torvalds
8c1684bb81 Merge tag 'for-linus-2020-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread fix from Christian Brauner:
 "This contains a single fix for all exported legacy fork helpers to
  block accidental access to clone3() features in the upper 32 bits of
  their respective flags arguments.

  I got Cced on a glibc issue where someone reported consistent failures
  for the legacy clone() syscall on ppc64le when sign extension was
  performed (since the clone() syscall in glibc exposes the flags
  argument as an int whereas the kernel uses unsigned long).

  The legacy clone() syscall is odd in a bunch of ways and here two
  things interact:

   - First, legacy clone's flag argument is word-size dependent, i.e.
     it's an unsigned long whereas most system calls with flag arguments
     use int or unsigned int.

   - Second, legacy clone() ignores unknown and deprecated flags.

  The two of them taken together means that users on 64bit systems can
  pass garbage for the upper 32bit of the clone() syscall since forever
  and things would just work fine.

  The following program compiled on a 64bit kernel prior to v5.7-rc1
  will succeed and will fail post v5.7-rc1 with EBADF:

    int main(int argc, char *argv[])
    {
          pid_t pid;

          /* Note that legacy clone() has different argument ordering on
           * different architectures so this won't work everywhere.
           *
           * Only set the upper 32 bits.
           */
          pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                        NULL, NULL, NULL, NULL);
          if (pid < 0)
                  exit(EXIT_FAILURE);
          if (pid == 0)
                  exit(EXIT_SUCCESS);
          if (wait(NULL) != pid)
                  exit(EXIT_FAILURE);

          exit(EXIT_SUCCESS);
    }

  Since legacy clone() couldn't be extended this was not a problem so
  far and nobody really noticed or cared since nothing in the kernel
  ever bothered to look at the upper 32 bits.

  But once we introduced clone3() and expanded the flag argument in
  struct clone_args to 64 bit we opened this can of worms. With the
  first flag-based extension to clone3() making use of the upper 32 bits
  of the flag argument we've effectively made it possible for the legacy
  clone() syscall to reach clone3() only flags on accident. The sign
  extension scenario is just the odd corner-case that we needed to
  figure this out.

  The reason we just realized this now and not already when we
  introduced CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that
  a valid cgroup file descriptor has been given - whereas
  CLONE_CLEAR_SIGHAND doesn't need to verify anything. It just silently
  resets the signal handlers to SIG_DFL.

  So the sign extension (or the user accidently passing garbage for the
  upper 32 bits) caused the CLONE_INTO_CGROUP bit to be raised and the
  kernel to error out when it didn't find a valid cgroup file
  descriptor.

  Note, I'm also capping kernel_thread()'s flag argument mainly because
  none of the new features make sense for kernel_thread() and we
  shouldn't risk them being accidently activated however unlikely. If we
  wanted to, we could even make kernel_thread() yell when an unknown
  flag has been set which it doesn't do right now. But it's not worth
  risking this in a bugfix imho"

* tag 'for-linus-2020-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fork: prevent accidental access to clone3 features
2020-05-14 11:52:28 -07:00
Linus Torvalds
f44d5c4890 Merge tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing fixes from Steven Rostedt:
 "Various tracing fixes:

   - Fix a crash when having function tracing and function stack tracing
     on the command line.

     The ftrace trampolines are created as executable and read only. But
     the stack tracer tries to modify them with text_poke() which
     expects all kernel text to still be writable at boot. Keep the
     trampolines writable at boot, and convert them to read-only with
     the rest of the kernel.

   - A selftest was triggering in the ring buffer iterator code, that is
     no longer valid with the update of keeping the ring buffer writable
     while a iterator is reading.

     Just bail after three failed attempts to get an event and remove
     the warning and disabling of the ring buffer.

   - While modifying the ring buffer code, decided to remove all the
     unnecessary BUG() calls"

* tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Remove all BUG() calls
  ring-buffer: Don't deactivate the ring buffer on failed iterator reads
  x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
2020-05-14 11:46:52 -07:00
Steven Rostedt (VMware)
da4d401a6b ring-buffer: Remove all BUG() calls
There's a lot of checks to make sure the ring buffer is working, and if an
anomaly is detected, it safely shuts itself down. But there's a few cases
that it will call BUG(), which defeats the point of being safe (it crashes
the kernel when an anomaly is found!). There's no reason for them. Switch
them all to either WARN_ON_ONCE() (when no ring buffer descriptor is present),
or to RB_WARN_ON() (when a ring buffer descriptor is present).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14 08:51:02 -04:00
Steven Rostedt (VMware)
3d2353de81 ring-buffer: Don't deactivate the ring buffer on failed iterator reads
If the function tracer is running and the trace file is read (which uses the
ring buffer iterator), the iterator can get in sync with the writes, and
caues it to fail to find a page with content it can read three times. This
causes a warning and deactivation of the ring buffer code.

Looking at the other cases of failure to get an event, it appears that
there's a chance that the writer could cause them too. Since the iterator is
a "best effort" to read the ring buffer if there's an active writer (the
consumer reader is made for this case "see trace_pipe"), if it fails to get
an event after three tries, simply give up and return NULL. Don't warn, nor
disable the ring buffer on this failure.

Link: https://lore.kernel.org/r/20200429090508.GG5770@shao2-debian

Reported-by: kernel test robot <lkp@intel.com>
Fixes: ff84c50cfb ("ring-buffer: Do not die if rb_iter_peek() fails more than thrice")
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14 08:50:51 -04:00