Commit Graph

24315 Commits

Author SHA1 Message Date
Hannes Frederic Sowa
d24f7c7fb9 bpf: bpf_lock on kallsysms doesn't need to be irqsave
Hannes rightfully spotted that the bpf_lock doesn't need to be
irqsave variant. We never perform any such updates where this
would be necessary (neither right now nor in future), therefore
relax this further.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 15:48:14 -04:00
David S. Miller
b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Teng Qin
8fe4592438 bpf: map_get_next_key to return first key on NULL
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:57:45 -04:00
Daniel Borkmann
e390b55d5a bpf: make bpf_xdp_adjust_head support mandatory
Now that also the last in-tree user of the xdp_adjust_head bit has
been removed, we can remove the flag from struct bpf_prog altogether.

This, at the same time, also makes sure that any future driver for
XDP comes with bpf_xdp_adjust_head() support right away.

A rejection based on this flag would also mean that tail calls
couldn't be used with such driver as per c2002f9837 ("bpf: fix
checking xdp_adjust_head on tail calls") fix, thus lets not allow
for it in the first place.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 16:18:10 -04:00
Linus Torvalds
fa8d7cdc84 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "The (hopefully) final fix for the irq affinity spreading logic"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/affinity: Fix calculating vectors to assign
2017-04-23 12:48:05 -07:00
David S. Miller
fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
David S. Miller
1f4407e254 net: Remove NET_CORE_BUDGET_USECS from sysctl binary interface.
We are not supposed to add new entries to this thing
any more.

Thanks to Eric Dumazet for noticing this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:59:52 -04:00
Matthew Whitehead
7acf8a1e8a Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.

Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:22:34 -04:00
Linus Torvalds
160062e190 Merge tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull two more ftrace fixes from Steven Rostedt:
 "While continuing my development, I uncovered two more small bugs.

  One is a race condition when enabling the snapshot function probe
  trigger. It enables the probe before allocating the snapshot, and if
  the probe triggers first, it stops tracing with a warning that the
  snapshot buffer was not allocated.

  The seconds is that the snapshot file should show how to use it when
  it is empty. But a bug fix from long ago broke the "is empty" test and
  the snapshot file no longer displays the help message"

* tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Have ring_buffer_iter_empty() return true when empty
  tracing: Allocate the snapshot buffer before enabling probe
2017-04-20 12:30:10 -07:00
David S. Miller
7b9f6da175 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 10:35:33 -04:00
Keith Busch
b72f8051f3 genirq/affinity: Fix calculating vectors to assign
The vectors_per_node is calculated from the remaining available vectors.
The current vector starts after pre_vectors, so we need to subtract that
from the current to properly account for the number of remaining vectors
to assign.

Fixes: 3412386b53 ("irq/affinity: Fix extra vecs calculation")
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Link: http://lkml.kernel.org/r/1492645870-13019-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-20 16:03:09 +02:00
Steven Rostedt (VMware)
78f7a45dac ring-buffer: Have ring_buffer_iter_empty() return true when empty
I noticed that reading the snapshot file when it is empty no longer gives a
status. It suppose to show the status of the snapshot buffer as well as how
to allocate and use it. For example:

 ># cat snapshot
 # tracer: nop
 #
 #
 # * Snapshot is allocated *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

But instead it just showed an empty buffer:

 ># cat snapshot
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 0/0   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |

What happened was that it was using the ring_buffer_iter_empty() function to
see if it was empty, and if it was, it showed the status. But that function
was returning false when it was empty. The reason was that the iter header
page was on the reader page, and the reader page was empty, but so was the
buffer itself. The check only tested to see if the iter was on the commit
page, but the commit page was no longer pointing to the reader page, but as
all pages were empty, the buffer is also.

Cc: stable@vger.kernel.org
Fixes: 651e22f270 ("ring-buffer: Always reset iterator to reader page")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19 21:23:47 -04:00
Steven Rostedt (VMware)
df62db5be2 tracing: Allocate the snapshot buffer before enabling probe
Currently the snapshot trigger enables the probe and then allocates the
snapshot. If the probe triggers before the allocation, it could cause the
snapshot to fail and turn tracing off. It's best to allocate the snapshot
buffer first, and then enable the trigger. If something goes wrong in the
enabling of the trigger, the snapshot buffer is still allocated, but it can
also be freed by the user by writting zero into the snapshot buffer file.

Also add a check of the return status of alloc_snapshot().

Cc: stable@vger.kernel.org
Fixes: 77fd5c15e3 ("tracing: Add snapshot trigger to function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19 14:19:08 -04:00
Linus Torvalds
005882e53d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
 "Two Sparc bug fixes from Daniel Jordan and Nitin Gupta"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix hugepage page table free
  sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
2017-04-18 13:56:51 -07:00
Linus Torvalds
40d9018eb7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) BPF tail call handling bug fixes from Daniel Borkmann.

 2) Fix allowance of too many rx queues in sfc driver, from Bert
    Kenward.

 3) Non-loopback ipv6 packets claiming src of ::1 should be dropped,
    from Florian Westphal.

 4) Statistics requests on KSZ9031 can crash, fix from Grygorii
    Strashko.

 5) TX ring handling fixes in mediatek driver, from Sean Wang.

 6) ip_ra_control can deadlock, fix lock acquisition ordering to fix,
    from Cong WANG.

 7) Fix use after free in ip_recv_error(), from Willem de Buijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  bpf: fix checking xdp_adjust_head on tail calls
  bpf: fix cb access in socket filter programs on tail calls
  ipv6: drop non loopback packets claiming to originate from ::1
  net: ethernet: mediatek: fix inconsistency of port number carried in TXD
  net: ethernet: mediatek: fix inconsistency between TXD and the used buffer
  net: phy: micrel: fix crash when statistic requested for KSZ9031 phy
  net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev rule
  net: thunderx: Fix set_max_bgx_per_node for 81xx rgx
  net-timestamp: avoid use-after-free in ip_recv_error
  ipv4: fix a deadlock in ip_ra_control
  sfc: limit the number of receive queues
2017-04-18 13:24:42 -07:00
Daniel Jordan
395102db44 sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the required 32MB limit, but this
option is not set for every config that enables lockdep.

A 4.10 kernel fails to boot with the console output

    Kernel: Using 8 locked TLB entries for main kernel image.
    hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
    Program terminated

with these config options

    CONFIG_LOCKDEP=y
    CONFIG_LOCK_STAT=y
    CONFIG_PROVE_LOCKING=n

To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.

Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.

Fixes: e6b5f1be7a ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:11:07 -07:00
Linus Torvalds
0bad6d7e93 Merge tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
 "Namhyung Kim discovered a use after free bug. It has to do with adding
  a pid filter to function tracing in an instance, and then freeing the
  instance"

* tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix function pid filter on instances
2017-04-18 09:31:51 -07:00
Namhyung Kim
d879d0b8c1 ftrace: Fix function pid filter on instances
When function tracer has a pid filter, it adds a probe to sched_switch
to track if current task can be ignored.  The probe checks the
ftrace_ignore_pid from current tr to filter tasks.  But it misses to
delete the probe when removing an instance so that it can cause a crash
due to the invalid tr pointer (use-after-free).

This is easily reproducible with the following:

  # cd /sys/kernel/debug/tracing
  # mkdir instances/buggy
  # echo $$ > instances/buggy/set_ftrace_pid
  # rmdir instances/buggy

  ============================================================================
  BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
  Read of size 8 by task kworker/0:1/17
  CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G    B           4.11.0-rc3  #198
  Call Trace:
   dump_stack+0x68/0x9f
   kasan_object_err+0x21/0x70
   kasan_report.part.1+0x22b/0x500
   ? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
   kasan_report+0x25/0x30
   __asan_load8+0x5e/0x70
   ftrace_filter_pid_sched_switch_probe+0x3d/0x90
   ? fpid_start+0x130/0x130
   __schedule+0x571/0xce0
   ...

To fix it, use ftrace_clear_pids() to unregister the probe.  As
instance_rmdir() already updated ftrace codes, it can just free the
filter safely.

Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org

Fixes: 0c8916c342 ("tracing: Add rmdir to remove multibuffer instances")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-17 16:44:23 -04:00
Daniel Borkmann
c2002f9837 bpf: fix checking xdp_adjust_head on tail calls
Commit 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
added the xdp_adjust_head bit to the BPF prog in order to tell drivers
that the program that is to be attached requires support for the XDP
bpf_xdp_adjust_head() helper such that drivers not supporting this
helper can reject the program. There are also drivers that do support
the helper, but need to check for xdp_adjust_head bit in order to move
packet metadata prepended by the firmware away for making headroom.

For these cases, the current check for xdp_adjust_head bit is insufficient
since there can be cases where the program itself does not use the
bpf_xdp_adjust_head() helper, but tail calls into another program that
uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
set to 0. Since the first program has no control over which program it
calls into, we need to assume that bpf_xdp_adjust_head() helper is used
upon tail calls. Thus, for the very same reasons in cb_access, set the
xdp_adjust_head bit to 1 when the main program uses tail calls.

Fixes: 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:51:57 -04:00
Daniel Borkmann
6b1bb01bcc bpf: fix cb access in socket filter programs on tail calls
Commit ff936a04e5 ("bpf: fix cb access in socket filter programs")
added a fix for socket filter programs such that in i) AF_PACKET the
20 bytes of skb->cb[] area gets zeroed before use in order to not leak
data, and ii) socket filter programs attached to TCP/UDP sockets need
to save/restore these 20 bytes since they are also used by protocol
layers at that time.

The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
only look at the actual attached program to determine whether to zero
or save/restore the skb->cb[] parts. There can be cases where the
actual attached program does not access the skb->cb[], but the program
tail calls into another program which does access this area. In such
a case, the zero or save/restore is currently not performed.

Since the programs we tail call into are unknown at verification time
and can dynamically change, we need to assume that whenever the attached
program performs a tail call, that later programs could access the
skb->cb[], and therefore we need to always set cb_access to 1.

Fixes: ff936a04e5 ("bpf: fix cb access in socket filter programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:51:57 -04:00
Martin KaFai Lau
695ba2651a bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4
After doing map_perf_test with a much bigger
BPF_F_NO_COMMON_LRU map, the perf report shows a
lot of time spent in rotating the inactive list (i.e.
__bpf_lru_list_rotate_inactive):
> map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
19644783 (19M/s)
> map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
6283930 (6.28M/s)

By inactive, it usually means the element is not in cache.  Hence,
there is a need to tune the PERCPU_NR_SCANS value.

This patch finds a better number of elements to
scan during each list rotation.  The PERCPU_NR_SCANS (which
is defined the same as PERCPU_FREE_TARGET) decreases
from 16 elements to 4 elements.  This change only
affects the BPF_F_NO_COMMON_LRU map.

The test_lru_dist does not show meaningful difference
between 16 and 4.  Our production L4 load balancer which uses
the LRU map for conntrack-ing also shows little change in cache
hit rate.  Since both benchmark and production data show no
cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
We can consider making it configurable if we find a usecase
later that shows another value works better and/or use
a different rotation strategy.

After this change:
> map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
9240324 (9.2M/s)

i.e. 6.28M/s -> 9.2M/s

The test_lru_dist has not shown meaningful difference:
> test_lru_dist zipf.100k.a1_01.out 4000 1:
nr_misses: 31575 (Before) vs 31566 (After)

> test_lru_dist zipf.100k.a0_01.out 40000 1
nr_misses: 67036 (Before) vs 67031 (After)

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Linus Torvalds
11c994d9a5 Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "Unfortunately, the commit to fix the cgroup mount race in the previous
  pull request can lead to hangs.

  The original bug has been around for a while and isn't too likely to
  be triggered in usual use cases. Revert the commit for now"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
2017-04-16 11:48:10 -07:00
Linus Torvalds
48538861b9 Merge tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
 "While rewriting the function probe code, I stumbled over a long
  standing bug. This bug has been there sinc function tracing was added
  way back when. But my new development depends on this bug being fixed,
  and it should be fixed regardless as it causes ftrace to disable
  itself when triggered, and a reboot is required to enable it again.

  The bug is that the function probe does not disable itself properly if
  there's another probe of its type still enabled. For example:

     # cd /sys/kernel/debug/tracing
     # echo schedule:traceoff > set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter
     # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter

  The above registers two traceoff probes (one for schedule and one for
  do_IRQ, and then removes do_IRQ.

  But since there still exists one for schedule, it is not done
  properly. When adding do_IRQ back, the breakage in the accounting is
  noticed by the ftrace self tests, and it causes a warning and disables
  ftrace"

* tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix removing of second function probe
2017-04-16 10:01:34 -07:00
Tejun Heo
330c418638 Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
This reverts commit bfb0b80db5.

Andrei reports CRIU test hangs with the patch applied.  The bug fixed
by the patch isn't too likely to trigger in actual uses.  Revert the
patch for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Link: http://lkml.kernel.org/r/20170414232737.GC20350@outlook.office365.com
2017-04-16 23:17:37 +09:00
David S. Miller
6b6cbc1471 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simply overlapping changes.  In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15 21:16:30 -04:00