Besides others, move bpf_tail_call_proto to the remaining definitions
of other protos, improve comments a bit (i.e. remove some obvious ones,
where the code is already self-documenting, add objectives for others),
simplify bpf_prog_array_compatible() a bit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As this is already exported from tracing side via commit d9847d310a
("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
as well want to move it to the core, so also networking users can make
use of it, e.g. to measure diffs for certain flows from ingress/egress.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally the program attachment place (like sockets, qdiscs) takes
care of rcu protection and calls bpf_prog_put() after a grace period.
The programs stored inside prog_array may not be attached anywhere,
so prog_array needs to take care of preserving rcu protection.
Otherwise bpf_tail_call() will race with bpf_prog_put().
To solve that introduce bpf_prog_put_rcu() helper function and use
it in 3 places where unattached program can decrement refcnt:
closing program fd, deleting/replacing program in prog_array.
Fixes: 04fd61ab36 ("bpf: allow bpf programs to tail-call other bpf programs")
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
"Three small fixes that have been picked up the last few weeks.
Specifically:
- Fix a memory corruption issue in NVMe with malignant user
constructed request. From Christoph.
- Kill (now) unused blk_queue_bio(), dm was changed to not need this
anymore. From Mike Snitzer.
- Always use blk_schedule_flush_plug() from the io_schedule() path
when flushing a plug, fixing a !TASK_RUNNING warning with md. From
Shaohua"
* 'for-linus' of git://git.kernel.dk/linux-block:
sched: always use blk_schedule_flush_plug in io_schedule_out
nvme: fix kernel memory corruption with short INQUIRY buffers
block: remove export for blk_queue_bio
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ab992dc38f ("watchdog: Fix merge 'conflict'") has introduced an
obvious deadlock because of a typo. watchdog_proc_mutex should be
unlocked on exit.
Thanks to Miroslav Benes who was staring at the code with me and noticed
this.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Duh-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two watchdog changes that came through different trees had a non
conflicting conflict, that is, one changed the semantics of a variable
but no actual code conflict happened. So the merge appeared fine, but
the resulting code did not behave as expected.
Commit 195daf665a ("watchdog: enable the new user interface of the
watchdog mechanism") changes the semantics of watchdog_user_enabled,
which thereafter is only used by the functions introduced by
b3738d2932 ("watchdog: Add watchdog enable/disable all functions").
There further appears to be a distinct lack of serialization between
setting and using watchdog_enabled, so perhaps we should wrap the
{en,dis}able_all() things in watchdog_proc_mutex.
This patch fixes a s2r failure reported by Michal; which I cannot
readily explain. But this does make the code internally consistent
again.
Reported-and-tested-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a suspend/resume related regression fix, and an RT priority
boosting fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix regression in cpuset_cpu_inactive() for suspend
sched: Handle priority boosted tasks proper in setscheduler()
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also a lockdep annotation fix, a PMU event
list fix and a new model addition"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/liblockdep: Fix compilation error
tools/liblockdep: Fix linker error in case of cross compile
perf tools: Use getconf to determine number of online CPUs
tools: Fix tools/vm build
perf/x86/rapl: Enable Broadwell-U RAPL support
perf/x86/intel: Fix SLM cache event list
perf: Annotate inherited event ctx->mutex recursion
Four minor merge conflicts:
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull irq updates from Thomas Gleixner:
"Two patches from the irq departement:
- a simple fix to make dummy_irq_chip usable for wakeup scenarios
- removal of the gic arch_extn hackery. Now that all users are
converted we really want to get rid of the interface so people wont
come up with new use cases"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: gic: Drop support for gic_arch_extn
genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip
Pull timer fix from Thomas Gleixner:
"A simple fix to actually shut down a detached device instead of
keeping it active"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevents: Shutdown detached clockevent device
Seccomp has always been a special candidate when it comes to preparation
of its filters in seccomp_prepare_filter(). Due to the extra checks and
filter rewrite it partially duplicates code and has BPF internals exposed.
This patch adds a generic API inside the BPF code code that seccomp can use
and thus keep it's filter preparation code minimal and better maintainable.
The other side-effect is that now classic JITs can add seccomp support as
well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.
Tested with seccomp and BPF test suites.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the calls to bpf_check_classic(), bpf_convert_filter() and
bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
instead.
seccomp_check_filter() is passed to bpf_prepare_filter() so that it
gets called from there, after bpf_check_classic().
We can now remove exposure of two internal classic BPF functions
previously used by seccomp. The export of bpf_check_classic() symbol,
previously known as sk_chk_filter(), was there since pre git times,
and no in-tree module was using it, therefore remove it.
Joint work with Daniel Borkmann.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull tracing fix from Steven Rostedt:
"The newly added ftrace_print_array_seq() function had a bug in it.
Luckily, the only user of it didn't make the 4.1 merge window.
But the helper function should be fixed before 4.2 when the users
start coming in"
* tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Make ftrace_print_array_seq compute buf_len
Commit 3c18d447b3 ("sched/core: Check for available DL bandwidth in
cpuset_cpu_inactive()"), a SCHED_DEADLINE bugfix, had a logic error that
caused a regression in setting a CPU inactive during suspend. I ran into
this when a program was failing pthread_setaffinity_np() with EINVAL after
a suspend+wake up.
A simple reproducer:
$ ./a.out
sched_setaffinity: Success
$ systemctl suspend
$ ./a.out
sched_setaffinity: Invalid argument
... where ./a.out is:
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
long num_cores;
cpu_set_t cpu_set;
int ret;
num_cores = sysconf(_SC_NPROCESSORS_ONLN);
CPU_ZERO(&cpu_set);
CPU_SET(num_cores - 1, &cpu_set);
errno = 0;
ret = sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
perror("sched_setaffinity");
return ret ? EXIT_FAILURE : EXIT_SUCCESS;
}
The mistake is that suspend is handled in the action ==
CPU_DOWN_PREPARE_FROZEN case of the switch statement in
cpuset_cpu_inactive().
However, the commit in question masked out CPU_TASKS_FROZEN
from the action, making this case dead.
The fix is straightforward.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3c18d447b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()")
Link: http://lkml.kernel.org/r/1cb5ecb3d6543c38cce5790387f336f54ec8e2bc.1430733960.git.osandov@osandov.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ronny reported that the following scenario is not handled correctly:
T1 (prio = 10)
lock(rtmutex);
T2 (prio = 20)
lock(rtmutex)
boost T1
T1 (prio = 20)
sys_set_scheduler(prio = 30)
T1 prio = 30
....
sys_set_scheduler(prio = 10)
T1 prio = 30
The last step is wrong as T1 should now be back at prio 20.
Commit c365c292d0 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted tasks tries to lower its
priority.
Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.
Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292d0 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU fix from Ingo Molnar:
"An RCU Kconfig fix that eliminates an annoying interactive kconfig
question for CONFIG_RCU_TORTURE_TEST_SLOW_INIT"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Control grace-period delays directly from value
Pull networking fixes from David Miller:
1) Receive packet length needs to be adjust by 2 on RX to accomodate
the two padding bytes in altera_tse driver. From Vlastimil Setka.
2) If rx frame is dropped due to out of memory in macb driver, we leave
the receive ring descriptors in an undefined state. From Punnaiah
Choudary Kalluri
3) Some netlink subsystems erroneously signal NLM_F_MULTI. That is
only for dumps. Fix from Nicolas Dichtel.
4) Fix mis-use of raw rt->rt_pmtu value in ipv4, one must always go via
the ipv4_mtu() helper. From Herbert Xu.
5) Fix null deref in bridge netfilter, and miscalculated lengths in
jump/goto nf_tables verdicts. From Florian Westphal.
6) Unhash ping sockets properly.
7) Software implementation of BPF divide did 64/32 rather than 64/64
bit divide. The JITs got it right. Fix from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
ipv4: Missing sk_nulls_node_init() in ping_unhash().
net: fec: Fix RGMII-ID mode
net/mlx4_en: Schedule napi when RX buffers allocation fails
netxen_nic: use spin_[un]lock_bh around tx_clean_lock
net/mlx4_core: Fix unaligned accesses
mlx4_en: Use correct loop cursor in error path.
cxgb4: Fix MC1 memory offset calculation
bnx2x: Delay during kdump load
net: Fix Kernel Panic in bonding driver debugfs file: rlb_hash_table
net: dsa: Fix scope of eeprom-length property
net: macb: Fix race condition in driver when Rx frame is dropped
hv_netvsc: Fix a bug in netvsc_start_xmit()
altera_tse: Correct rx packet length
mlx4: Fix tx ring affinity_mask creation
tipc: fix problem with parallel link synchronization mechanism
tipc: remove wrong use of NLM_F_MULTI
bridge/nl: remove wrong use of NLM_F_MULTI
bridge/mdb: remove wrong use of NLM_F_MULTI
net: sched: act_connmark: don't zap skb->nfct
trivial: net: systemport: bcmsysport.h: fix 0x0x prefix
...
Pull power management and ACPI fixes from Rafael Wysocki:
"Three regression fixes this time, one for a recent regression in the
cpuidle core affecting multiple systems, one for an inadvertently
added duplicate typedef in ACPICA that breaks compilation with GCC 4.5
and one for an ACPI Smart Battery Subsystem driver regression
introduced during the 3.18 cycle (stable-candidate).
Specifics:
- Fix for a regression in the cpuidle core introduced by one of the
recent commits in the clockevents_notify() removal series that put
a call to a function which had to be executed with disabled
interrupts into a code path running with enabled interrupts (Rafael
J Wysocki)
- Fix for a build problem in ACPICA (with GCC 4.5) introduced by one
of the recent ACPICA tools commits that added a duplicate typedef
to one of the ACPICA's header files by mistake (Olaf Hering)
- Fix for a regression in the ACPI SBS (Smart Battery Subsystem)
driver introduced during the 3.18 development cycle causing the
smart battery manager to be marked as not present when it should be
marked as present (Chris Bainbridge)"
* tag 'pm+acpi-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: Run tick_broadcast_exit() with disabled interrupts
ACPI / SBS: Enable battery manager when present
ACPICA: remove duplicate u8 typedef
Pull kvm changes from Paolo Bonzini:
"Remove from guest code the handling of task migration during a pvclock
read; instead use the correct protocol in KVM.
This removes the need for task migration notifiers in core scheduler
code"
[ The scheduler people really hated the migration notifiers, so this was
kind of required - Linus ]
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
x86: pvclock: Really remove the sched notifier for cross-cpu migrations
kvm: x86: fix kvmclock update protocol