Commit Graph

45999 Commits

Author SHA1 Message Date
Peter Zijlstra
fe9beaaa80 sched: No PREEMPT_RT=y for all{yes,mod}config
While PREEMPT_RT is undoubtedly totally awesome, it does not, at this
time, make sense to have all{yes,mod}config select it.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 35772d627b ("sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-11-07 15:25:05 +01:00
Peter Zijlstra
35772d627b sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT
from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is
not a change in how preemption works, but rather it makes a ton more
code preemptible.

Notably, take away the NONE and VOLUNTARY options for PREEMPT_RT; they make
no sense (but are technically possible).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.441622332@infradead.org
2024-11-05 12:55:38 +01:00
Peter Zijlstra
7c70cb94d2 sched: Add Lazy preemption model
Change the fair class to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.

This LAZY bit will be promoted to the full NEED_RESCHED bit on tick.
As such, the average delay between setting LAZY and actually
rescheduling will be TICK_NSEC/2.

In short, Lazy preemption will delay preemption for fair class but
will function as Full preemption for all the other classes, most
notably the realtime (RR/FIFO/DEADLINE) classes.

The goal is to bridge the performance gap with Voluntary, such that we
might eventually remove that option entirely.
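
To make the promotion concrete, here is a minimal C sketch of the
mechanism described above (a hedged sketch, not the actual scheduler
code; only the flag and helper names mirror the kernel's):

  /* Fair-class resched requests only set the LAZY bit. */
  static void resched_curr_lazy(struct task_struct *curr)
  {
          set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
  }

  /* On the tick, a pending LAZY bit is promoted to a full reschedule,
   * which is why the average added delay is TICK_NSEC/2. */
  static void promote_lazy_on_tick(struct task_struct *curr)
  {
          if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                  set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
  }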

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
2024-11-05 12:55:38 +01:00
Peter Zijlstra
26baa1f1c4 sched: Add TIF_NEED_RESCHED_LAZY infrastructure
Add the basic infrastructure to split the TIF_NEED_RESCHED bit in two.
Either bit will cause a resched on return-to-user, but only
TIF_NEED_RESCHED will drive IRQ preemption.

No behavioural change intended.
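
As a hedged sketch of how the two bits are meant to be consumed (not
the actual entry code):

  /* Return-to-user: either bit triggers a reschedule. */
  static inline bool need_resched_on_user_return(unsigned long ti_flags)
  {
          return ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
  }

  /* IRQ return to kernel: only the full bit drives preemption. */
  static inline bool need_resched_on_irq_return(unsigned long ti_flags)
  {
          return ti_flags & _TIF_NEED_RESCHED;
  }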

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.219540785@infradead.org
2024-11-05 12:55:37 +01:00
Thomas Gleixner
0f0d1b8e50 sched/ext: Remove sched_fork() hack
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. An analysis and a solution that were provided have been
ignored for four months.

Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
2024-11-05 12:55:37 +01:00
Thomas Gleixner
b23decf8ac sched: Initialize idle tasks only once
Idle tasks are initialized via __sched_fork() twice:

     fork_idle()
        copy_process()
           sched_fork()
              __sched_fork()
        init_idle()
           __sched_fork()

Instead of cleaning this up, sched_ext hacked around it. Even when an
analysis and a solution were provided in a discussion, nobody cared to
clean this up.

init_idle() is also invoked from sched_init() to initialize the boot CPU's
idle task, which requires the __sched_fork() invocation. But this can be
trivially solved by invoking __sched_fork() before init_idle() in
sched_init() and removing the __sched_fork() invocation from init_idle().

Do so and clean up the comments explaining this historical leftover.
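
A simplified sketch of the resulting shape (hedged; most of
sched_init() elided):

  void __init sched_init(void)
  {
          /* ... */
          __sched_fork(0, current);               /* invoked here now ...   */
          init_idle(current, smp_processor_id()); /* ... and no longer here */
  }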

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de
2024-11-05 12:55:37 +01:00
Johannes Weiner
1a6151017e sched: psi: pass enqueue/dequeue flags to psi callbacks directly
What psi needs to do on each enqueue and dequeue has gotten more
subtle, and the generic sched code trying to distill this into a bool
for the callbacks is awkward.

Pass the flags directly and let psi parse them. For that to work, the
#include "stats.h" (which has the psi callback implementations) needs
to be below the flag definitions in "sched.h". Move that section
further down, next to some of the other accounting stuff.

This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump
label, slightly reducing overhead when PSI=y but runtime disabled.
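
The shape of the interface change, as a hedged sketch (the flag
parsing shown is illustrative, not the verbatim patch):

  /* Before: the scheduler reduced the flags to a bool for psi. */
  static inline void psi_enqueue(struct task_struct *p, bool migrate);

  /* After: hand psi the ENQUEUE_* flags and let it parse them. */
  static inline void psi_enqueue(struct task_struct *p, int flags)
  {
          if (flags & ENQUEUE_RESTORE)    /* save/restore cycle: skip */
                  return;
          /* ... derive sleep/migration state from the flags ... */
  }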

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241014144358.GB1021@cmpxchg.org
2024-10-26 09:28:38 +02:00
Christian Loehle
23f1178ad7 sched/uclamp: Fix unused variable warning
uclamp_mutex is only used for CONFIG_SYSCTL or
CONFIG_UCLAMP_TASK_GROUP, so declare it __maybe_unused.
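
For reference, the annotation looks like this (a sketch of the fix,
not the verbatim patch):

  /* __maybe_unused (__attribute__((__unused__))) silences the warning
   * in configs that never reference the mutex. */
  static __maybe_unused DEFINE_MUTEX(uclamp_mutex);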

Closes: https://lore.kernel.org/oe-kbuild-all/202410060258.bPl2ZoUo-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202410250459.EJe6PJI5-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/a1e9c342-01c9-44f0-a789-2c908e57942b@arm.com
2024-10-26 09:28:37 +02:00
Ingo Molnar
d1fb8a78b2 Merge tag 'v6.12-rc4' into sched/core, to resolve conflict
Overlapping fixes solving the same bug slightly differently:

  7266f0a6d3 fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry
  3b80552e70 bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry

Use the upstream version.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-10-21 08:14:15 +02:00
Linus Torvalds
2b4d25010d Merge tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduling fixes from Borislav Petkov:

 - Add PREEMPT_RT maintainers

 - Fix another aspect of delayed dequeued tasks wrt determining their
   state, i.e., whether they're runnable or blocked

 - Handle delayed dequeued tasks and their migration wrt PSI properly

 - Fix the situation where a delayed dequeue task gets enqueued into a
   new class, which should not happen

 - Fix a case where memory allocation would happen while the runqueue
   lock is held, which is a no-no

 - Do not over-schedule when tasks with shorter slices preempt the
   currently running task

 - Make sure delayed-dequeue entities are properly handled before
   unthrottling

 - Other smaller cleanups and improvements

* tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Add an entry for PREEMPT_RT.
  sched/fair: Fix external p->on_rq users
  sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
  sched/core: Dequeue PSI signals for blocked tasks that are delayed
  sched: Fix delayed_dequeue vs switched_from_fair()
  sched/core: Disable page allocation in task_tick_mm_cid()
  sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
  sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
  sched: Fix sched_delayed vs cfs_bandwidth
2024-10-20 11:30:56 -07:00
Linus Torvalds
06526daaff Merge tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace fixes from Steven Rostedt:
 "A couple of fixes to function graph infrastructure:

   - Fix allocation of idle shadow stacks during hotplug

     If function graph tracing is started while a CPU is offline, and
     that CPU later comes online during the trace, the idle task that
     represents the CPU will not get a shadow stack allocated for it.
     This means all function graph hooks that happen while that idle
     task is running (including in interrupt mode) will have their
     events dropped.

     Switch over to the CPU hotplug mechanism that will have any newly
     brought on line CPU get a callback that can allocate the shadow
     stack for its idle task.

   - Fix allocation size of the ret_stack_list array

     When function graph tracing converted over to allowing more than
     one user at a time, it had to convert its shadow stack from an
     array of ret_stack structures to an array of unsigned longs. The
     shadow stacks are allocated in batches of 32 at a time and assigned
     to every running task. The batch is held by the ret_stack_list
     array.

     But when the conversion happened, instead of allocating an array of
     32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE).
     This ret_stack_list gets passed to a function that iterates over
     what it believes is its size defined by the
     FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32).

     Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise
     this would have been an array overflow. This still should be fixed
     and the ret_stack_list should be allocated to the size it is
     expected to be as someday it may end up being bigger than
     SHADOW_STACK_SIZE"

* tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Allocate ret_stack_list with proper size
  fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
2024-10-19 12:42:14 -07:00
Steven Rostedt
fae4078c28 fgraph: Allocate ret_stack_list with proper size
The ret_stack_list is an array of ret_stack shadow stacks for the function
graph usage. When the first function graph is enabled, all tasks in the
system get a shadow stack. The ret_stack_list is a 32 element array of
pointers to these shadow stacks. It allocates the shadow stack in batches
(32 stacks at a time), assigns them to running tasks, and continues until
all tasks are covered.

When the function graph shadow stack changed from an array of
ftrace_ret_stack structures to an array of longs, the allocation of
ret_stack_list went from allocating an array of 32 elements to just a
block defined by SHADOW_STACK_SIZE. Luckily, that's defined as PAGE_SIZE
and is much more than enough to hold 32 pointers. But it is way overkill
for the amount needed to allocate.

Change the allocation of ret_stack_list back to a kcalloc() of
FTRACE_RETSTACK_ALLOC_SIZE pointers.
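
A hedged sketch of the allocation change, using the names from the
text above:

  /* Before: one SHADOW_STACK_SIZE (PAGE_SIZE) block, far more than
   * the 32 pointers actually needed. */
  ret_stack_list = kmalloc(SHADOW_STACK_SIZE, GFP_KERNEL);

  /* After: exactly FTRACE_RETSTACK_ALLOC_SIZE pointer slots. */
  ret_stack_list = kcalloc(FTRACE_RETSTACK_ALLOC_SIZE,
                           sizeof(*ret_stack_list), GFP_KERNEL);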

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241018215212.23f13f40@rorschach
Fixes: 42675b723b ("function_graph: Convert ret_stack to a series of longs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-18 21:57:20 -04:00
Steven Rostedt
2c02f7375e fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
The function graph infrastructure allocates a shadow stack for every task
when enabled, including the idle tasks. The first time the function
graph is invoked, the shadow stacks are created and never freed until
the task exits.

Only the idle tasks that were for online CPUs had their shadow stacks
created when function graph tracing started. If function graph tracing is
enabled and a CPU comes online, the idle task representing that CPU will
not have its shadow stack created, and all function graph tracing for that
idle task will be silently dropped.

Instead, use the CPU hotplug mechanism to allocate the idle shadow stacks.
This will include idle tasks for CPUs that come online during tracing.
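
A hedged sketch of such a registration; the callback name and hotplug
state are illustrative, not the exact patch:

  static int fgraph_cpu_init(unsigned int cpu)
  {
          /* allocate the shadow stack for this CPU's idle task */
          return 0;
  }

  ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph:online",
                          fgraph_cpu_init, NULL);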

This issue can be reproduced by:

 # cd /sys/kernel/tracing
 # echo 0 > /sys/devices/system/cpu/cpu1/online
 # echo 0 > set_ftrace_pid
 # echo function_graph > current_tracer
 # echo 1 > options/funcgraph-proc
 # echo 1 > /sys/devices/system/cpu/cpu1/online
 # grep '<idle>' per_cpu/cpu1/trace | head

Before, nothing would show up.

After:
 1)    <idle>-0    |   0.811 us    |                        __enqueue_entity();
 1)    <idle>-0    |   5.626 us    |                      } /* enqueue_entity */
 1)    <idle>-0    |               |                      dl_server_update_idle_time() {
 1)    <idle>-0    |               |                        dl_scaled_delta_exec() {
 1)    <idle>-0    |   0.450 us    |                          arch_scale_cpu_capacity();
 1)    <idle>-0    |   1.242 us    |                        }
 1)    <idle>-0    |   1.908 us    |                      }
 1)    <idle>-0    |               |                      dl_server_start() {
 1)    <idle>-0    |               |                        enqueue_dl_entity() {
 1)    <idle>-0    |               |                          task_contending() {

Note, if tracing stops and restarts, the old way would then initialize
the onlined CPUs.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241018214300.6df82178@rorschach
Fixes: 868baf07b1 ("ftrace: Fix memory leak with function graph and cpu hotplug")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-18 21:56:56 -04:00
Linus Torvalds
3d5ad2d4ec Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:

 - Fix BPF verifier to not affect subreg_def marks in its range
   propagation (Eduard Zingerman)

 - Fix a truncation bug in the BPF verifier's handling of
   coerce_reg_to_size_sx (Dimitar Kanaliev)

 - Fix the BPF verifier's delta propagation between linked registers
   under 32-bit addition (Daniel Borkmann)

 - Fix a NULL pointer dereference in BPF devmap due to missing rxq
   information (Florian Kauer)

 - Fix a memory leak in bpf_core_apply (Jiri Olsa)

 - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for
   arrays of nested structs (Hou Tao)

 - Fix build ID fetching where memory areas backing the file were
   created with memfd_secret (Andrii Nakryiko)

 - Fix BPF task iterator tid filtering which was incorrectly using pid
   instead of tid (Jordan Rome)

 - Several fixes for BPF sockmap and BPF sockhash redirection in
   combination with vsocks (Michal Luczaj)

 - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri)

 - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility
   of an infinite BPF tailcall (Pu Lehui)

 - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot
   be resolved (Thomas Weißschuh)

 - Fix a bug in kfunc BTF caching for modules where the wrong BTF object
   was returned (Toke Høiland-Jørgensen)

 - Fix a BPF selftest compilation error in cgroup-related tests with
   musl libc (Tony Ambardar)

 - Several fixes to BPF link info dumps to fill missing fields (Tyrone
   Wu)

 - Add BPF selftests for kfuncs from multiple modules, checking that the
   correct kfuncs are called (Simon Sundberg)

 - Ensure that internal and user-facing bpf_redirect flags don't overlap
   (Toke Høiland-Jørgensen)

 - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van
   Riel)

 - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat
   under RT (Wander Lairson Costa)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits)
  lib/buildid: Handle memfd_secret() files in build_id_parse()
  selftests/bpf: Add test case for delta propagation
  bpf: Fix print_reg_state's constant scalar dump
  bpf: Fix incorrect delta propagation between linked registers
  bpf: Properly test iter/task tid filtering
  bpf: Fix iter/task tid filtering
  riscv, bpf: Make BPF_CMPXCHG fully ordered
  bpf, vsock: Drop static vsock_bpf_prot initialization
  vsock: Update msg_count on read_skb()
  vsock: Update rx_bytes on read_skb()
  bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
  selftests/bpf: Add asserts for netfilter link info
  bpf: Fix link info netfilter flags to populate defrag flag
  selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx()
  selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx()
  bpf: Fix truncation bug in coerce_reg_to_size_sx()
  selftests/bpf: Assert link info uprobe_multi count & path_size if unset
  bpf: Fix unpopulated path_size when uprobe_multi fields unset
  selftests/bpf: Fix cross-compiling urandom_read
  selftests/bpf: Add test for kfunc module order
  ...
2024-10-18 16:27:14 -07:00
Daniel Borkmann
3e9e708757 bpf: Fix print_reg_state's constant scalar dump
print_reg_state() should not consider adding reg->off to reg->var_off.value
when dumping scalars. Scalars can be produced with reg->off != 0 through
BPF_ADD_CONST, and thus as-is this can skew the register log dump.
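
A hedged sketch of the distinction (not the exact dump code):

  /* A known-constant scalar is printed from var_off.value alone;
   * reg->off is BPF_ADD_CONST bookkeeping, not part of the value. */
  if (tnum_is_const(reg->var_off))
          verbose(env, "%lld", (long long)reg->var_off.value); /* no + reg->off */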

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241016134913.32249-2-daniel@iogearbox.net
2024-10-17 11:06:34 -07:00
Daniel Borkmann
3878ae04e9 bpf: Fix incorrect delta propagation between linked registers
Nathaniel reported a bug in the linked scalar delta tracking, which can lead
to accepting a program with OOB access. The specific code is related to the
sync_linked_regs() function and the BPF_ADD_CONST flag, which signifies a
constant offset between two scalar registers tracked by the same register id.

The verifier attempts to track "similar" scalars in order to propagate bounds
information learned about one scalar to others. For instance, if r1 and r2
are known to contain the same value, then upon encountering 'if (r1 != 0x1234)
goto xyz', not only does it know that r1 is equal to 0x1234 on the path where
that conditional jump is not taken, it also knows that r2 is.

Additionally, with env->bpf_capable set, the verifier will track scalars
which should be a constant delta apart (if r1 is known to be one greater than
r2, then if r1 is known to be equal to 0x1234, r2 must be equal to 0x1233.)
The code path for the latter in adjust_reg_min_max_vals() is reached when
processing both 32 and 64-bit addition operations. While adjust_reg_min_max_vals()
knows whether dst_reg was produced by a 32 or a 64-bit addition (based on the
alu32 bool), the only information saved in dst_reg is the id of the source
register (reg->id, or'ed by BPF_ADD_CONST) and the value of the constant
offset (reg->off).

Later, the function sync_linked_regs() will attempt to use this information
to propagate bounds information from one register (known_reg) to others,
meaning, for all R in linked_regs, it copies known_reg range (and possibly
adjusting delta) into R for the case of R->id == known_reg->id.

For the delta adjustment, meaning, matching reg->id with BPF_ADD_CONST, the
verifier adjusts the register as reg = known_reg; reg += delta where delta
is computed as (s32)reg->off - (s32)known_reg->off and placed as a scalar
into a fake_reg to then simulate the addition of reg += fake_reg. This is
only correct, however, if the value in reg was created by a 64-bit addition.
When reg contains the result of a 32-bit addition operation, its upper 32
bits will always be zero. sync_linked_regs() on the other hand, may cause
the verifier to believe that the addition between fake_reg and reg overflows
into those upper bits. For example, if reg was generated by adding the
constant 1 to known_reg using a 32-bit alu operation, then reg->off is 1
and known_reg->off is 0. If known_reg is known to be the constant 0xFFFFFFFF,
sync_linked_regs() will tell the verifier that reg is equal to the constant
0x100000000. This is incorrect as the actual value of reg will be 0, as the
32-bit addition will wrap around.

Example:

  0: (b7) r0 = 0;             R0_w=0
  1: (18) r1 = 0x80000001;    R1_w=0x80000001
  3: (37) r1 /= 1;            R1_w=scalar()
  4: (bf) r2 = r1;            R1_w=scalar(id=1) R2_w=scalar(id=1)
  5: (bf) r4 = r1;            R1_w=scalar(id=1) R4_w=scalar(id=1)
  6: (04) w2 += 2147483647;   R2_w=scalar(id=1+2147483647,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  7: (04) w4 += 0 ;           R4_w=scalar(id=1+0,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  8: (15) if r2 == 0x0 goto pc+1
 10: R0=0 R1=0xffffffff80000001 R2=0x7fffffff R4=0xffffffff80000001 R10=fp0

What can be seen here is that r1 is copied to r2 and r4, such that {r1,r2,r4}.id
are all the same which later lets sync_linked_regs() to be invoked. Then, in
a next step constants are added with alu32 to r2 and r4, setting their ->off,
as well as id |= BPF_ADD_CONST. Next, the conditional will bind r2 and
propagate ranges to its linked registers. The verifier now believes the upper
32 bits of r4 are r4=0xffffffff80000001, while actually r4=r1=0x80000001.

One approach for a simple fix suitable also for stable is to limit the constant
delta tracking to only 64-bit alu addition. If necessary at some later point,
BPF_ADD_CONST could be split into BPF_ADD_CONST64 and BPF_ADD_CONST32 to avoid
mixing the two under the tradeoff to further complicate sync_linked_regs().
However, none of the added tests from dedf56d775 ("selftests/bpf: Add tests
for add_const") make this necessary at this point, meaning, BPF CI also passes
with just limiting tracking to 64-bit alu addition.
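
A hedged sketch of that guard; the surrounding condition is
simplified:

  /* Only record a constant delta between linked registers for 64-bit
   * BPF_ADD; a 32-bit add can wrap and invalidate the delta. */
  if (!alu32 && dst_reg->id && is_reg_const(src_reg, false)) {
          dst_reg->id |= BPF_ADD_CONST;
          dst_reg->off = (s32)src_reg->var_off.value;
  } else {
          dst_reg->id = 0;        /* drop the link rather than mistrack */
  }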

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20241016134913.32249-1-daniel@iogearbox.net
2024-10-17 11:06:34 -07:00
Jordan Rome
9495a5b731 bpf: Fix iter/task tid filtering
In userspace, you can add a tid filter by setting
the "task.tid" field for "bpf_iter_link_info".
However, `get_pid_task` when called for the
`BPF_TASK_ITER_TID` type should have been using
`PIDTYPE_PID` (tid) instead of `PIDTYPE_TGID` (pid).
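
The fix is essentially one line; a hedged sketch:

  /* BPF_TASK_ITER_TID means "this specific thread", so resolve the
   * number as a thread id rather than a thread-group id. */
  task = get_pid_task(pid, PIDTYPE_PID);  /* was: PIDTYPE_TGID */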

Fixes: f0d74c4da1 ("bpf: Parameterize task iterators.")
Signed-off-by: Jordan Rome <linux@jordanrome.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241016210048.1213935-1-linux@jordanrome.com
2024-10-17 10:52:18 -07:00
Linus Torvalds
07d6bf634b Merge tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Current release - new code bugs:

   - eth: mlx5: HWS, don't destroy more bwc queue locks than allocated

  Previous releases - regressions:

   - ipv4: give an IPv4 dev to blackhole_netdev

   - udp: compute L4 checksum as usual when not segmenting the skb

   - tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().

   - eth: mlx5e: don't call cleanup on profile rollback failure

   - eth: microchip: vcap api: fix memory leaks in
     vcap_api_encode_rule_test()

   - eth: enetc: disable Tx BD rings after they are empty

   - eth: macb: avoid 20s boot delay by skipping MDIO bus registration
     for fixed-link PHY

  Previous releases - always broken:

   - posix-clock: fix missing timespec64 check in pc_clock_settime()

   - genetlink: hold RCU in genlmsg_mcast()

   - mptcp: prevent MPC handshake on port-based signal endpoints

   - eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame

   - eth: stmmac: dwmac-tegra: fix link bring-up sequence

   - eth: bcmasp: fix potential memory leak in bcmasp_xmit()

  Misc:

   - add Andrew Lunn as a co-maintainer of all networking drivers"

* tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net/mlx5e: Don't call cleanup on profile rollback failure
  net/mlx5: Unregister notifier on eswitch init failure
  net/mlx5: Fix command bitmask initialization
  net/mlx5: Check for invalid vector index on EQ creation
  net/mlx5: HWS, use lock classes for bwc locks
  net/mlx5: HWS, don't destroy more bwc queue locks than allocated
  net/mlx5: HWS, fixed double free in error flow of definer layout
  net/mlx5: HWS, removed wrong access to a number of rules variable
  mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
  net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
  vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
  net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
  net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
  net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
  net: phy: mdio-bcm-unimac: Add BCM6846 support
  dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio
  udp: Compute L4 checksum as usual when not segmenting the skb
  genetlink: hold RCU in genlmsg_mcast()
  net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
  tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
  ...
2024-10-17 09:31:18 -07:00
Ingo Molnar
be602cde65 Merge branch 'linus' into sched/urgent, to resolve conflict
Conflicts:
	kernel/sched/ext.c

There's a context conflict between this upstream commit:

  3fdb9ebcec sched_ext: Start schedulers with consistent p->scx.slice values

... and this fix in sched/urgent:

  98442f0ccd sched: Fix delayed_dequeue vs switched_from_fair()

Resolve it.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-10-17 09:58:07 +02:00
Linus Torvalds
dff6584301 Merge tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:

 - More issues reported in the enable/disable paths on large machines
   with many tasks due to scx_tasks_lock being held too long. Break up
   the task iterations

 - Remove ops.select_cpu() dependency in bypass mode so that a
   misbehaving implementation can't live-lock the machine by pushing all
   tasks to few CPUs in bypass mode

 - Other misc fixes

* tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Remove unnecessary cpu_relax()
  sched_ext: Don't hold scx_tasks_lock for too long
  sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
  sched_ext: bypass mode shouldn't depend on ops.select_cpu()
  sched_ext: Move scx_builtin_idle_enabled check to scx_bpf_select_cpu_dfl()
  sched_ext: Start schedulers with consistent p->scx.slice values
  Revert "sched_ext: Use shorter slice while bypassing"
  sched_ext: use correct function name in pick_task_scx() warning message
  selftests: sched_ext: Add sched_ext as proper selftest target
2024-10-15 19:47:19 -07:00
Linus Torvalds
2f87d0916c Merge tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fixes from Steven Rostedt:

 - Fix ref counter of buffers assigned at boot up

   A tracing instance can be created from the kernel command line. If it
   maps to memory, it is considered permanent and should not be deleted,
   or bad things can happen. If it is not mapped to memory, then the
   user is fine to delete it via rmdir from the instances directory. The
   ref counting assumed that a count of 0 meant an instance was free to
   remove and that greater than zero meant it was not, but instances
   were not set up that way. When an instance is created, it should
   have a reference count of 1, and if it should not be removed, the
   count must be greater than 1. The boot up code set normal instances
   with a ref count of 0, so they could get removed if something
   accessed them and then released them. And memory mapped instances
   had a ref count of 1, which meant they could be deleted, and bad
   things would happen. Keep normal instances at a ref count of 1, and
   set memory mapped instances to 2.

 - Protect sub buffer size (order) updates from other modifications

   When a ring buffer is changing the size of its sub-buffers, no other
   operations should be performed on the ring buffer. That includes
   reading it. But the locking only grabbed the buffer->mutex that keeps
   some operations from touching the ring buffer. It must also hold the
   cpu_buffer->reader_lock when updates happen, as other paths use that
   lock for some operations on the ring buffer.

* tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix reader locking when changing the sub buffer order
  ring-buffer: Fix refcount setting of boot mapped buffers
2024-10-15 11:18:44 -07:00
Dimitar Kanaliev
ae67b9fb8c bpf: Fix truncation bug in coerce_reg_to_size_sx()
coerce_reg_to_size_sx() updates the register state after a sign-extension
operation. However, there's a bug in the assignment order of the unsigned
min/max values, leading to incorrect truncation:

  0: (85) call bpf_get_prandom_u32#7    ; R0_w=scalar()
  1: (57) r0 &= 1                       ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
  2: (07) r0 += 254                     ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1))
  3: (bf) r0 = (s8)r0                   ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1))

In the current implementation, the unsigned 32-bit min/max values
(u32_min_value and u32_max_value) are assigned directly from the 64-bit
signed min/max values (s64_min and s64_max):

  reg->umin_value = reg->u32_min_value = s64_min;
  reg->umax_value = reg->u32_max_value = s64_max;

Due to the chained assignment, this is equivalent to:

  reg->u32_min_value = s64_min;  // Unintended truncation
  reg->umin_value = reg->u32_min_value;
  reg->u32_max_value = s64_max;  // Unintended truncation
  reg->umax_value = reg->u32_max_value;
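
One way to express the corrected order, as a hedged sketch:

  /* Assign the 64-bit bounds straight from s64_min/s64_max so they
   * are no longer routed through (and truncated by) the u32 fields. */
  reg->umin_value = s64_min;
  reg->umax_value = s64_max;
  reg->u32_min_value = (u32)s64_min;
  reg->u32_max_value = (u32)s64_max;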

Fixes: 1f9a1ea821 ("bpf: Support new sign-extension load insns")
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Dimitar Kanaliev <dimitar.kanaliev@siteground.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241014121155.92887-2-dimitar.kanaliev@siteground.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-15 11:16:24 -07:00
Petr Pavlu
09661f75e7 ring-buffer: Fix reader locking when changing the sub buffer order
The function ring_buffer_subbuf_order_set() updates each
ring_buffer_per_cpu and installs new sub buffers that match the requested
page order. This operation may be invoked concurrently with readers that
rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or
the ring_buffer_per_cpu.pages and reader_page pointers. However, no
exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying
the mentioned data while a reader also operates on them can then result in
incorrect memory access and various crashes.

Fix the problem by taking the reader_lock when updating a specific
ring_buffer_per_cpu in ring_buffer_subbuf_order_set().
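
A hedged sketch of the locking pattern described:

  /* For each ring_buffer_per_cpu whose sub buffers are replaced: */
  raw_spin_lock_irq(&cpu_buffer->reader_lock);
  /* ... install the new sub buffers of the requested order ... */
  raw_spin_unlock_irq(&cpu_buffer->reader_lock);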

Link: https://lore.kernel.org/linux-trace-kernel/20240715145141.5528-1-petr.pavlu@suse.com/
Link: https://lore.kernel.org/linux-trace-kernel/20241010195849.2f77cc3f@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20241011112850.17212b25@gandalf.local.home/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241015112440.26987-1-petr.pavlu@suse.com
Fixes: 8e7b58c27b ("ring-buffer: Just update the subbuffers when changing their allocation order")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-15 11:18:51 -04:00
Jinjie Ruan
d8794ac20a posix-clock: Fix missing timespec64 check in pc_clock_settime()
As Andrew pointed out, it makes sense for the PTP core to check the
timespec64 struct's tv_sec and tv_nsec ranges before calling
ptp->info->settime64().

As the man page of clock_settime() says, if tp.tv_sec is negative or
tp.tv_nsec is outside the range [0..999,999,999], it should return
EINVAL, which includes dynamic clocks that handle PTP clocks, and the
condition is consistent with timespec64_valid(). As Thomas suggested,
timespec64_valid() only checks that the timespec is valid, but does not
ensure that the time is in a valid range, so check it ahead of time
using timespec64_valid_strict() in pc_clock_settime() and return
-EINVAL if it is not valid.

Some drivers use tp->tv_sec and tp->tv_nsec directly to write registers
without validity checks, assuming that the higher layer has checked
them, which is dangerous and will benefit from this; examples are
hclge_ptp_settime(), igb_ptp_settime_i210() and
_rcar_gen4_ptp_settime(). Those drivers can now remove their own
checks.
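
The check itself is small; a hedged sketch:

  static int pc_clock_settime(clockid_t id, const struct timespec64 *ts)
  {
          /* ... */
          if (!timespec64_valid_strict(ts))
                  return -EINVAL;
          /* ... then forward to the driver's settime64() ... */
  }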

Cc: stable@vger.kernel.org
Fixes: 0606f422b4 ("posix clocks: Introduce dynamic clocks")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:22:43 -07:00
David Vernet
60e339be10 sched_ext: Remove unnecessary cpu_relax()
As described in commit b07996c7ab ("sched_ext: Don't hold
scx_tasks_lock for too long"), we're doing a cond_resched() every 32
calls to scx_task_iter_next() to avoid RCU and other stalls. That commit
also added a cpu_relax() to the codepath where we drop and reacquire the
lock, but as Waiman described in [0], cpu_relax() should only be
necessary in busy loops to avoid pounding on a cacheline (or to allow a
hypertwin to more fully utilize a core).

Let's remove the unnecessary cpu_relax().

[0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/
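
The iteration pattern in question, as a hedged sketch (constants and
lock primitives simplified):

  /* Every 32 iterations, drop the lock and yield; no cpu_relax() is
   * needed since this is not a spin-wait on a contended cacheline. */
  if (!(++iter->cnt & 0x1f)) {
          spin_unlock_irq(&scx_tasks_lock);
          cond_resched();
          spin_lock_irq(&scx_tasks_lock);
  }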

Cc: Waiman Long <llong@redhat.com>
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-14 13:23:49 -10:00