While PREEMPT_RT is undoubtedly totally awesome, it does not, at this
time, make sense to have all{yes,mod}config select it.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 35772d627b ("sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT
from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is
not a change in how preemption works, but rather it makes a ton more
code preemptible.
Notably, take away NONE and VOLUNTARY options for PREEMPT_RT, they make
no sense (but are techincally possible).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.441622332@infradead.org
Change fair to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.
This LAZY bit will be promoted to the full NEED_RESCHED bit on tick.
As such, the average delay between setting LAZY and actually
rescheduling will be TICK_NSEC/2.
In short, Lazy preemption will delay preemption for fair class but
will function as Full preemption for all the other classes, most
notably the realtime (RR/FIFO/DEADLINE) classes.
The goal is to bridge the performance gap with Voluntary, such that we
might eventually remove that option entirely.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. A provided analysis and solution has been ignored for four
months.
Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
Idle tasks are initialized via __sched_fork() twice:
fork_idle()
copy_process()
sched_fork()
__sched_fork()
init_idle()
__sched_fork()
Instead of cleaning this up, sched_ext hacked around it. Even when analyis
and solution were provided in a discussion, nobody cared to clean this up.
init_idle() is also invoked from sched_init() to initialize the boot CPU's
idle task, which requires the __sched_fork() invocation. But this can be
trivially solved by invoking __sched_fork() before init_idle() in
sched_init() and removing the __sched_fork() invocation from init_idle().
Do so and clean up the comments explaining this historical leftover.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de
What psi needs to do on each enqueue and dequeue has gotten more
subtle, and the generic sched code trying to distill this into a bool
for the callbacks is awkward.
Pass the flags directly and let psi parse them. For that to work, the
#include "stats.h" (which has the psi callback implementations) needs
to be below the flag definitions in "sched.h". Move that section
further down, next to some of the other accounting stuff.
This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump
label, slightly reducing overhead when PSI=y but runtime disabled.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241014144358.GB1021@cmpxchg.org
Overlapping fixes solving the same bug slightly differently:
7266f0a6d3 fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry
3b80552e70 bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
Use the upstream version.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduling fixes from Borislav Petkov:
- Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their
state, i.e., whether they're runnable or blocked
- Handle delayed dequeued tasks and their migration wrt PSI properly
- Fix the situation where a delayed dequeue task gets enqueued into a
new class, which should not happen
- Fix a case where memory allocation would happen while the runqueue
lock is held, which is a no-no
- Do not over-schedule when tasks with shorter slices preempt the
currently running task
- Make sure delayed to deque entities are properly handled before
unthrottling
- Other smaller cleanups and improvements
* tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add an entry for PREEMPT_RT.
sched/fair: Fix external p->on_rq users
sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
sched/core: Dequeue PSI signals for blocked tasks that are delayed
sched: Fix delayed_dequeue vs switched_from_fair()
sched/core: Disable page allocation in task_tick_mm_cid()
sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
sched: Fix sched_delayed vs cfs_bandwidth
Pull ftrace fixes from Steven Rostedt:
"A couple of fixes to function graph infrastructure:
- Fix allocation of idle shadow stack allocation during hotplug
If function graph tracing is started when a CPU is offline, if it
were come online during the trace then the idle task that
represents the CPU will not get a shadow stack allocated for it.
This means all function graph hooks that happen while that idle
task is running (including in interrupt mode) will have all its
events dropped.
Switch over to the CPU hotplug mechanism that will have any newly
brought on line CPU get a callback that can allocate the shadow
stack for its idle task.
- Fix allocation size of the ret_stack_list array
When function graph tracing converted over to allowing more than
one user at a time, it had to convert its shadow stack from an
array of ret_stack structures to an array of unsigned longs. The
shadow stacks are allocated in batches of 32 at a time and assigned
to every running task. The batch is held by the ret_stack_list
array.
But when the conversion happened, instead of allocating an array of
32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE).
This ret_stack_list gets passed to a function that iterates over
what it believes is its size defined by the
FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32).
Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise
this would have been an array overflow. This still should be fixed
and the ret_stack_list should be allocated to the size it is
expected to be as someday it may end up being bigger than
SHADOW_STACK_SIZE"
* tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Allocate ret_stack_list with proper size
fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
The ret_stack_list is an array of ret_stack shadow stacks for the function
graph usage. When the first function graph is enabled, all tasks in the
system get a shadow stack. The ret_stack_list is a 32 element array of
pointers to these shadow stacks. It allocates the shadow stack in batches
(32 stacks at a time), assigns them to running tasks, and continues until
all tasks are covered.
When the function graph shadow stack changed from an array of
ftrace_ret_stack structures to an array of longs, the allocation of
ret_stack_list went from allocating an array of 32 elements to just a
block defined by SHADOW_STACK_SIZE. Luckily, that's defined as PAGE_SIZE
and is much more than enough to hold 32 pointers. But it is way overkill
for the amount needed to allocate.
Change the allocation of ret_stack_list back to a kcalloc() of
FTRACE_RETSTACK_ALLOC_SIZE pointers.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241018215212.23f13f40@rorschach
Fixes: 42675b723b ("function_graph: Convert ret_stack to a series of longs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function graph infrastructure allocates a shadow stack for every task
when enabled. This includes the idle tasks. The first time the function
graph is invoked, the shadow stacks are created and never freed until the
task exits. This includes the idle tasks.
Only the idle tasks that were for online CPUs had their shadow stacks
created when function graph tracing started. If function graph tracing is
enabled and a CPU comes online, the idle task representing that CPU will
not have its shadow stack created, and all function graph tracing for that
idle task will be silently dropped.
Instead, use the CPU hotplug mechanism to allocate the idle shadow stacks.
This will include idle tasks for CPUs that come online during tracing.
This issue can be reproduced by:
# cd /sys/kernel/tracing
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > set_ftrace_pid
# echo function_graph > current_tracer
# echo 1 > options/funcgraph-proc
# echo 1 > /sys/devices/system/cpu/cpu1
# grep '<idle>' per_cpu/cpu1/trace | head
Before, nothing would show up.
After:
1) <idle>-0 | 0.811 us | __enqueue_entity();
1) <idle>-0 | 5.626 us | } /* enqueue_entity */
1) <idle>-0 | | dl_server_update_idle_time() {
1) <idle>-0 | | dl_scaled_delta_exec() {
1) <idle>-0 | 0.450 us | arch_scale_cpu_capacity();
1) <idle>-0 | 1.242 us | }
1) <idle>-0 | 1.908 us | }
1) <idle>-0 | | dl_server_start() {
1) <idle>-0 | | enqueue_dl_entity() {
1) <idle>-0 | | task_contending() {
Note, if tracing stops and restarts, the old way would then initialize
the onlined CPUs.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241018214300.6df82178@rorschach
Fixes: 868baf07b1 ("ftrace: Fix memory leak with function graph and cpu hotplug")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull bpf fixes from Daniel Borkmann:
- Fix BPF verifier to not affect subreg_def marks in its range
propagation (Eduard Zingerman)
- Fix a truncation bug in the BPF verifier's handling of
coerce_reg_to_size_sx (Dimitar Kanaliev)
- Fix the BPF verifier's delta propagation between linked registers
under 32-bit addition (Daniel Borkmann)
- Fix a NULL pointer dereference in BPF devmap due to missing rxq
information (Florian Kauer)
- Fix a memory leak in bpf_core_apply (Jiri Olsa)
- Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for
arrays of nested structs (Hou Tao)
- Fix build ID fetching where memory areas backing the file were
created with memfd_secret (Andrii Nakryiko)
- Fix BPF task iterator tid filtering which was incorrectly using pid
instead of tid (Jordan Rome)
- Several fixes for BPF sockmap and BPF sockhash redirection in
combination with vsocks (Michal Luczaj)
- Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri)
- Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility
of an infinite BPF tailcall (Pu Lehui)
- Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot
be resolved (Thomas Weißschuh)
- Fix a bug in kfunc BTF caching for modules where the wrong BTF object
was returned (Toke Høiland-Jørgensen)
- Fix a BPF selftest compilation error in cgroup-related tests with
musl libc (Tony Ambardar)
- Several fixes to BPF link info dumps to fill missing fields (Tyrone
Wu)
- Add BPF selftests for kfuncs from multiple modules, checking that the
correct kfuncs are called (Simon Sundberg)
- Ensure that internal and user-facing bpf_redirect flags don't overlap
(Toke Høiland-Jørgensen)
- Switch to use kvzmalloc to allocate BPF verifier environment (Rik van
Riel)
- Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat
under RT (Wander Lairson Costa)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits)
lib/buildid: Handle memfd_secret() files in build_id_parse()
selftests/bpf: Add test case for delta propagation
bpf: Fix print_reg_state's constant scalar dump
bpf: Fix incorrect delta propagation between linked registers
bpf: Properly test iter/task tid filtering
bpf: Fix iter/task tid filtering
riscv, bpf: Make BPF_CMPXCHG fully ordered
bpf, vsock: Drop static vsock_bpf_prot initialization
vsock: Update msg_count on read_skb()
vsock: Update rx_bytes on read_skb()
bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
selftests/bpf: Add asserts for netfilter link info
bpf: Fix link info netfilter flags to populate defrag flag
selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx()
selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx()
bpf: Fix truncation bug in coerce_reg_to_size_sx()
selftests/bpf: Assert link info uprobe_multi count & path_size if unset
bpf: Fix unpopulated path_size when uprobe_multi fields unset
selftests/bpf: Fix cross-compiling urandom_read
selftests/bpf: Add test for kfunc module order
...
Nathaniel reported a bug in the linked scalar delta tracking, which can lead
to accepting a program with OOB access. The specific code is related to the
sync_linked_regs() function and the BPF_ADD_CONST flag, which signifies a
constant offset between two scalar registers tracked by the same register id.
The verifier attempts to track "similar" scalars in order to propagate bounds
information learned about one scalar to others. For instance, if r1 and r2
are known to contain the same value, then upon encountering 'if (r1 != 0x1234)
goto xyz', not only does it know that r1 is equal to 0x1234 on the path where
that conditional jump is not taken, it also knows that r2 is.
Additionally, with env->bpf_capable set, the verifier will track scalars
which should be a constant delta apart (if r1 is known to be one greater than
r2, then if r1 is known to be equal to 0x1234, r2 must be equal to 0x1233.)
The code path for the latter in adjust_reg_min_max_vals() is reached when
processing both 32 and 64-bit addition operations. While adjust_reg_min_max_vals()
knows whether dst_reg was produced by a 32 or a 64-bit addition (based on the
alu32 bool), the only information saved in dst_reg is the id of the source
register (reg->id, or'ed by BPF_ADD_CONST) and the value of the constant
offset (reg->off).
Later, the function sync_linked_regs() will attempt to use this information
to propagate bounds information from one register (known_reg) to others,
meaning, for all R in linked_regs, it copies known_reg range (and possibly
adjusting delta) into R for the case of R->id == known_reg->id.
For the delta adjustment, meaning, matching reg->id with BPF_ADD_CONST, the
verifier adjusts the register as reg = known_reg; reg += delta where delta
is computed as (s32)reg->off - (s32)known_reg->off and placed as a scalar
into a fake_reg to then simulate the addition of reg += fake_reg. This is
only correct, however, if the value in reg was created by a 64-bit addition.
When reg contains the result of a 32-bit addition operation, its upper 32
bits will always be zero. sync_linked_regs() on the other hand, may cause
the verifier to believe that the addition between fake_reg and reg overflows
into those upper bits. For example, if reg was generated by adding the
constant 1 to known_reg using a 32-bit alu operation, then reg->off is 1
and known_reg->off is 0. If known_reg is known to be the constant 0xFFFFFFFF,
sync_linked_regs() will tell the verifier that reg is equal to the constant
0x100000000. This is incorrect as the actual value of reg will be 0, as the
32-bit addition will wrap around.
Example:
0: (b7) r0 = 0; R0_w=0
1: (18) r1 = 0x80000001; R1_w=0x80000001
3: (37) r1 /= 1; R1_w=scalar()
4: (bf) r2 = r1; R1_w=scalar(id=1) R2_w=scalar(id=1)
5: (bf) r4 = r1; R1_w=scalar(id=1) R4_w=scalar(id=1)
6: (04) w2 += 2147483647; R2_w=scalar(id=1+2147483647,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
7: (04) w4 += 0 ; R4_w=scalar(id=1+0,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
8: (15) if r2 == 0x0 goto pc+1
10: R0=0 R1=0xffffffff80000001 R2=0x7fffffff R4=0xffffffff80000001 R10=fp0
What can be seen here is that r1 is copied to r2 and r4, such that {r1,r2,r4}.id
are all the same which later lets sync_linked_regs() to be invoked. Then, in
a next step constants are added with alu32 to r2 and r4, setting their ->off,
as well as id |= BPF_ADD_CONST. Next, the conditional will bind r2 and
propagate ranges to its linked registers. The verifier now believes the upper
32 bits of r4 are r4=0xffffffff80000001, while actually r4=r1=0x80000001.
One approach for a simple fix suitable also for stable is to limit the constant
delta tracking to only 64-bit alu addition. If necessary at some later point,
BPF_ADD_CONST could be split into BPF_ADD_CONST64 and BPF_ADD_CONST32 to avoid
mixing the two under the tradeoff to further complicate sync_linked_regs().
However, none of the added tests from dedf56d775 ("selftests/bpf: Add tests
for add_const") make this necessary at this point, meaning, BPF CI also passes
with just limiting tracking to 64-bit alu addition.
Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20241016134913.32249-1-daniel@iogearbox.net
Pull networking fixes from Paolo Abeni:
"Current release - new code bugs:
- eth: mlx5: HWS, don't destroy more bwc queue locks than allocated
Previous releases - regressions:
- ipv4: give an IPv4 dev to blackhole_netdev
- udp: compute L4 checksum as usual when not segmenting the skb
- tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().
- eth: mlx5e: don't call cleanup on profile rollback failure
- eth: microchip: vcap api: fix memory leaks in
vcap_api_encode_rule_test()
- eth: enetc: disable Tx BD rings after they are empty
- eth: macb: avoid 20s boot delay by skipping MDIO bus registration
for fixed-link PHY
Previous releases - always broken:
- posix-clock: fix missing timespec64 check in pc_clock_settime()
- genetlink: hold RCU in genlmsg_mcast()
- mptcp: prevent MPC handshake on port-based signal endpoints
- eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame
- eth: stmmac: dwmac-tegra: fix link bring-up sequence
- eth: bcmasp: fix potential memory leak in bcmasp_xmit()
Misc:
- add Andrew Lunn as a co-maintainer of all networking drivers"
* tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
net/mlx5e: Don't call cleanup on profile rollback failure
net/mlx5: Unregister notifier on eswitch init failure
net/mlx5: Fix command bitmask initialization
net/mlx5: Check for invalid vector index on EQ creation
net/mlx5: HWS, use lock classes for bwc locks
net/mlx5: HWS, don't destroy more bwc queue locks than allocated
net/mlx5: HWS, fixed double free in error flow of definer layout
net/mlx5: HWS, removed wrong access to a number of rules variable
mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
net: phy: mdio-bcm-unimac: Add BCM6846 support
dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio
udp: Compute L4 checksum as usual when not segmenting the skb
genetlink: hold RCU in genlmsg_mcast()
net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
...
Conflicts:
kernel/sched/ext.c
There's a context conflict between this upstream commit:
3fdb9ebcec sched_ext: Start schedulers with consistent p->scx.slice values
... and this fix in sched/urgent:
98442f0ccd sched: Fix delayed_dequeue vs switched_from_fair()
Resolve it.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull sched_ext fixes from Tejun Heo:
- More issues reported in the enable/disable paths on large machines
with many tasks due to scx_tasks_lock being held too long. Break up
the task iterations
- Remove ops.select_cpu() dependency in bypass mode so that a
misbehaving implementation can't live-lock the machine by pushing all
tasks to few CPUs in bypass mode
- Other misc fixes
* tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unnecessary cpu_relax()
sched_ext: Don't hold scx_tasks_lock for too long
sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
sched_ext: bypass mode shouldn't depend on ops.select_cpu()
sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
sched_ext: Start schedulers with consistent p->scx.slice values
Revert "sched_ext: Use shorter slice while bypassing"
sched_ext: use correct function name in pick_task_scx() warning message
selftests: sched_ext: Add sched_ext as proper selftest target
Pull ring-buffer fixes from Steven Rostedt:
- Fix ref counter of buffers assigned at boot up
A tracing instance can be created from the kernel command line. If it
maps to memory, it is considered permanent and should not be deleted,
or bad things can happen. If it is not mapped to memory, then the
user is fine to delete it via rmdir from the instances directory. But
the ref counts assumed 0 was free to remove and greater than zero was
not. But this was not the case. When an instance is created, it
should have the reference of 1, and if it should not be removed, it
must be greater than 1. The boot up code set normal instances with a
ref count of 0, which could get removed if something accessed it and
then released it. And memory mapped instances had a ref count of 1
which meant it could be deleted, and bad things happen. Keep normal
instances ref count as 1, and set memory mapped instances ref count
to 2.
- Protect sub buffer size (order) updates from other modifications
When a ring buffer is changing the size of its sub-buffers, no other
operations should be performed on the ring buffer. That includes
reading it. But the locking only grabbed the buffer->mutex that keeps
some operations from touching the ring buffer. It also must hold the
cpu_buffer->reader_lock as well when updates happen as other paths
use that to do some operations on the ring buffer.
* tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Fix reader locking when changing the sub buffer order
ring-buffer: Fix refcount setting of boot mapped buffers
coerce_reg_to_size_sx() updates the register state after a sign-extension
operation. However, there's a bug in the assignment order of the unsigned
min/max values, leading to incorrect truncation:
0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar()
1: (57) r0 &= 1 ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
2: (07) r0 += 254 ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1))
3: (bf) r0 = (s8)r0 ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1))
In the current implementation, the unsigned 32-bit min/max values
(u32_min_value and u32_max_value) are assigned directly from the 64-bit
signed min/max values (s64_min and s64_max):
reg->umin_value = reg->u32_min_value = s64_min;
reg->umax_value = reg->u32_max_value = s64_max;
Due to the chain assigmnent, this is equivalent to:
reg->u32_min_value = s64_min; // Unintended truncation
reg->umin_value = reg->u32_min_value;
reg->u32_max_value = s64_max; // Unintended truncation
reg->umax_value = reg->u32_max_value;
Fixes: 1f9a1ea821 ("bpf: Support new sign-extension load insns")
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Dimitar Kanaliev <dimitar.kanaliev@siteground.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241014121155.92887-2-dimitar.kanaliev@siteground.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As Andrew pointed out, it will make sense that the PTP core
checked timespec64 struct's tv_sec and tv_nsec range before calling
ptp->info->settime64().
As the man manual of clock_settime() said, if tp.tv_sec is negative or
tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL,
which include dynamic clocks which handles PTP clock, and the condition is
consistent with timespec64_valid(). As Thomas suggested, timespec64_valid()
only check the timespec is valid, but not ensure that the time is
in a valid range, so check it ahead using timespec64_valid_strict()
in pc_clock_settime() and return -EINVAL if not valid.
There are some drivers that use tp->tv_sec and tp->tv_nsec directly to
write registers without validity checks and assume that the higher layer
has checked it, which is dangerous and will benefit from this, such as
hclge_ptp_settime(), igb_ptp_settime_i210(), _rcar_gen4_ptp_settime(),
and some drivers can remove the checks of itself.
Cc: stable@vger.kernel.org
Fixes: 0606f422b4 ("posix clocks: Introduce dynamic clocks")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As described in commit b07996c7ab ("sched_ext: Don't hold
scx_tasks_lock for too long"), we're doing a cond_resched() every 32
calls to scx_task_iter_next() to avoid RCU and other stalls. That commit
also added a cpu_relax() to the codepath where we drop and reacquire the
lock, but as Waiman described in [0], cpu_relax() should only be
necessary in busy loops to avoid pounding on a cacheline (or to allow a
hypertwin to more fully utilize a core).
Let's remove the unnecessary cpu_relax().
[0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/
Cc: Waiman Long <llong@redhat.com>
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>