Commit Graph

45917 Commits

Author SHA1 Message Date
Linus Torvalds
622a3ed1ac Merge tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix tp_printk command line option crashing the kernel

   With the code that can handle a buffer from a previous boot, the
   trace_check_vprintf() needed access to the delta of the address space
   used by the old buffer and the current buffer. To do so, the
   trace_array (tr) parameter was used. But when tp_printk is enabled on
   the kernel command line, no trace buffer is used and the trace event
   is sent directly to printk(). That meant the tr field of the iterator
   descriptor was NULL, and since tp_printk still uses
   trace_check_vprintf() it caused a NULL dereference.

 - Add ptrace.h include to x86 ftrace file for completeness

 - Fix rtla installation when done with out-of-tree build

 - Fix the help messages in rtla that were incorrect

 - Several fixes to fix races with the timerlat and hwlat code

   Several locking issues were discovered with the coordination between
   timerlat kthread creation and hotplug. As timerlat has callbacks from
   hotplug code to start kthreads when CPUs come online. There are also
   locking issues with grabbing the cpu_read_lock() and the locks within
   timerlat.

* tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/hwlat: Fix a race during cpuhp processing
  tracing/timerlat: Fix a race during cpuhp processing
  tracing/timerlat: Drop interface_lock in stop_kthread()
  tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
  x86/ftrace: Include <asm/ptrace.h>
  rtla: Fix the help text in osnoise and timerlat top tools
  tools/rtla: Fix installation from out-of-tree build
  tracing: Fix trace_check_vprintf() when tp_printk is used
2024-10-04 12:11:06 -07:00
Linus Torvalds
f6785e0ccf Merge tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
 "Fixes for issues introduced in this merge window: kobject memory leak,
  unsupressed warning and possible lockup in new slub_kunit tests,
  misleading code in kvfree_rcu_queue_batch()"

* tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  slub/kunit: skip test_kfree_rcu when the slub kunit test is built-in
  mm, slab: suppress warnings in test_leak_destroy kunit test
  rcu/kvfree: Refactor kvfree_rcu_queue_batch()
  mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()
2024-10-04 12:05:39 -07:00
Linus Torvalds
6cca119588 Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull close_range() fix from Al Viro:
 "Fix the logic in descriptor table trimming"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  close_range(): fix the logics in descriptor table trimming
2024-10-04 09:46:16 -07:00
Johannes Weiner
3840cbe24c sched: psi: fix bogus pressure spikes from aggregation race
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-03 16:03:16 -07:00
Wei Li
2a13ca2e8a tracing/hwlat: Fix a race during cpuhp processing
The cpuhp online/offline processing race also exists in percpu-mode hwlat
tracer in theory, apply the fix too. That is:

    T1                       | T2
    [CPUHP_ONLINE]           | cpu_device_down()
     hwlat_hotplug_workfn()  |
                             |     cpus_write_lock()
                             |     takedown_cpu(1)
                             |     cpus_write_unlock()
    [CPUHP_OFFLINE]          |
        cpus_read_lock()     |
        start_kthread(1)     |
        cpus_read_unlock()   |

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
Fixes: ba998f7d95 ("trace/hwlat: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:23 -04:00
Wei Li
829e0c9f08 tracing/timerlat: Fix a race during cpuhp processing
There is another found exception that the "timerlat/1" thread was
scheduled on CPU0, and lead to timer corruption finally:

```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
 <TASK>
 ? __warn+0x7c/0x110
 ? debug_print_object+0x7d/0xb0
 ? report_bug+0xf1/0x1d0
 ? prb_read_valid+0x17/0x20
 ? handle_bug+0x3f/0x70
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? debug_print_object+0x7d/0xb0
 ? debug_print_object+0x7d/0xb0
 ? __pfx_timerlat_irq+0x10/0x10
 __debug_object_init+0x110/0x150
 hrtimer_init+0x1d/0x60
 timerlat_main+0xab/0x2d0
 ? __pfx_timerlat_main+0x10/0x10
 kthread+0xb7/0xe0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
```

After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:

T1                       | T2
[CPUHP_ONLINE]           | cpu_device_down()
osnoise_hotplug_workfn() |
                         |     cpus_write_lock()
                         |     takedown_cpu(1)
                         |     cpus_write_unlock()
[CPUHP_OFFLINE]          |
    cpus_read_lock()     |
    start_kthread(1)     |
    cpus_read_unlock()   |

To fix this, skip online processing if the CPU is already offline.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e271f ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:22 -04:00
Wei Li
b484a02c9c tracing/timerlat: Drop interface_lock in stop_kthread()
stop_kthread() is the offline callback for "trace/osnoise:online", since
commit 5bfbcd1ee5 ("tracing/timerlat: Add interface_lock around clearing
of kthread in stop_kthread()"), the following ABBA deadlock scenario is
introduced:

T1                            | T2 [BP]               | T3 [AP]
osnoise_hotplug_workfn()      | work_for_cpu_fn()     | cpuhp_thread_fun()
                              |   _cpu_down()         |   osnoise_cpu_die()
  mutex_lock(&interface_lock) |                       |     stop_kthread()
                              |     cpus_write_lock() |       mutex_lock(&interface_lock)
  cpus_read_lock()            |     cpuhp_kick_ap()   |

As the interface_lock here in just for protecting the "kthread" field of
the osn_var, use xchg() instead to fix this issue. Also use
for_each_online_cpu() back in stop_per_cpu_kthreads() as it can take
cpu_read_lock() again.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-3-liwei391@huawei.com
Fixes: 5bfbcd1ee5 ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:22 -04:00
Wei Li
0bb0a5c12e tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
osnoise_hotplug_workfn() is the asynchronous online callback for
"trace/osnoise:online". It may be congested when a CPU goes online and
offline repeatedly and is invoked for multiple times after a certain
online.

This will lead to kthread leak and timer corruption. Add a check
in start_kthread() to prevent this situation.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
Fixes: c8895e271f ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:22 -04:00
Steven Rostedt
50a3242d84 tracing: Fix trace_check_vprintf() when tp_printk is used
When the tp_printk kernel command line is used, the trace events go
directly to printk(). It is still checked via the trace_check_vprintf()
function to make sure the pointers of the trace event are legit.

The addition of reading buffers from previous boots required adding a
delta between the addresses of the previous boot and the current boot so
that the pointers in the old buffer can still be used. But this required
adding a trace_array pointer to acquire the delta offsets.

The tp_printk code does not provide a trace_array (tr) pointer, so when
the offsets were examined, a NULL pointer dereference happened and the
kernel crashed.

If the trace_array does not exist, just default the delta offsets to zero,
as that also means the trace event is not being read from a previous boot.

Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home
Fixes: 07714b4bb3 ("tracing: Handle old buffer mappings for event strings and functions")
Reported-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:22 -04:00
Linus Torvalds
7ec462100e Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull generic unaligned.h cleanups from Al Viro:
 "Get rid of architecture-specific <asm/unaligned.h> includes, replacing
  them with a single generic <linux/unaligned.h> header file.

  It's the second largest (after asm/io.h) class of asm/* includes, and
  all but two architectures actually end up using exact same file.

  Massage the remaining two (arc and parisc) to do the same and just
  move the thing to from asm-generic/unaligned.h to linux/unaligned.h"

[ This is one of those things that we're better off doing outside the
  merge window, and would only cause extra conflict noise if it was in
  linux-next for the next release due to all the trivial #include line
  updates.  Rip off the band-aid.   - Linus ]

* tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move asm/unaligned.h to linux/unaligned.h
  arc: get rid of private asm/unaligned.h
  parisc: get rid of private asm/unaligned.h
2024-10-02 16:42:28 -07:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Uladzislau Rezki (Sony)
3c5d61ae91 rcu/kvfree: Refactor kvfree_rcu_queue_batch()
Improve readability of kvfree_rcu_queue_batch() function
in away that, after a first batch queuing, the loop is break
and success value is returned to a caller.

There is no reason to loop and check batches further as all
outstanding objects have already been picked and attached to
a certain batch to complete an offloading.

Fixes: 2b55d6a42d ("rcu/kvfree: Add kvfree_rcu_barrier() API")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/lkml/ZvWUt2oyXRsvJRNc@pc636/T/
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-01 18:30:42 +02:00
Linus Torvalds
e32cde8d2b Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:

 - When sched_ext is in bypass mode (e.g. while disabling the BPF
   scheduler), it was using one DSQ to implement global FIFO scheduling
   as all it has to do is guaranteeing reasonable forward progress.

   On multi-socket machines, this can lead to live-lock conditions under
   certain workloads. Fixed by splitting the queue used for FIFO
   scheduling per NUMA node. This required several preparation patches.

 - Hotplug tests on powerpc could reliably trigger deadlock while
   enabling a BPF scheduler.

   This was caused by cpu_hotplug_lock nesting inside scx_fork_rwsem and
   then CPU hotplug path trying to fork a new thread while holding
   cpu_hotplug_lock.

   Fixed by restructuring locking in enable and disable paths so that
   the two locks are not coupled. This required several preparation
   patches which also fixed a couple other issues in the enable path.

 - A build fix for !CONFIG_SMP

 - Userspace tooling sync and updates

* tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Remove redundant p->nr_cpus_allowed checker
  sched_ext: Decouple locks in scx_ops_enable()
  sched_ext: Decouple locks in scx_ops_disable_workfn()
  sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
  sched_ext: Enable scx_ops_init_task() separately
  sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
  sched_ext: Initialize in bypass mode
  sched_ext: Remove SCX_OPS_PREPPING
  sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
  sched_ext: Use shorter slice while bypassing
  sched_ext: Split the global DSQ per NUMA node
  sched_ext: Relocate find_user_dsq()
  sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
  scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
  tools/sched_ext: Receive misc updates from SCX repo
  sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle
  sched_ext: Build fix for !CONFIG_SMP
2024-09-30 12:58:17 -07:00
Linus Torvalds
190ecde722 Merge tag 'probes-fixes-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fix from Masami Hiramatsu:

 - uprobes: fix kernel info leak via "[uprobes]" vma

   Fix uprobes not to expose the uninitialized page for trampoline
   buffer to user space, which can leak kernel info.

* tag 'probes-fixes-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  uprobes: fix kernel info leak via "[uprobes]" vma
2024-09-30 11:06:01 -07:00
Al Viro
678379e1d4 close_range(): fix the logics in descriptor table trimming
Cloning a descriptor table picks the size that would cover all currently
opened files.  That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
	close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.

Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().

The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.

That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.

It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size().  There we are under ->files_lock,
so the race is trivially avoided.

So we do the following:
	* switch close_files() from calling unshare_fd() to calling
dup_fd().
	* undo the calling convention change done to unshare_fd() in
60997c3d45 "close_range: add CLOSE_RANGE_UNSHARE"
	* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting.  NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
	* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
	* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.

Fixes: 60997c3d45 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-09-29 21:52:29 -04:00
Oleg Nesterov
34820304cc uprobes: fix kernel info leak via "[uprobes]" vma
xol_add_vma() maps the uninitialized page allocated by __create_xol_area()
into userspace. On some architectures (x86) this memory is readable even
without VM_READ, VM_EXEC results in the same pgprot_t as VM_EXEC|VM_READ,
although this doesn't really matter, debugger can read this memory anyway.

Link: https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/

Reported-by: Will Deacon <will@kernel.org>
Fixes: d4b3b6384f ("uprobes/core: Allocate XOL slots for uprobes use")
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-30 08:19:11 +09:00
Linus Torvalds
ec03de73b1 Merge tag 'locking-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "lockdep:
    - Fix potential deadlock between lockdep and RCU (Zhiguo Niu)
    - Use str_plural() to address Coccinelle warning (Thorsten Blum)
    - Add debuggability enhancement (Luis Claudio R. Goncalves)

  static keys & calls:
    - Fix static_key_slow_dec() yet again (Peter Zijlstra)
    - Handle module init failure correctly in static_call_del_module()
      (Thomas Gleixner)
    - Replace pointless WARN_ON() in static_call_module_notify() (Thomas
      Gleixner)

  <linux/cleanup.h>:
    - Add usage and style documentation (Dan Williams)

  rwsems:
    - Move is_rwsem_reader_owned() and rwsem_owner() under
      CONFIG_DEBUG_RWSEMS (Waiman Long)

  atomic ops, x86:
    - Redeclare x86_32 arch_atomic64_{add,sub}() as void (Uros Bizjak)
    - Introduce the read64_nonatomic macro to x86_32 with cx8 (Uros
      Bizjak)"

Signed-off-by: Ingo Molnar <mingo@kernel.org>

* tag 'locking-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS
  jump_label: Fix static_key_slow_dec() yet again
  static_call: Replace pointless WARN_ON() in static_call_module_notify()
  static_call: Handle module init failure correctly in static_call_del_module()
  locking/lockdep: Simplify character output in seq_line()
  lockdep: fix deadlock issue between lockdep and rcu
  lockdep: Use str_plural() to fix Coccinelle warning
  cleanup: Add usage and style documentation
  lockdep: suggest the fix for "lockdep bfs error:-1" on print_bfs_bug
  locking/atomic/x86: Redeclare x86_32 arch_atomic64_{add,sub}() as void
  locking/atomic/x86: Introduce the read64_nonatomic macro to x86_32 with cx8
2024-09-29 08:51:30 -07:00
Ingo Molnar
ae39e0bd15 Merge branch 'locking/core' into locking/urgent, to pick up pending commits
Merge all pending locking commits into a single branch.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-09-29 08:57:18 +02:00
Linus Torvalds
6f81a446f8 Merge tag 'modules-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
 "There are a few fixes / cleanups from Vincent, Chunhui, and Petr, but
  the most important part of this pull request is the Rust community
  stepping up to help maintain both C / Rust code for future Rust module
  support. We grow the set of modules maintainers by three now, and with
  this hope to scale to help address what's needed to properly support
  future Rust module support.

  A lot of exciting stuff coming in future kernel releases.

  This has been on linux-next for ~ 3 weeks now with no issues"

* tag 'modules-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  module: Refine kmemleak scanned areas
  module: abort module loading when sysfs setup suffer errors
  MAINTAINERS: scale modules with more reviewers
  module: Clean up the description of MODULE_SIG_<type>
  module: Split modules_install compression and in-kernel decompression
2024-09-28 09:06:15 -07:00
Zhang Qiao
95b873693a sched_ext: Remove redundant p->nr_cpus_allowed checker
select_rq_task() already checked that 'p->nr_cpus_allowed > 1',
'p->nr_cpus_allowed == 1' checker in scx_select_cpu_dfl() is redundant.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27 10:23:45 -10:00
Tejun Heo
efe231d9de sched_ext: Decouple locks in scx_ops_enable()
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems.

For example, currently, there is a possible deadlock involving
scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
scx_fork_rwsem due to locking order existing in other subsystems. However,
there exists a dependency in the other direction during hotplug if hotplug
needs to fork a new task, which happens in some cases. This leads to the
following deadlock:

       scx_ops_enable()                               hotplug

                                          percpu_down_write(&cpu_hotplug_lock)
   percpu_down_write(&scx_fork_rwsem)
   block on cpu_hotplug_lock
                                          kthread_create() waits for kthreadd
					  kthreadd blocks on scx_fork_rwsem

Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.

With the preceding scx_cgroup_enabled change, this can be solved by
decoupling cpus_read_lock, which is needed for static_key manipulations,
from the other two locks.

- Move the first block of static_key manipulations outside of scx_fork_rwsem
  and scx_cgroup_rwsem. This is now safe with the preceding
  scx_cgroup_enabled change.

- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
  blocks so that __scx_ops_enabled static_key enabling is outside the two
  rwsems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
2024-09-27 10:02:40 -10:00
Tejun Heo
160216568c sched_ext: Decouple locks in scx_ops_disable_workfn()
The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems. With the preceding scx_cgroup_enabled change, we can
decouple them:

- As cgroup disabling no longer requires modifying a static_key which
  requires cpus_read_lock(), no need to grab cpus_read_lock() before
  grabbing scx_cgroup_rwsem.

- cgroup can now be independently disabled before tasks are moved back to
  the fair class.

Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
now unnecessary cpus_read_lock() and move static_key operations out of
scx_fork_rwsem. This decouples all three locks in the disable path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
2024-09-27 10:02:40 -10:00
Tejun Heo
568894edbe sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
didn't set SCX_TG_INITED which meant that ops.cgroup_exit(), even if
implemented, won't be called from scx_tg_offline(). This is because
SCX_HAS_OP(cgroupt_init) is used to test both whether SCX cgroup operations
are enabled and ops.cgroup_init() exists.

Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
operations and use SCX_HAS_OP(cgroup_init) only to test whether
ops.cgroup_init() exists. Make all cgroup operations consistently use
scx_cgroup_enabled to test whether cgroup operations are enabled.
scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27 10:02:40 -10:00
Tejun Heo
4269c603cc sched_ext: Enable scx_ops_init_task() separately
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned
on before the first scx_ops_init_task() loop in scx_ops_enable(). However,
if an external entity causes sched_class switch before the loop is complete,
tasks which are not initialized could be switched to SCX.

The following can be reproduced by running a program which keeps toggling a
process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).

  sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
  WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
  ...
  Sched_ext: simple (enabling)
  RIP: 0010:scx_ops_enable_task+0x1a1/0x200
  ...
   switching_to_scx+0x13/0xa0
   __sched_setscheduler+0x850/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by gating scx_ops_init_task() separately using
scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
finished with scx_ops_init_task().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27 10:02:40 -10:00
Tejun Heo
9753358a6a sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the latter switches the eligible ones
into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the
second loop switched it into READY before switching the task into SCX.

The distinction between INIT and READY is only meaningful in the fork path
where it's used to tell whether the task finished forking so that we can
tell ops.exit_task() accordingly. Leaving task in INIT state between the two
loops is incosistent with the fork path and incorrect. The following can be
triggered by running a program which keeps toggling a task between
SCHED_OTHER and SCHED_SCX while enabling a task:

  sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
  WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
  ...
  Sched_ext: qmap (enabling+all)
  RIP: 0010:scx_ops_enable_task+0x1a1/0x200
  ...
   switching_to_scx+0x13/0xa0
   __sched_setscheduler+0x850/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by transitioning to READY in the first loop right after
scx_ops_init_task() succeeds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
2024-09-27 10:02:40 -10:00