Commit Graph

1991 Commits

Author SHA1 Message Date
Linus Torvalds
67850b7bdc Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace_stop cleanups from Eric Biederman:
 "While looking at the ptrace problems with PREEMPT_RT and the problems
  Peter Zijlstra was encountering with ptrace in his freezer rewrite I
  identified some cleanups to ptrace_stop that make sense on their own
  and move make resolving the other problems much simpler.

  The biggest issue is the habit of the ptrace code to change
  task->__state from the tracer to suppress TASK_WAKEKILL from waking up
  the tracee. No other code in the kernel does that and it is straight
  forward to update signal_wake_up and friends to make that unnecessary.

  Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
  then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
  on the fact that all stopped states except the special stop states can
  tolerate spurious wake up and recover their state.

  The state of stopped and traced tasked is changed to be stored in
  task->jobctl as well as in task->__state. This makes it possible for
  the freezer to recover tasks in these special states, as well as
  serving as a general cleanup. With a little more work in that
  direction I believe TASK_STOPPED can learn to tolerate spurious wake
  ups and become an ordinary stop state.

  The TASK_TRACED state has to remain a special state as the registers
  for a process are only reliably available when the process is stopped
  in the scheduler. Fundamentally ptrace needs acess to the saved
  register values of a task.

  There are bunch of semi-random ptrace related cleanups that were found
  while looking at these issues.

  One cleanup that deserves to be called out is from commit 57b6de08b5
  ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
  makes a change that is technically user space visible, in the handling
  of what happens to a tracee when a tracer dies unexpectedly. According
  to our testing and our understanding of userspace nothing cares that
  spurious SIGTRAPs can be generated in that case"

* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  ptrace: Always take siglock in ptrace_resume
  ptrace: Don't change __state
  ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
  ptrace: Document that wait_task_inactive can't fail
  ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
  signal: Use lockdep_assert_held instead of assert_spin_locked
  ptrace: Remove arch_ptrace_attach
  ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
  ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
  signal: Replace __group_send_sig_info with send_signal_locked
  signal: Rename send_signal send_signal_locked
2022-06-03 16:13:25 -07:00
Linus Torvalds
6f3f04c190 Merge tag 'sched-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Updates to scheduler metrics:
     - PELT fixes & enhancements
     - PSI fixes & enhancements
     - Refactor cpu_util_without()

 - Updates to instrumentation/debugging:
     - Remove sched_trace_*() helper functions - can be done via debug
       info
     - Fix double update_rq_clock() warnings

 - Introduce & use "preemption model accessors" to simplify some of the
   Kconfig complexity.

 - Make softirq handling RT-safe.

 - Misc smaller fixes & cleanups.

* tag 'sched-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  topology: Remove unused cpu_cluster_mask()
  sched: Reverse sched_class layout
  sched/deadline: Remove superfluous rq clock update in push_dl_task()
  sched/core: Avoid obvious double update_rq_clock warning
  smp: Make softirq handling RT safe in flush_smp_call_function_queue()
  smp: Rename flush_smp_call_function_from_idle()
  sched: Fix missing prototype warnings
  sched/fair: Remove cfs_rq_tg_path()
  sched/fair: Remove sched_trace_*() helper functions
  sched/fair: Refactor cpu_util_without()
  sched/fair: Revise comment about lb decision matrix
  sched/psi: report zeroes for CPU full at the system level
  sched/fair: Delete useless condition in tg_unthrottle_up()
  sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
  sched/fair: Move calculate of avg_load to a better location
  mailmap: Update my email address to @redhat.com
  MAINTAINERS: Add myself as scheduler topology reviewer
  psi: Fix trigger being fired unexpectedly at initial
  ftrace: Use preemption model accessors for trace header printout
  kcsan: Use preemption model accessors
2022-05-24 11:11:13 -07:00
Linus Torvalds
3e2cbc016b Merge tag 'x86_splitlock_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 splitlock updates from Borislav Petkov:

 - Add Raptor Lake to the set of CPU models which support splitlock

 - Make life miserable for apps using split locks by slowing them down
   considerably while the rest of the system remains responsive. The
   hope is it will hurt more and people will really fix their misaligned
   locks apps. As a result, free a TIF bit.

* tag 'x86_splitlock_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/split_lock: Enable the split lock feature on Raptor Lake
  x86/split-lock: Remove unused TIF_SLD bit
  x86/split_lock: Make life miserable for split lockers
2022-05-23 19:24:47 -07:00
Linus Torvalds
1e57930e9f Merge tag 'rcu.2022.05.19a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU update from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes

 - Callback-offloading updates, mainly simplifications

 - RCU-tasks updates, including some -rt fixups, handling of systems
   with sparse CPU numbering, and a fix for a boot-time race-condition
   failure

 - Put SRCU on a memory diet in order to reduce the size of the
   srcu_struct structure

 - Torture-test updates fixing some bugs in tests and closing some
   testing holes

 - Torture-test updates for the RCU tasks flavors, most notably ensuring
   that building rcutorture and friends does not change the
   RCU-tasks-related Kconfig options

 - Torture-test scripting updates

 - Expedited grace-period updates, most notably providing
   milliseconds-scale (not all that) soft real-time response from
   synchronize_rcu_expedited().

   This is also the first time in almost 30 years of RCU that someone
   other than me has pushed for a reduction in the RCU CPU stall-warning
   timeout, in this case by more than three orders of magnitude from 21
   seconds to 20 milliseconds. This tighter timeout applies only to
   expedited grace periods

* tag 'rcu.2022.05.19a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu: Move expedited grace period (GP) work to RT kthread_worker
  rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT
  srcu: Drop needless initialization of sdp in srcu_gp_start()
  srcu: Prevent expedited GPs and blocking readers from consuming CPU
  srcu: Add contention check to call_srcu() srcu_data ->lock acquisition
  srcu: Automatically determine size-transition strategy at boot
  rcutorture: Make torture.sh allow for --kasan
  rcutorture: Make torture.sh refscale and rcuscale specify Tasks Trace RCU
  rcutorture: Make kvm.sh allow more memory for --kasan runs
  torture: Save "make allmodconfig" .config file
  scftorture: Remove extraneous "scf" from per_version_boot_params
  rcutorture: Adjust scenarios' Kconfig options for CONFIG_PREEMPT_DYNAMIC
  torture: Enable CSD-lock stall reports for scftorture
  torture: Skip vmlinux check for kvm-again.sh runs
  scftorture: Adjust for TASKS_RCU Kconfig option being selected
  rcuscale: Allow rcuscale without RCU Tasks Rude/Trace
  rcuscale: Allow rcuscale without RCU Tasks
  refscale: Allow refscale without RCU Tasks Rude/Trace
  refscale: Allow refscale without RCU Tasks
  rcutorture: Allow specifying per-scenario stat_interval
  ...
2022-05-23 11:46:51 -07:00
Peter Zijlstra
31cae1eaae sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
Currently ptrace_stop() / do_signal_stop() rely on the special states
TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
state exists only in task->__state and nowhere else.

There's two spots of bother with this:

 - PREEMPT_RT has task->saved_state which complicates matters,
   meaning task_is_{traced,stopped}() needs to check an additional
   variable.

 - An alternative freezer implementation that itself relies on a
   special TASK state would loose TASK_TRACED/TASK_STOPPED and will
   result in misbehaviour.

As such, add additional state to task->jobctl to track this state
outside of task->__state.

NOTE: this doesn't actually fix anything yet, just adds extra state.

--EWB
  * didn't add a unnecessary newline in signal.h
  * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
    instead of in signal_wake_up_state.  This prevents the clearing
    of TASK_STOPPED and TASK_TRACED from getting lost.
  * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220421150654.757693825@infradead.org
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-12-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-05-11 14:37:06 -05:00
Eric W. Biederman
2500ad1c7f ptrace: Don't change __state
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
command is executing.

Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
implement a new jobctl flag TASK_PTRACE_FROZEN.  This new flag is set
in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
jobctl_unfreeze_task (when ptrace_stop remains asleep).

In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
when the wake up is for a fatal signal.  Skip adding __TASK_TRACED
when TASK_PTRACE_FROZEN is not set.  This has the same effect as
changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
TASK_KILLABLE go through signal_wake_up.

Handle a ptrace_stop being called with a pending fatal signal.
Previously it would have been handled by schedule simply failing to
sleep.  As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
will sleep with a fatal_signal_pending.   The code in signal_wake_up
guarantees that the code will be awaked by any fatal signal that
codes after TASK_TRACED is set.

Previously the __state value of __TASK_TRACED was changed to
TASK_RUNNING when woken up or back to TASK_TRACED when the code was
left in ptrace_stop.  Now when woken up ptrace_stop now clears
JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced
clears JOBCTL_PTRACE_FROZEN.

Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-11 14:35:32 -05:00
Ingo Molnar
d70522fc54 Merge tag 'v5.18-rc5' into sched/core to pull in fixes & to resolve a conflict
- sched/core is on a pretty old -rc1 base - refresh it to include recent fixes.
 - this also allows up to resolve a (trivial) .mailmap conflict

Conflicts:
	.mailmap

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-05-06 10:21:46 +02:00
Thomas Gleixner
d664e39912 sched: Fix missing prototype warnings
A W=1 build emits more than a dozen missing prototype warnings related to
scheduler and scheduler specific includes.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de
2022-05-01 10:03:43 +02:00
Dietmar Eggemann
50e7b416d2 sched/fair: Remove sched_trace_*() helper functions
We no longer need them as we can use DWARF debug info or BTF + pahole to
re-generate the required structs to compile against them for a given
kernel.

This moves the burden of maintaining these helper functions to the
module.

	https://github.com/qais-yousef/sched_tp

Note that pahole v1.15 is required at least for using DWARF. And for BTF
v1.23 which is not yet released will be required. There's alignment
problem that will lead to crashes in earlier versions when used with
BTF.

We should have enough infrastructure to make these helper functions now
obsolete, so remove them.

[Rewrote commit message to reflect the new alternative]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220428144338.479094-2-qais.yousef@arm.com
2022-04-29 11:06:29 +02:00
Tony Luck
b041b525da x86/split_lock: Make life miserable for split lockers
In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas
said:

  Its's simply wishful thinking that stuff gets fixed because of a
  WARN_ONCE(). This has never worked. The only thing which works is to
  make stuff fail hard or slow it down in a way which makes it annoying
  enough to users to complain.

He was talking about WBINVD. But it made me think about how we use the
split lock detection feature in Linux.

Existing code has three options for applications:

 1) Don't enable split lock detection (allow arbitrary split locks)
 2) Warn once when a process uses split lock, but let the process
    keep running with split lock detection disabled
 3) Kill process that use split locks

Option 2 falls into the "wishful thinking" territory that Thomas warns does
nothing. But option 3 might not be viable in a situation with legacy
applications that need to run.

Hence make option 2 much stricter to "slow it down in a way which makes
it annoying".

Primary reason for this change is to provide better quality of service to
the rest of the applications running on the system. Internal testing shows
that even with many processes splitting locks, performance for the rest of
the system is much more responsive.

The new "warn" mode operates like this.  When an application tries to
execute a bus lock the #AC handler.

 1) Delays (interruptibly) 10 ms before moving to next step.

 2) Blocks (interruptibly) until it can get the semaphore
	If interrupted, just return. Assume the signal will either
	kill the task, or direct execution away from the instruction
	that is trying to get the bus lock.
 3) Disables split lock detection for the current core
 4) Schedules a work queue to re-enable split lock detect in 2 jiffies
 5) Returns

The work queue that re-enables split lock detection also releases the
semaphore.

There is a corner case where a CPU may be taken offline while split lock
detection is disabled. A CPU hotplug handler handles this case.

Old behaviour was to only print the split lock warning on the first
occurrence of a split lock from a task. Preserve that by adding a flag to
the task structure that suppresses subsequent split lock messages from that
task.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com
2022-04-27 15:43:38 +02:00
Nico Pache
e4a38402c3 oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper.  This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.

A race can occur between exit_mm and the oom reaper that allows the oom
reaper to free the memory of the futex robust list before the exit path
has handled the futex death:

    CPU1                               CPU2
    --------------------------------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper
                                        oom_reaper
                                        oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

If the get_user EFAULT's, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.

Delay the OOM reaper, allowing more time for the exit path to perform
the futex cleanup.

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

Based on a patch by Michal Hocko.

Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21 20:01:10 -07:00
Valentin Schneider
cfe43f478b preempt/dynamic: Introduce preemption model accessors
CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either:
o The build-time preemption model when !PREEMPT_DYNAMIC
o The default boot-time preemption model when PREEMPT_DYNAMIC

IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual
model could have been set to something else by the "preempt=foo" cmdline
parameter. Same problem applies to CONFIG_PREEMPTION.

Introduce a set of helpers to determine the actual preemption model used by
the live kernel.

Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20211112185203.280040-3-valentin.schneider@arm.com
2022-04-05 10:24:42 +02:00
Thomas Gleixner
7dd5ad2d3e Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"
Revert commit bf9ad37dc8. It needs to be better encapsulated and
generalized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2022-03-31 10:36:55 +02:00
Linus Torvalds
169e77764a Merge tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "The sprinkling of SPI drivers is because we added a new one and Mark
  sent us a SPI driver interface conversion pull request.

  Core
  ----

   - Introduce XDP multi-buffer support, allowing the use of XDP with
     jumbo frame MTUs and combination with Rx coalescing offloads (LRO).

   - Speed up netns dismantling (5x) and lower the memory cost a little.
     Remove unnecessary per-netns sockets. Scope some lists to a netns.
     Cut down RCU syncing. Use batch methods. Allow netdev registration
     to complete out of order.

   - Support distinguishing timestamp types (ingress vs egress) and
     maintaining them across packet scrubbing points (e.g. redirect).

   - Continue the work of annotating packet drop reasons throughout the
     stack.

   - Switch netdev error counters from an atomic to dynamically
     allocated per-CPU counters.

   - Rework a few preempt_disable(), local_irq_save() and busy waiting
     sections problematic on PREEMPT_RT.

   - Extend the ref_tracker to allow catching use-after-free bugs.

  BPF
  ---

   - Introduce "packing allocator" for BPF JIT images. JITed code is
     marked read only, and used to be allocated at page granularity.
     Custom allocator allows for more efficient memory use, lower iTLB
     pressure and prevents identity mapping huge pages from getting
     split.

   - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
     the correct probe read access method, add appropriate helpers.

   - Convert the BPF preload to use light skeleton and drop the
     user-mode-driver dependency.

   - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
     its use as a packet generator.

   - Allow local storage memory to be allocated with GFP_KERNEL if
     called from a hook allowed to sleep.

   - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
     bits to come later).

   - Add unstable conntrack lookup helpers for BPF by using the BPF
     kfunc infra.

   - Allow cgroup BPF progs to return custom errors to user space.

   - Add support for AF_UNIX iterator batching.

   - Allow iterator programs to use sleepable helpers.

   - Support JIT of add, and, or, xor and xchg atomic ops on arm64.

   - Add BTFGen support to bpftool which allows to use CO-RE in kernels
     without BTF info.

   - Large number of libbpf API improvements, cleanups and deprecations.

  Protocols
  ---------

   - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.

   - Adjust TSO packet sizes based on min_rtt, allowing very low latency
     links (data centers) to always send full-sized TSO super-frames.

   - Make IPv6 flow label changes (AKA hash rethink) more configurable,
     via sysctl and setsockopt. Distinguish between server and client
     behavior.

   - VxLAN support to "collect metadata" devices to terminate only
     configured VNIs. This is similar to VLAN filtering in the bridge.

   - Support inserting IPv6 IOAM information to a fraction of frames.

   - Add protocol attribute to IP addresses to allow identifying where
     given address comes from (kernel-generated, DHCP etc.)

   - Support setting socket and IPv6 options via cmsg on ping6 sockets.

   - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
     Define dscp_t and stop taking ECN bits into account in fib-rules.

   - Add support for locked bridge ports (for 802.1X).

   - tun: support NAPI for packets received from batched XDP buffs,
     doubling the performance in some scenarios.

   - IPv6 extension header handling in Open vSwitch.

   - Support IPv6 control message load balancing in bonding, prevent
     neighbor solicitation and advertisement from using the wrong port.
     Support NS/NA monitor selection similar to existing ARP monitor.

   - SMC
      - improve performance with TCP_CORK and sendfile()
      - support auto-corking
      - support TCP_NODELAY

   - MCTP (Management Component Transport Protocol)
      - add user space tag control interface
      - I2C binding driver (as specified by DMTF DSP0237)

   - Multi-BSSID beacon handling in AP mode for WiFi.

   - Bluetooth:
      - handle MSFT Monitor Device Event
      - add MGMT Adv Monitor Device Found/Lost events

   - Multi-Path TCP:
      - add support for the SO_SNDTIMEO socket option
      - lots of selftest cleanups and improvements

   - Increase the max PDU size in CAN ISOTP to 64 kB.

  Driver API
  ----------

   - Add HW counters for SW netdevs, a mechanism for devices which
     offload packet forwarding to report packet statistics back to
     software interfaces such as tunnels.

   - Select the default NIC queue count as a fraction of number of
     physical CPU cores, instead of hard-coding to 8.

   - Expose devlink instance locks to drivers. Allow device layer of
     drivers to use that lock directly instead of creating their own
     which always runs into ordering issues in devlink callbacks.

   - Add header/data split indication to guide user space enabling of
     TCP zero-copy Rx.

   - Allow configuring completion queue event size.

   - Refactor page_pool to enable fragmenting after allocation.

   - Add allocation and page reuse statistics to page_pool.

   - Improve Multiple Spanning Trees support in the bridge to allow
     reuse of topologies across VLANs, saving HW resources in switches.

   - DSA (Distributed Switch Architecture):
      - replay and offload of host VLAN entries
      - offload of static and local FDB entries on LAG interfaces
      - FDB isolation and unicast filtering

  New hardware / drivers
  ----------------------

   - Ethernet:
      - LAN937x T1 PHYs
      - Davicom DM9051 SPI NIC driver
      - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
      - Microchip ksz8563 switches
      - Netronome NFP3800 SmartNICs
      - Fungible SmartNICs
      - MediaTek MT8195 switches

   - WiFi:
      - mt76: MediaTek mt7916
      - mt76: MediaTek mt7921u USB adapters
      - brcmfmac: Broadcom BCM43454/6

   - Mobile:
      - iosm: Intel M.2 7360 WWAN card

  Drivers
  -------

   - Convert many drivers to the new phylink API built for split PCS
     designs but also simplifying other cases.

   - Intel Ethernet NICs:
      - add TTY for GNSS module for E810T device
      - improve AF_XDP performance
      - GTP-C and GTP-U filter offload
      - QinQ VLAN support

   - Mellanox Ethernet NICs (mlx5):
      - support xdp->data_meta
      - multi-buffer XDP
      - offload tc push_eth and pop_eth actions

   - Netronome Ethernet NICs (nfp):
      - flow-independent tc action hardware offload (police / meter)
      - AF_XDP

   - Other Ethernet NICs:
      - at803x: fiber and SFP support
      - xgmac: mdio: preamble suppression and custom MDC frequencies
      - r8169: enable ASPM L1.2 if system vendor flags it as safe
      - macb/gem: ZynqMP SGMII
      - hns3: add TX push mode
      - dpaa2-eth: software TSO
      - lan743x: multi-queue, mdio, SGMII, PTP
      - axienet: NAPI and GRO support

   - Mellanox Ethernet switches (mlxsw):
      - source and dest IP address rewrites
      - RJ45 ports

   - Marvell Ethernet switches (prestera):
      - basic routing offload
      - multi-chain TC ACL offload

   - NXP embedded Ethernet switches (ocelot & felix):
      - PTP over UDP with the ocelot-8021q DSA tagging protocol
      - basic QoS classification on Felix DSA switch using dcbnl
      - port mirroring for ocelot switches

   - Microchip high-speed industrial Ethernet (sparx5):
      - offloading of bridge port flooding flags
      - PTP Hardware Clock

   - Other embedded switches:
      - lan966x: PTP Hardward Clock
      - qca8k: mdio read/write operations via crafted Ethernet packets

   - Qualcomm 802.11ax WiFi (ath11k):
      - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
      - enable RX PPDU stats in monitor co-exist mode

   - Intel WiFi (iwlwifi):
      - UHB TAS enablement via BIOS
      - band disablement via BIOS
      - channel switch offload
      - 32 Rx AMPDU sessions in newer devices

   - MediaTek WiFi (mt76):
      - background radar detection
      - thermal management improvements on mt7915
      - SAR support for more mt76 platforms
      - MBSSID and 6 GHz band on mt7915

   - RealTek WiFi:
      - rtw89: AP mode
      - rtw89: 160 MHz channels and 6 GHz band
      - rtw89: hardware scan

   - Bluetooth:
      - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)

   - Microchip CAN (mcp251xfd):
      - multiple RX-FIFOs and runtime configurable RX/TX rings
      - internal PLL, runtime PM handling simplification
      - improve chip detection and error handling after wakeup"

* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
  llc: fix netdevice reference leaks in llc_ui_bind()
  drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
  ice: don't allow to run ice_send_event_to_aux() in atomic ctx
  ice: fix 'scheduling while atomic' on aux critical err interrupt
  net/sched: fix incorrect vlan_push_eth dest field
  net: bridge: mst: Restrict info size queries to bridge ports
  net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
  drivers: net: xgene: Fix regression in CRC stripping
  net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
  net: dsa: fix missing host-filtered multicast addresses
  net/mlx5e: Fix build warning, detected write beyond size of field
  iwlwifi: mvm: Don't fail if PPAG isn't supported
  selftests/bpf: Fix kprobe_multi test.
  Revert "rethook: x86: Add rethook x86 implementation"
  Revert "arm64: rethook: Add arm64 rethook implementation"
  Revert "powerpc: Add rethook support"
  Revert "ARM: rethook: Add rethook arm implementation"
  netdevice: add missing dm_private kdoc
  net: bridge: mst: prevent NULL deref in br_mst_info_size()
  selftests: forwarding: Use same VRF for port and VLAN upper
  ...
2022-03-24 13:13:26 -07:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Hugh Dickins
b698f0a177 mm/fs: delete PF_SWAPWRITE
PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8e ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").

Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE.  But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.

Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Jan Kara <jack@suse.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:08 -07:00
Linus Torvalds
3fe2f7446f Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Cleanups for SCHED_DEADLINE

 - Tracing updates/fixes

 - CPU Accounting fixes

 - First wave of changes to optimize the overhead of the scheduler
   build, from the fast-headers tree - including placeholder *_api.h
   headers for later header split-ups.

 - Preempt-dynamic using static_branch() for ARM64

 - Isolation housekeeping mask rework; preperatory for further changes

 - NUMA-balancing: deal with CPU-less nodes

 - NUMA-balancing: tune systems that have multiple LLC cache domains per
   node (eg. AMD)

 - Updates to RSEQ UAPI in preparation for glibc usage

 - Lots of RSEQ/selftests, for same

 - Add Suren as PSI co-maintainer

* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  sched/headers: ARM needs asm/paravirt_api_clock.h too
  sched/numa: Fix boot crash on arm64 systems
  headers/prep: Fix header to build standalone: <linux/psi.h>
  sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
  cgroup: Fix suspicious rcu_dereference_check() usage warning
  sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
  sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
  sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
  sched/deadline,rt: Remove unused functions for !CONFIG_SMP
  sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
  sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
  sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
  sched/deadline: Remove unused def_dl_bandwidth
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
  sched/cpuacct: Remove redundant RCU read lock
  sched/cpuacct: Optimize away RCU read lock
  sched/cpuacct: Fix charge percpu cpuusage
  sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
  ...
2022-03-22 14:39:12 -07:00
Linus Torvalds
bba90e0964 Merge tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core process handling RT latency updates from Thomas Gleixner:

 - Reduce the amount of work to release a task stack in context switch.
   There is no real reason to do cgroup accounting and memory freeing in
   this performance sensitive context.

   Aside of this the invoked functions cannot be called from this
   preemption disabled context on PREEMPT_RT enabled kernels. Solve this
   by moving the accounting into do_exit() and delaying the freeing of
   the stack unless the vmap stack can be cached.

 - Provide a mechanism to delay raising signals from atomic context on
   PREEMPT_RT enabled kernels as sighand::lock cannot be acquired. Store
   the information in the task struct and raise it in the exit path.

* tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  signal, x86: Delay calling signals in atomic on RT enabled kernels
  fork: Use IS_ENABLED() in account_kernel_stack()
  fork: Only cache the VMAP stack in finish_task_switch()
  fork: Move task stack accounting to do_exit()
  fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK
  fork: Don't assign the stack pointer in dup_task_struct()
  fork, IA64: Provide alloc_thread_stack_node() for IA64
  fork: Duplicate task_struct before stack allocation
  fork: Redo ifdefs around task stack handling
2022-03-21 12:37:33 -07:00
Masami Hiramatsu
54ecbe6f1e rethook: Add a generic return hook
Add a return hook framework which hooks the function return. Most of the
logic came from the kretprobe, but this is independent from kretprobe.

Note that this is expected to be used with other function entry hooking
feature, like ftrace, fprobe, adn kprobes. Eventually this will replace
the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
this is just an additional hook.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735285066.1084943.9259661137330166643.stgit@devnote2
2022-03-17 20:16:29 -07:00
Oleg Nesterov
bf9ad37dc8 signal, x86: Delay calling signals in atomic on RT enabled kernels
On x86_64 we must disable preemption before we enable interrupts
for stack faults, int3 and debugging, because the current task is using
a per CPU debug stack defined by the IST. If we schedule out, another task
can come in and use the same stack and cause the stack to be corrupted
and crash the kernel on return.

When CONFIG_PREEMPT_RT is enabled, spinlock_t locks become sleeping, and
one of these is the spin lock used in signal handling.

Some of the debug code (int3) causes do_trap() to send a signal.
This function calls a spinlock_t lock that has been converted to a
sleeping lock. If this happens, the above issues with the corrupted
stack is possible.

Instead of calling the signal right away, for PREEMPT_RT and x86,
the signal information is stored on the stacks task_struct and
TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume
code will send the signal when preemption is enabled.

[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to
  ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]
[bigeasy: Add on 32bit as per Yang Shi, minor rewording. ]
[ tglx: Use a config option ]

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/Ygq5aBB/qMQw6aP5@linutronix.de
2022-03-04 14:58:54 +01:00
Valentin Schneider
25795ef629 sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.

The actual state is saved in p->saved_state, but reading it after reading
p->__state has a few issues:
o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
o ttwu_state_match() might have changed that to TASK_RUNNING

As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
exposing a new state to userspace tools which way not know what to do with
them. The only information that needs to be conveyed here is that a task is
waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
need for a new state.

Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-3-valentin.schneider@arm.com
2022-03-01 16:18:39 +01:00
Valentin Schneider
fa2c3254d7 sched/tracing: Don't re-read p->state when emitting sched_switch event
As of commit

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

the following sequence becomes possible:

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.

Prevent this by pushing the value read from __schedule() down the trace
event.

Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
2022-03-01 16:18:39 +01:00
Ingo Molnar
6255b48aeb Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts
New conflicts in sched/core due to the following upstream fixes:

  44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
  a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")

Conflicts:
	include/linux/psi_types.h
	kernel/sched/psi.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-21 11:53:51 +01:00
Mark Rutland
99cf983cc8 sched/preempt: Add PREEMPT_DYNAMIC using static keys
Where an architecture selects HAVE_STATIC_CALL but not
HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
which will either branch to a callee or return to the caller.

On such architectures, a number of constraints can conspire to make
those trampolines more complicated and potentially less useful than we'd
like. For example:

* Hardware and software control flow integrity schemes can require the
  addition of "landing pad" instructions (e.g. `BTI` for arm64), which
  will also be present at the "real" callee.

* Limited branch ranges can require that trampolines generate or load an
  address into a register and perform an indirect branch (or at least
  have a slow path that does so). This loses some of the benefits of
  having a direct branch.

* Interaction with SW CFI schemes can be complicated and fragile, e.g.
  requiring that we can recognise idiomatic codegen and remove
  indirections understand, at least until clang proves more helpful
  mechanisms for dealing with this.

For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
really only need to enable/disable specific preemption functions. We can
achieve the same effect without a number of the pain points above by
using static keys to fold early returns into the preemption functions
themselves rather than in an out-of-line trampoline, effectively
inlining the trampoline into the start of the function.

For arm64, this results in good code generation. For example, the
dynamic_cond_resched() wrapper looks as follows when enabled. When
disabled, the first `B` is replaced with a `NOP`, resulting in an early
return.

| <dynamic_cond_resched>:
|        bti     c
|        b       <dynamic_cond_resched+0x10>     // or `nop`
|        mov     w0, #0x0
|        ret
|        mrs     x0, sp_el0
|        ldr     x0, [x0, #8]
|        cbnz    x0, <dynamic_cond_resched+0x8>
|        paciasp
|        stp     x29, x30, [sp, #-16]!
|        mov     x29, sp
|        bl      <preempt_schedule_common>
|        mov     w0, #0x1
|        ldp     x29, x30, [sp], #16
|        autiasp
|        ret

... compared to the regular form of the function:

| <__cond_resched>:
|        bti     c
|        mrs     x0, sp_el0
|        ldr     x1, [x0, #8]
|        cbz     x1, <__cond_resched+0x18>
|        mov     w0, #0x0
|        ret
|        paciasp
|        stp     x29, x30, [sp, #-16]!
|        mov     x29, sp
|        bl      <preempt_schedule_common>
|        mov     w0, #0x1
|        ldp     x29, x30, [sp], #16
|        autiasp
|        ret

Any architecture which implements static keys should be able to use this
to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
calls. Since this is likely to have greater overhead than (inlined)
static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
HAVE_PREEMPT_DYNAMIC_CALL is selected.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
2022-02-19 11:11:08 +01:00
Peter Zijlstra
a3d29e8291 sched: Define and initialize a flag to identify valid PASID in the task
Add a new single bit field to the task structure to track whether this task
has initialized the IA32_PASID MSR to the mm's PASID.

Initialize the field to zero when creating a new task with fork/clone.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com
2022-02-15 11:31:43 +01:00